Predicting Business Loan Arrears: An Exploratory and Inferential Analytics Study of a Nigerian Commercial Lending Portfolio (May 2025 – April 2026)
Author
[Anurika Orabuche]
Published
May 18, 2026
1. Executive Summary
This study analyses 629 business loans disbursed by a Nigerian commercial lending institution between May 2025 and April 2026, representing a total portfolio value of ₦38.93 billion over a twelve-month period. Of these, 331 loans remain open (Active or In Arrears) at the time of data extraction — their absence of a closure date reflects current open status, not a data entry error. With 76 loans (12.08%, totalling approximately ₦2.94 billion) currently in arrears, credit risk management has become a central operational priority as monthly disbursements grew from ₦2.07 billion in August 2025 to ₦6.03 billion in April 2026.
Five analytical techniques are applied to this real dataset, which introduces Number of Installments as a key new variable. Exploratory data analysis reveals that loan amounts are severely right-skewed (median ₦35M vs mean ₦61.9M) and identifies data quality issues including activation timestamps with time-of-day components. Data visualisation tells a coherent portfolio story: arrears concentrate in standard Business Loans (14.2%), loans with 12 installments (33.3%), and facilities below ₦10M (42.9%). Hypothesis testing confirms that interest rates differ significantly across loan products (Kruskal-Wallis p < 0.001), and delivers an important null result — loan lifecycle stage is not associated with arrears (p = 0.547) — preventing a misguided policy intervention. Correlation analysis uncovers a strong negative rate–amount relationship (ρ = −0.563) and a meaningful installments–rate correlation (ρ = +0.372). Logistic regression identifies Number of Installments (OR = 1.32, p < 0.001) and log Loan Amount (OR = 0.64, p = 0.015) as the two statistically significant pre-disbursement predictors of arrears, producing a model AUC of 0.668.
The primary recommendation is that the organisation introduce a pre-disbursement review trigger for loans with nine or more installments and loans below ₦10M — the two most concentrated arrears risk dimensions identified across all five techniques.
2. Professional Disclosure
Job Title: [Head of Internal Control and Audit] Organisation Type: Nigerian Commercial Lending Institution Sector: Financial Services / Business Credit
Technique Justifications
Exploratory Data Analysis (EDA): In my role I review loan application files and monthly portfolio performance reports. Before drawing any conclusions I must audit the quality and shape of the data. This dataset contains activation timestamps with time-of-day precision and a severely right-skewed loan amount distribution — both of which require deliberate pre-processing decisions. Critically, the dataset’s missing closure dates reflect open loans rather than data entry errors; documenting this correctly prevents misinterpretation of 331 still-active facilities as missing data.
Data Visualisation: Credit committee presentations and board risk briefings demand visual storytelling. A chart showing that 12-installment loans carry a 33.3% arrears rate — nearly three times the portfolio average — communicates risk concentration in one glance. In my day-to-day role, monthly portfolio dashboards are the primary briefing tool for senior leadership, and the five-plot narrative here directly mirrors that professional context.
Hypothesis Testing: Before recommending policy changes I must establish statistical rigour. The null result — that loan lifecycle stage does not predict arrears (p = 0.547) — is as valuable as a significant result; it prevents the organisation from introducing an approval friction point (screening by New/Renewal/Top Up status) that has no evidential basis in this data.
Correlation Analysis: Understanding how loan size, pricing, and installment structure relate to one another is central to every credit strategy discussion I participate in. The positive correlation between installment count and interest rate (ρ = +0.372) reveals that our product design already partially embeds installment risk — but this has not yet been translated into an explicit pre-disbursement flag.
Logistic Regression: The operational goal is a pre-disbursement risk score: given the observable characteristics of a loan at approval time, what is the probability it falls into arrears? The model’s installment coefficient (OR = 1.32 per additional installment) and amount coefficient (OR = 0.64 per log-unit increase) provide two concrete rules that can be embedded in the loan approval workflow today.
3. Data Collection & Sampling
3.1 Source and Collection Method
The dataset was extracted from the organisation’s internal Loan Management System (LMS), which records all loan approvals, disbursements, and repayment events. The extract was generated via a direct database query covering all loans with an Activation Date between 1 May 2025 and 30 April 2026, providing a clean twelve-month analytical window. Data was exported as a Microsoft Excel file and loaded into R for analysis. Loan identifiers have been replaced with neutral sequential codes (Loan_001, Loan_002, …) and the organisation name is withheld; no other data has been modified. No external or publicly available datasets supplement this analysis.
3.2 Sampling Frame
The sampling frame is the complete population of business loans disbursed during the twelve-month study window — a census of all 629 loan records activated in the LMS between May 2025 and April 2026. No random sampling was required; all disbursed loans within the date window are included, making the analysis free of sampling bias at the data collection stage.
Date loan was fully repaid — NA means the loan is still open
account_state
Categorical
Closed, Active, or In Arrears
loan_name
Categorical
Product name (4 types)
loan_amount
Numeric (₦)
Principal disbursed
interest_rate
Numeric (%)
Monthly interest rate
num_installments
Integer
Number of scheduled repayment installments
loan_type
Categorical
Lifecycle stage: New, Renewal, Top Up
duration_days
Numeric
Days from activation to closure (NA = still open)
days_to_first_repayment
Numeric
Days from activation to first repayment (derived)
in_arrears
Binary 0/1
1 if account_state == “In Arrears” (derived outcome)
loan_open
Binary 0/1
1 if the loan is still open — Active or In Arrears (derived)
3.4 Note on Missing Closed Dates
A missing closed_date does not indicate a data quality problem. It means the loan is still open at the time of extraction — either Active (repayments ongoing) or In Arrears (repayments overdue). Of the 629 loans, 298 are Closed (have a recorded closure date) and 331 are still open (255 Active + 76 In Arrears). This is correctly reflected in the loan_open variable and in all analyses throughout this document.
3.5 Time Period and Ethics
Period: 1 May 2025 to 30 April 2026 (12 months). All data relates to corporate business loan accounts; no personally identifiable information (PII) relating to natural persons is included. Loan identifiers have been anonymised. The organisation name is withheld to protect commercial confidentiality. Analysis was conducted with the knowledge and approval of the relevant department head. Data is available on request from the author.
Exploratory Data Analysis (EDA), formalised by Tukey (1977), uses statistical summaries and graphical displays to understand a dataset’s structure before applying formal models. Key activities include distributional assessment, outlier detection, missing-value classification, and data quality auditing. Adi (2026) illustrates with Anscombe’s Quartet that datasets with identical summary statistics can exhibit radically different structures when visualised — making combined numeric and visual EDA indispensable before any inferential work.
5.2 Business Justification
A data audit is the first step in every portfolio review I conduct. Two issues in this dataset require deliberate decisions: activation timestamps include time-of-day components that must be truncated for monthly aggregation, and missing closure dates must be correctly interpreted as open loans rather than data entry errors — a distinction with significant implications for portfolio monitoring.
# Demonstrate that missing closed_date = still openopen_status <- df |>mutate(closed_date_present =!is.na(closed_date)) |>count(account_state, closed_date_present) |>arrange(account_state)kable(open_status,col.names =c("Account State","Has Closed Date","Count"),caption ="Table 4: Closed date presence by account state — confirms missing = still open") |>kable_styling(bootstrap_options =c("striped","hover"), full_width =FALSE) |>row_spec(which(!open_status$closed_date_present), background ="#FFF3CD")
Table 4: Closed date presence by account state — confirms missing = still open
Account State
Has Closed Date
Count
Active
FALSE
255
Closed
TRUE
298
In Arrears
FALSE
76
Code
library(naniar)df |>miss_var_summary() |>filter(n_miss >0) |>mutate(interpretation =case_when( variable =="closed_date"~"Expected — loan still open (Active or In Arrears)", variable =="duration_days"~"Expected — derived from closed_date; NA for open loans",TRUE~"Review required" )) |>kable(digits =2,caption ="Table 5: Missing value summary with interpretation") |>kable_styling(bootstrap_options =c("striped","hover"), full_width =FALSE)
Table 5: Missing value summary with interpretation
variable
n_miss
pct_miss
interpretation
closed_date
331
52.6
Expected — loan still open (Active or In Arrears)
duration_days
331
52.6
Expected — derived from closed_date; NA for open loans
Code
vis_miss(df |>select(loan_amount, interest_rate, num_installments, days_to_first_repayment, closed_date, duration_days)) +labs(title ="Figure 1: Missing value map",subtitle ="closed_date and duration_days are missing for all 331 still-open loans — not a data quality problem" ) +theme_minimal(base_size =11)
Timestamps stored with HH:MM:SS precision (e.g. “2025-10-04 21:17:46”)
Applied floor_date(activation_date, "month") in R; dt.to_period("M") in Python
2
Missing Closed Date (331 / 52.6%)
No closure date recorded for Active (255) and In Arrears (76) loans
Correctly interpreted as still-open loans, not missing data. Confirmed by cross-checking with account_state. Documented via loan_open variable.
No Loan Type encoding issues. Loan Type contains exactly three clean, consistent categories — New (296), Renewal (213), Top Up (120). No standardisation required.
5.5 Business Interpretation
Three findings have direct operational consequences. First, the timestamp issue is a reporting system artefact requiring IT attention — loan activation timestamps should record date only, not server time with HH:MM:SS precision, to avoid confusion in monthly reports. Second, the right-skewed loan amount distribution (skewness = 1.943) means the mean (₦61.9M) overstates the typical facility by 77% relative to the median (₦35M) — any board report using average loan size will misrepresent the portfolio’s typical credit exposure. Third, the Number of Installments distribution is concentrated at 6 (73.9% of loans), with risk-significant long tails: 12-installment loans carry a 33.3% arrears rate and 9-installment loans carry 20.7% — both well above the 12.1% portfolio average. These high-installment facilities represent a disproportionate share of credit risk despite their small volume share.
6. Technique 2 — Data Visualisation
6.1 Theory
The grammar of graphics (Wilkinson, 2005), implemented in R’s ggplot2, provides a principled framework for constructing statistical graphics by mapping data attributes to visual channels — position, colour, size, and shape. Effective business visualisation requires matching chart type to the data relationship being shown, minimising non-data ink, and ensuring the key message is legible to non-specialist audiences (Adi, 2026, Ch. 5). The five plots below form a deliberate narrative arc — from portfolio growth, through structural composition, to risk concentration.
6.2 Business Justification
Risk committee briefings depend on visual clarity. A chart showing that 12-installment loans carry a 33.3% arrears rate communicates risk concentration immediately. In my professional role, monthly portfolio dashboards are the primary briefing tool for senior leadership, and these chart choices directly reflect that context.
6.3 Five-Plot Visual Narrative
These five plots tell one connected story: the portfolio has grown consistently and is broadly healthy — but a specific risk concentration defined by installment count and facility size demands targeted underwriting attention.
df_monthly <- df |>group_by(month) |>summarise(count =n(), total_bn =sum(loan_amount)/1e9, .groups ="drop")ggplot(df_monthly, aes(x = month)) +geom_col(aes(y = total_bn), fill ="#378ADD", alpha =0.85, width =20) +geom_line(aes(y = count/15), colour ="#E24B4A",linewidth =1.1, group =1) +geom_point(aes(y = count/15), colour ="#E24B4A", size =2.8) +scale_y_continuous(name ="Total disbursed (₦ billion)",sec.axis =sec_axis(~.*15, name ="Number of loans") ) +scale_x_date(date_labels ="%b %Y", date_breaks ="1 month") +labs(title ="Figure 3: Monthly loan disbursements — May 2025 to April 2026",subtitle ="Bars = total value (₦B, left axis) | Line = loan count (right axis)",x =NULL ) +theme_minimal(base_size =11) +theme(axis.text.x =element_text(angle =35, hjust =1))
Code
ggplot(df, aes(x = loan_name_short, y = loan_amount/1e6,fill = loan_name_short)) +geom_violin(alpha =0.55, trim =TRUE, colour ="white") +geom_jitter(aes(colour =factor(in_arrears)),width =0.12, size =1.6, alpha =0.8) +scale_fill_manual(values =c("Business Loan Std"="#378ADD","LPO Financing"="#EF9F27","BL Quarterly Deferred"="#1D9E75","BL Deferred Principal"="#A32D2D")) +scale_colour_manual(values =c("0"="#888780","1"="#E24B4A"),labels =c("0"="Performing","1"="In Arrears"),name ="Status") +labs(title ="Figure 4: Loan amount and arrears by product type",subtitle ="Red dots = In Arrears | BL Deferred Principal: 0% arrears across all 32 loans",x ="Product", y ="Loan amount (₦M)" ) +coord_flip() +theme_minimal(base_size =11) +theme(legend.position ="bottom")
Code
ggplot(df, aes(x = interest_rate, fill = loan_name_short)) +geom_density(alpha =0.55, colour ="white") +scale_fill_manual(values =c("Business Loan Std"="#378ADD","LPO Financing"="#EF9F27","BL Quarterly Deferred"="#1D9E75","BL Deferred Principal"="#A32D2D"),name ="Product") +labs(title ="Figure 5: Interest rate density by product type",subtitle ="Business Loan Std clusters at 6.38–6.83% | LPO and Deferred products: 3.5–5%",x ="Monthly interest rate (%)", y ="Density" ) +theme_minimal(base_size =11) +theme(legend.position ="bottom")
Code
arrears_install <- df |>filter(num_installments %in%c(1,2,3,6,9,12)) |>group_by(num_installments) |>summarise(n=n(), rate=mean(in_arrears), .groups="drop")portfolio_avg <-mean(df$in_arrears)ggplot(arrears_install, aes(x =factor(num_installments), y = rate,fill = rate > portfolio_avg)) +geom_col(width =0.6, alpha =0.85) +geom_hline(yintercept = portfolio_avg,linetype ="dashed", colour ="#888780", linewidth =0.8) +geom_text(aes(label =paste0(round(rate*100,1), "%\nn=", n)),vjust =-0.25, size =3.2) +scale_fill_manual(values =c("FALSE"="#1D9E75","TRUE"="#E24B4A"),guide ="none") +scale_y_continuous(labels = percent, limits =c(0, 0.45)) +annotate("text", x =0.65, y = portfolio_avg +0.02,label =glue("Portfolio avg {percent(portfolio_avg, 0.1)}"),size =3, colour ="#888780") +labs(title ="Figure 6: Arrears rate by number of installments",subtitle ="12-installment loans: 33.3% arrears — nearly 3× the portfolio average of 12.1%",x ="Number of installments", y ="Arrears rate" ) +theme_minimal(base_size =11)
Code
ggplot(df, aes(x = loan_amount/1e6, y = interest_rate,colour =factor(in_arrears))) +geom_point(alpha =0.55, size =1.8) +geom_smooth(data =filter(df, in_arrears==0), method ="lm", se =TRUE,colour ="#378ADD", fill ="#B5D4F4", linewidth =0.8) +scale_colour_manual(values =c("0"="#378ADD","1"="#E24B4A"),labels =c("0"="Performing","1"="In Arrears"),name ="Status") +scale_x_log10(labels = comma) +labs(title ="Figure 7: Loan amount vs interest rate — performing vs arrears",subtitle ="Log scale. Arrears loans cluster at smaller amounts. Trend line = performing loans only.",x ="Loan amount (₦M, log scale)", y ="Monthly rate (%)" ) +theme_minimal(base_size =11) +theme(legend.position ="bottom")
Code
import seaborn as snsimport matplotlib.pyplot as pltfig, axes = plt.subplots(1, 2, figsize=(13, 4))arr_i = (df_py[df_py["num_installments"].isin([1,2,3,6,9,12])] .groupby("num_installments")["in_arrears"] .agg(["mean","count"]).reset_index())colours = ["#E24B4A"if v > df_py["in_arrears"].mean() else"#1D9E75"for v in arr_i["mean"]]axes[0].bar(arr_i["num_installments"].astype(str), arr_i["mean"]*100, color=colours, alpha=0.85)axes[0].axhline(df_py["in_arrears"].mean()*100, linestyle="--", color="#888780", linewidth=1, label="Portfolio avg")axes[0].set_title("Arrears rate by installments (%)", fontsize=11)axes[0].set_xlabel("Installments"); axes[0].legend(fontsize=9)arr_b = (df_py.groupby("amount_bin", observed=True)["in_arrears"] .agg(["mean","count"]).reset_index())axes[1].bar(arr_b["amount_bin"].astype(str), arr_b["mean"]*100, color="#E24B4A", alpha=0.75)axes[1].set_title("Arrears rate by loan amount band (%)", fontsize=11)axes[1].set_xlabel("Amount band"); axes[1].set_ylabel("Arrears rate (%)")plt.tight_layout()plt.savefig("viz_py.png", dpi=150, bbox_inches="tight")plt.show()
6.4 Business Interpretation
The five plots build one strategic argument. Figure 3 establishes sustained portfolio growth — disbursements rose from ₦2.07B in August 2025 to ₦6.03B in April 2026, with loan counts growing from 35 to 87 in the same window. Figure 4 reveals the most striking product-level finding: all 32 BL Deferred Principal loans are performing (0% arrears), while Business Loan Std (n=471) carries the bulk of portfolio risk. Figure 5 confirms clear tiered pricing, with rates separating cleanly by product type. Figure 6 delivers the most actionable operational finding: 12-installment loans carry a 33.3% arrears rate and 9-installment loans carry 20.7% — both nearly three times the portfolio average. Figure 7 shows that arrears loans cluster in the lower-amount region, consistent with the logistic regression’s amount coefficient.
7. Technique 3 — Hypothesis Testing
7.1 Theory
Hypothesis testing provides a formal framework for distinguishing real effects from random variation. The analyst specifies H₀ (no effect) and H₁ (an effect exists), selects a test appropriate to the data type and distributional assumptions, and evaluates both p-value and effect size. Equally important: a null result is not a failure — it is evidence that a suspected pattern does not exist, and it prevents the analyst from recommending policy changes based on noise (Adi, 2026, Ch. 6).
7.2 Business Justification
Two questions carry direct policy implications: (1) Do interest rates genuinely differ across loan products? (2) Is loan lifecycle stage associated with arrears? Confirming question one validates the organisation’s tiered pricing policy. The null result on question two — that New, Renewal, and Top Up loans have statistically identical arrears rates — prevents an approval process change that would create friction without improving risk detection.
7.3 Hypothesis 1 — Do Interest Rates Differ Across Loan Product Types?
H₀: The distribution of monthly interest rates is identical across all four loan product categories. H₁: At least one product type has a significantly different interest rate distribution. Test: Kruskal-Wallis — normality violated (Shapiro-Wilk: Amount W=0.752, p<0.001; Rate W=0.877, p<0.001). α = 0.05
from scipy import stats as scgroups = [g["interest_rate"].values for _,g in df_py.groupby("loan_name_short")]h, p = sc.kruskal(*groups)print(f"Kruskal-Wallis (Rate ~ Product): H={h:.3f}, p={p:.2e}")
count mean median std
loan_name_short
BL Deferred Principal 32 4.087 3.500 1.241
BL Quarterly Deferred 45 5.158 4.935 0.665
Business Loan Std 471 6.621 6.380 0.770
LPO Financing 81 4.254 4.200 0.691
Result: χ²(3) = 293.06, p < 0.001. Reject H₀.
Business interpretation: The pricing differential across products is statistically confirmed, not random. Business Loan Std averages 6.62%/month versus 4.26% for LPO Financing and 4.09% for BL Deferred Principal. For the credit committee: “Our tiered pricing is working as designed — each product carries a genuinely distinct rate band, and these differences are highly consistent. This pricing architecture is statistically valid.”
7.4 Hypothesis 2 — Is Loan Lifecycle Stage Associated with Arrears?
H₀: Arrears status is independent of loan lifecycle stage (New / Renewal / Top Up). H₁: Arrears status is associated with loan lifecycle stage. Test: Pearson Chi-squared test of independence. α = 0.05
count sum mean
loan_type
New 296 40 0.135
Renewal 213 22 0.103
Top Up 120 14 0.117
Result: χ²(2) = 1.207, p = 0.547, Cramér’s V = 0.044. Fail to reject H₀.
Business interpretation: This null result is among the most practically important findings in the study. Loan lifecycle stage — whether a loan is New (13.5%), Renewal (10.3%), or Top Up (11.7%) — is not significantly associated with arrears. The differences are negligible and entirely consistent with random sampling variation. For the credit committee: “Do not introduce lifecycle stage as a credit screening criterion. The data provides no evidence that Renewals or Top Ups are riskier than new loans. Such a rule would add process friction without improving risk detection — resources are better directed at installment count and facility size.”
8. Technique 4 — Correlation Analysis
8.1 Theory
Correlation analysis measures the strength and direction of association between pairs of numeric variables. Spearman’s ρ is used throughout as the non-parametric alternative appropriate when data are skewed. The foundational caveat: correlation is not causation — a significant correlation may reflect a common underlying driver rather than a direct relationship (Adi, 2026, Ch. 8).
8.2 Business Justification
Understanding how loan size, pricing, and installment structure relate to one another drives every credit strategy conversation I participate in. The positive correlation between installment count and interest rate (ρ = +0.372) suggests the organisation’s product design already partially encodes installment risk — but this has not been made explicit in the pre-disbursement approval process.
1. Loan Amount ↔︎ Interest Rate (ρ = −0.563, moderate-strong negative, p < 0.001) The strongest correlation: larger loans attract lower monthly rates. This is consistent with tiered pricing where larger, typically better-collateralised clients negotiate lower rates. BL Deferred Principal loans (large amounts, lowest rates, 0% arrears) drive much of this relationship. Business implication: size and pricing are jointly reflecting credit quality, but neither alone is a sufficient risk predictor.
2. Interest Rate ↔︎ Number of Installments (ρ = +0.372, moderate positive, p < 0.001) Higher-rate loans tend to have more installments. This suggests the organisation’s product design already partially encodes repayment risk in its installment structure — longer repayment schedules accompany riskier, higher-rate facilities. This relationship explains why installment count predicts arrears even after controlling for rate in the regression.
3. Interest Rate ↔︎ In Arrears (ρ = +0.146, small positive, p < 0.001) Higher rates are modestly associated with arrears. Rate partially captures credit risk but is insufficient as a standalone screening tool — confirmed by its non-significance in the multivariate regression.
4. Number of Installments ↔︎ In Arrears (ρ = +0.136, small positive, p < 0.001) More installments are associated with higher arrears probability. With ρ = 0.136, the raw correlation is modest, but the installment coefficient in the regression (OR = 1.32, p < 0.001) confirms this relationship holds independently when other variables are controlled.
Causation note: The installments–arrears correlation likely reflects borrower credit quality (weaker borrowers may negotiate longer repayment windows), not a direct effect of installment count on default behaviour. Assigning more installments does not cause arrears — both may be driven by the borrower’s underlying financial strength. Controlling experiments would be needed to establish causation.
9. Technique 5 — Logistic Regression
9.1 Theory
Logistic regression models the log-odds of a binary outcome as a linear function of predictor variables. It is appropriate when the outcome is categorical (arrears = 1, performing = 0). Coefficients are exponentiated to yield odds ratios (ORs): OR > 1 increases arrears odds; OR < 1 decreases them. Performance is assessed via AUC — the probability that the model ranks a randomly chosen arrears loan above a randomly chosen performing loan. An AUC of 0.668 means the model achieves this 66.8% of the time, compared to 50% by random chance (Adi, 2026, Ch. 13).
Note on perfect separation: BL Deferred Principal loans have 0% arrears across all 32 loans. Including this product as a dummy predictor causes complete separation — the algorithm would assign an infinitely negative coefficient with unreliable standard errors. This product is therefore excluded from the regression model but fully documented in Sections 5 and 6. Its 0% arrears rate is an important finding: the Deferred Principal structure is associated with near-zero credit loss.
9.2 Business Justification
The operational goal is a pre-disbursement risk score using only information available at the point of loan approval. With an AUC of 0.668, this model correctly ranks a risky loan above a safe loan about 67% of the time — meaningfully better than guessing, and sufficient to serve as a triage tool flagging applications for secondary review.
9.4 Business Interpretation of Significant Coefficients
The model achieves AUC = 0.668, using only information available at the point of disbursement. Two predictors are statistically significant at α = 0.05:
Number of Installments (OR = 1.32, p < 0.001): Each additional installment multiplies the odds of arrears by 1.32, holding all other factors constant. For a non-technical manager: “A loan structured over 9 monthly installments is approximately 1.32³ = 2.3× more likely to fall into arrears than an equivalent 6-installment loan. A 12-installment loan is 1.32⁶ = 5.3× riskier. Number of installments is the most reliable arrears signal available at approval time.”
Log Loan Amount (OR = 0.64, p = 0.015): Each unit increase in log loan amount reduces arrears odds by 36%. In plain terms: larger loans are less likely to default. Loans below ₦10M carry a 42.9% arrears rate, while loans above ₦100M carry only 4.9%. For the credit committee: “Smaller facilities carry disproportionate credit risk — likely because smaller borrowers have less financial resilience and receive less intensive underwriting. Size is a meaningful pre-disbursement risk signal.”
Deployment recommendation: Implement two pre-disbursement flags in the loan approval checklist: (1) Number of installments ≥ 9 → mandatory secondary credit committee review; (2) Loan amount below ₦10M → enhanced financial documentation required. Both are based on information already captured in the LMS at the point of application. No new data infrastructure is needed.
10. Integrated Findings
The five analytical techniques form a chain of evidence, each reinforcing and extending the others.
EDA established the data landscape — 629 loans over twelve clean calendar months — and correctly classified 331 missing closure dates as open loans, not data errors. It also profiled Number of Installments for the first time, revealing that while most loans use 6 installments (73.9%), the risk-significant tails at 9 and 12 installments carry arrears rates nearly three times the portfolio average. Visualisation converted these patterns into strategic clarity: BL Deferred Principal loans (0% arrears across 32 loans) and 12-installment facilities (33.3% arrears) represent the two ends of the risk spectrum.
Hypothesis testing added statistical rigour — confirming that tiered pricing across products is real (Kruskal-Wallis p < 0.001), and delivering an equally important null result: loan lifecycle stage does not predict arrears (p = 0.547). This null finding prevents a misguided policy intervention and illustrates that hypothesis testing’s value lies as much in ruling things out as in confirming them. Correlation analysis revealed a new structural relationship: installment count and rate are positively correlated (ρ = +0.372), suggesting the organisation’s product design already partially prices installment risk — but without making this explicit as a disbursement-stage gate. Logistic regression unified all variables into a predictive framework, with Number of Installments (OR = 1.32) and Loan Amount (OR = 0.64) as the only two statistically significant independent predictors. AUC = 0.668 confirms this constitutes a useful, deployable triage tool.
Single integrated recommendation: Introduce a Pre-Disbursement Risk Checklist with two mandatory escalation triggers: (1) any loan application with 9 or more installments → secondary credit committee sign-off required; (2) any loan below ₦10M → enhanced financial statement and collateral documentation required. These rules target both statistically confirmed risk predictors, are operationally actionable without new systems, and are defensible to regulators on the basis of the evidence assembled across all five techniques in this study.
11. Limitations & Further Work
1. BL Deferred Principal excluded from regression. The 0% arrears rate causes perfect separation in the logistic model, requiring exclusion of this product. With a larger historical sample, Firth’s penalised logistic regression would handle this without dropping the product.
2. Class imbalance (88% performing, 12% arrears). The model underperforms at identifying true arrears cases (low sensitivity). With more time, SMOTE oversampling or cost-sensitive learning would improve recall for the arrears class — critical when the cost of a missed default exceeds the cost of a false alarm.
3. Open loans may not yet have revealed their final state. Of the 331 still-open loans (51.5% of the portfolio), some Active loans may eventually fall into arrears after the extraction date. The true 12-month arrears rate could therefore be higher than 12.08% once all loans close. A follow-up extraction in 6 months would provide a more complete picture.
4. No borrower-level covariates. Industry sector, years in operation, prior credit history, and collateral type — standard credit application fields — are absent. Incorporating these from credit files would likely push AUC above 0.80 and enable more precise individual loan scoring.
5. Installment count determinants are unobserved. Understanding why some borrowers receive 9–12 installments (negotiated? product-driven? credit committee discretion?) would clarify whether the arrears association is causal or mediated by borrower credit quality. This warrants qualitative investigation alongside the quantitative evidence.
References
Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online
Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048
[Anurika Orabuche]. (2026). Business loan portfolio dataset — May 2025 to April 2026 [Dataset]. Collected from a Nigerian commercial lending institution. Data available on request from the author.
McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/
Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.
Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Wilkinson, L. (2005). The grammar of graphics (2nd ed.). Springer.
Appendix: AI Usage Statement
Claude (Anthropic) was used during the preparation of this assignment to assist with structuring the Quarto document template, generating R and Python code scaffolding for the five analytical sections, and advising on the appropriate handling of perfect separation in the logistic regression. All analytical decisions — the choice of a twelve-month study window, the correct interpretation of missing closure dates as open loans, the identification of Number of Installments as the primary new predictor variable, the choice and justification of each hypothesis test, the interpretation of the null result for Loan Type, the interpretation of regression coefficients, the integrated recommendation regarding installment count and loan size thresholds, and all limitations identified — were made independently by the author based on domain knowledge and direct review of all model outputs. The dataset was collected, extracted, and verified by the author. The organisation name has been withheld to protect commercial confidentiality; no other data modifications were made. The author takes full responsibility for all conclusions presented and is prepared to explain and defend every result in the viva voce examination.
Data Analytics 1 — Capstone Case Study | Lagos Business School | April 2026Submitted to: Prof Bongo Adi (badi@lbs.edu.ng)