Predicting Business Loan Arrears: An Exploratory and Inferential Analytics Study of a Nigerian Commercial Lending Portfolio (March 2025 – May 2026)
Author
[Anurika Orabuche]
Published
May 13, 2026
1. Executive Summary
This study analyses 661 business loans disbursed by a lending company between March 2025 and May 2026, representing a total portfolio value of ₦40.53 billion. With 77 loans (11.65%, totalling approximately ₦2.94 billion) currently in arrears, credit risk management has become a central operational priority. Monthly disbursements grew from under ₦3 billion in mid-2025 to over ₦6 billion in April 2026, reflecting robust portfolio expansion that demands proportionately rigorous risk monitoring.
Five analytical techniques are applied to a real dataset extracted directly from the organisation’s loan management system. Exploratory data analysis reveals that loan amounts are severely right-skewed (median ₦33M, mean ₦61.3M, max ₦300M) and identifies two data quality issues. Data visualisation tells a coherent portfolio story: arrears are concentrated in standard Business Loans (13.6%), loans with 12 installments (31.3%), and facilities below ₦10M (50%). Hypothesis testing formally confirms that interest rates differ significantly across loan product categories (Kruskal-Wallis, p < 0.001), while loan lifecycle stage (New/Renewal/Top Up) is not significantly associated with arrears status (p = 0.459) — an instructive null result. Correlation analysis uncovers a strong negative rate–amount relationship (Spearman ρ = −0.565) and a meaningful positive installments–rate correlation (ρ = +0.354). Logistic regression identifies Number of Installments (OR = 1.33, p < 0.001) and log Loan Amount (OR = 0.64, p = 0.016) as the two statistically significant predictors of arrears, producing a model AUC of 0.685.
The primary recommendation is that the organisation introduce a pre-disbursement review trigger for loans with nine or more installments, and apply enhanced due diligence to facilities below ₦10M — the two most concentrated arrears risk dimensions identified by the data.
2. Professional Disclosure
Job Title: [Head of Internal Control and Audit]
Organisation Type: [Microfinance Bank / Fintech], Nigeria
Sector: Financial Services
Technique Justifications
Exploratory Data Analysis (EDA): In my role I review loan application files and monthly portfolio performance reports. EDA is the indispensable first step: before drawing any conclusions, I must understand how loan amounts are distributed, identify missing or inconsistent data in the loan management system, and detect outliers that could distort model results. The dataset here contains activation timestamps with time-of-day components and a right-skewed amount distribution — both of which require operational decisions before analysis can proceed. These findings feed directly into IT and operations escalations about LMS data entry controls.
Data Visualisation: Credit committee presentations and board risk briefings demand visual storytelling. A chart showing that loans with 12 installments carry a 31.3% arrears rate communicates risk concentration in seconds. In my day-to-day role, monthly portfolio dashboards are the primary tool for briefing senior leadership — the five-plot narrative here directly mirrors that professional context.
Hypothesis Testing: Before recommending a policy change — such as mandatory secondary review for high-installment loans — I must demonstrate that the observed arrears rate difference is statistically significant. Equally important is the null result: the finding that loan lifecycle stage (New/Renewal/Top Up) is not significantly associated with arrears (p = 0.459) prevents a misguided policy that would flag all Renewal loans, when the data shows this distinction does not predict default.
Correlation Analysis: The pricing and structural architecture of the portfolio — who gets how many installments, at what rate, for what loan size — is central to credit strategy discussions I participate in. Discovering that interest rate and number of installments are moderately correlated (ρ = +0.354) raises the question of whether installment structure is acting as a risk proxy in our pricing model. These are conversations that require data.
Logistic Regression: The operational goal is a pre-disbursement risk score: given what is observable about a loan at approval time, what is the probability it falls into arrears? Logistic regression produces that probability. The model’s installment coefficient (OR = 1.33 per additional installment) already provides a concrete rule that can be embedded in the loan approval workflow today.
3. Data Collection & Sampling
3.1 Source and Collection Method
The dataset was extracted from the organisation’s internal Loan Management System (LMS), which records all loan approvals, disbursements, and repayment events. The extract was generated via a direct database query covering loans with an Activation Date between 5 March 2025 and 8 May 2026, exported as a Microsoft Excel file, and loaded into R for analysis. No external or publicly available datasets supplement this analysis.
3.2 Sampling Frame
The sampling frame is the complete population of business loans disbursed during the fifteen-month study window — a census of all 661 loan records activated in the LMS during that period.
3.3 Variables
| Variable | Type | Description |
| --- | --- | --- |
| activation_date | Date/time | Timestamp of loan approval and activation |
| first_repayment_date | Date | Scheduled date of first repayment |
| closed_date | Date | Date loan was fully repaid (NA if still open) |
| account_state | Categorical | Current status: Active, Closed, In Arrears |
| loan_name | Categorical | Full product name (4 categories) |
| loan_amount | Numeric (₦) | Principal disbursed |
| interest_rate | Numeric (%) | Monthly interest rate |
| num_installments | Integer | Number of scheduled repayment installments |
| loan_type | Categorical | Lifecycle stage: New, Renewal, Top Up |
| duration_days | Numeric | Days from activation to closure (derived; NA if open) |
| days_to_first_repayment | Numeric | Days from activation to first repayment (derived) |
| in_arrears | Binary 0/1 | 1 if account_state == "In Arrears" (derived outcome) |
3.4 Time Period and Ethics
Period: 5 March 2025 to 8 May 2026 (approximately 15 months). All data relates to corporate business loan accounts; no personally identifiable information relating to natural persons is included. Loan identifiers are sequential codes. The analysis was conducted with the knowledge and approval of the relevant department head. Data is available on request from the author.
5. Technique 1: Exploratory Data Analysis
5.1 Theory
Exploratory Data Analysis (EDA), formalised by Tukey (1977), is the systematic use of statistical summaries and graphical displays to understand a dataset’s structure before applying formal models. Key activities include distributional assessment, outlier detection, missing-value quantification, and data quality auditing. Adi (2026) uses Anscombe’s Quartet to illustrate the foundational principle: four datasets with identical summary statistics can exhibit radically different structures when visualised — making combined numeric and visual EDA essential before any inferential work.
5.2 Business Justification
A data audit is the first step in every portfolio review I conduct. The consequences of skipping EDA are real: the activation timestamps in this dataset include time-of-day components that must be stripped to date-only format before any monthly aggregation; loan amounts are severely right-skewed, meaning mean-based portfolio summaries overstate the typical facility by 86%. Both issues would silently corrupt downstream analyses.
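# Missing-value map for the key analysis variables.
# vis_miss() is provided by the naniar package (it also ships in visdat).
library(naniar)

vis_miss(
  df |> select(loan_amount, interest_rate, num_installments,
               days_to_first_repayment, closed_date, duration_days)
) +
  labs(
    title = "Figure 1: Missing value map",
    subtitle = "closed_date and duration_days missing only for Active / In Arrears loans (structurally expected)"
  ) +
  theme_minimal(base_size = 11)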
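| # | Issue | Detail | Action |
| --- | --- | --- | --- |
| 1 | Activation timestamps include time-of-day components | Must be stripped to date-only format before any monthly aggregation | Applied floor_date(activation_date, "month") in R; dt.to_period("M") in Python |
| 2 | Missing Closed Date (358 / 54.2%) | All Active (281) and In Arrears (77) loans have no closure date because they are still open | Retained; duration_days = NA structurally for open loans (documented, not imputed) |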
Notable: No Loan Type encoding issues. Unlike earlier datasets, Loan Type is clean with exactly three consistent categories — New (309), Renewal (223), Top Up (129). No standardisation required.
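The timestamp truncation described above can be sketched with the Python standard library alone. This is a minimal illustration of the idea behind floor_date(..., "month"), not the code used in the study: the helper name floor_to_month and the three sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime


def floor_to_month(ts: datetime) -> datetime:
    """Truncate a timestamp to midnight on the first day of its month."""
    return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


# Hypothetical activation timestamps (illustrative only, not LMS records)
activations = [
    datetime(2025, 3, 5, 9, 41, 12),
    datetime(2025, 3, 28, 16, 2, 55),
    datetime(2025, 4, 1, 8, 15, 0),
]

# Once the time-of-day component is removed, loans group cleanly by month
monthly_counts = Counter(floor_to_month(ts) for ts in activations)
for month, n_loans in sorted(monthly_counts.items()):
    print(month.strftime("%b %Y"), n_loans)
```

Grouping on the truncated value rather than the raw timestamp is what keeps two loans activated at different times on different days of the same month in one monthly bucket.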
5.5 Business Interpretation
Three findings have direct operational consequences. First, activation timestamps include time-of-day precision that is inconsistent with how the LMS is used in practice — loans are approved within business hours but timestamps vary, suggesting the system records server time rather than user action time. This is a reporting system note, not an analysis barrier. Second, loan amounts are severely right-skewed (skewness = 1.975): the mean (₦61.3M) is 86% above the median (₦33M), meaning any board report using average loan size will significantly overstate the typical facility. Third, the Number of Installments distribution is strongly concentrated at 6 (74% of loans), with a long tail at 9 (4.5%) and 12 (2.4%) — these tail categories carry arrears rates of 20% and 31.3% respectively, making them high-priority segments for risk monitoring despite their small share of volume.
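The mean-versus-median gap quoted above is simple arithmetic on the reported figures, and the skewness statistic can be sketched alongside it. The function below uses the plain moment coefficient of skewness, which is an assumption: the 1.975 reported in the text may come from a different (bias-adjusted) variant. The toy amounts are hypothetical and are not drawn from the portfolio.

```python
import statistics

# Figures reported in the text (₦ million)
MEAN_LOAN_M = 61.3
MEDIAN_LOAN_M = 33.0

# How far the mean sits above the median, as a share of the median
overstatement = MEAN_LOAN_M / MEDIAN_LOAN_M - 1
print(f"Mean exceeds median by {overstatement:.0%}")


def moment_skewness(values):
    """Moment coefficient of skewness: m3 / m2^1.5 (positive = right-skewed)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed loan amounts (₦ million), for illustration only
toy_amounts_m = [5, 8, 12, 20, 33, 33, 45, 60, 120, 300]
print(round(moment_skewness(toy_amounts_m), 2))
```

A positive value signals a long right tail, which is the situation in which the median is the safer headline figure for a board report.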
6. Technique 2 — Data Visualisation
6.1 Theory
The grammar of graphics (Wilkinson, 2005), implemented in R’s ggplot2, provides a principled framework for constructing statistical graphics by mapping data attributes to visual channels: position, colour, size, and shape. Effective business visualisation requires selecting chart types matched to the relationship being shown, eliminating non-data ink, and ensuring the key message is legible without specialist training (Adi, 2026, Ch. 5). The five plots below form a deliberate narrative arc — from portfolio growth, through structural composition, to risk concentration.
6.2 Business Justification
Risk committee briefings depend on visual clarity. The finding that 12-installment loans have a 31.3% arrears rate — versus 12.1% for standard 6-installment loans — communicates the risk gradient in one glance. In my professional role, monthly portfolio dashboards are the primary briefing tool for senior leadership, and the chart selection here reflects that context directly.
6.3 Five-Plot Visual Narrative
The five plots tell one connected story: the portfolio has grown rapidly and is structurally healthy, but a specific risk concentration defined by installment count and facility size demands targeted attention.
# Aggregate disbursements by activation month
df_monthly <- df |>
  group_by(month) |>
  summarise(count = n(), total_bn = sum(loan_amount) / 1e9, .groups = "drop")

# Bars = value disbursed (left axis); line = loan count, rescaled by 15 for the right axis
ggplot(df_monthly, aes(x = month)) +
  geom_col(aes(y = total_bn), fill = "#378ADD", alpha = 0.85, width = 20) +
  geom_line(aes(y = count / 15), colour = "#E24B4A", linewidth = 1.1, group = 1) +
  geom_point(aes(y = count / 15), colour = "#E24B4A", size = 2.8) +
  scale_y_continuous(
    name = "Total disbursed (₦ billion)",
    sec.axis = sec_axis(~ . * 15, name = "Number of loans")
  ) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  labs(
    title = "Figure 3: Monthly loan disbursements, March 2025 to May 2026",
    subtitle = "Bars = total value (₦B, left axis) | Line = loan count (right axis)",
    x = NULL
  ) +
  theme_minimal(base_size = 11) +
  theme(axis.text.x = element_text(angle = 35, hjust = 1))
# Violins = amount distribution per product; points = individual loans coloured by arrears status
ggplot(df, aes(x = loan_name_short, y = loan_amount / 1e6, fill = loan_name_short)) +
  geom_violin(alpha = 0.55, trim = TRUE, colour = "white") +
  geom_jitter(aes(colour = factor(in_arrears)), width = 0.12, size = 1.6, alpha = 0.8) +
  scale_fill_manual(values = c(
    "Business Loan Std" = "#378ADD",
    "LPO Financing" = "#EF9F27",
    "BL Quarterly Deferred" = "#1D9E75",
    "BL Deferred Principal" = "#A32D2D"
  )) +
  scale_colour_manual(
    values = c("0" = "#888780", "1" = "#E24B4A"),
    labels = c("0" = "Performing", "1" = "In Arrears"),
    name = "Status"
  ) +
  labs(
    title = "Figure 4: Loan amount and arrears status by product type",
    subtitle = "Red dots = loans currently in arrears | BL Deferred Principal: 0% arrears across all 32 loans",
    x = "Product", y = "Loan amount (₦M)"
  ) +
  coord_flip() +
  theme_minimal(base_size = 11) +
  theme(legend.position = "bottom")
# Overlaid densities of the monthly rate, one per product
ggplot(df, aes(x = interest_rate, fill = loan_name_short)) +
  geom_density(alpha = 0.55, colour = "white") +
  scale_fill_manual(
    values = c(
      "Business Loan Std" = "#378ADD",
      "LPO Financing" = "#EF9F27",
      "BL Quarterly Deferred" = "#1D9E75",
      "BL Deferred Principal" = "#A32D2D"
    ),
    name = "Product"
  ) +
  labs(
    title = "Figure 5: Interest rate density by product type",
    subtitle = "Business Loan Std clusters at 6.36–6.83% | LPO and Deferred products carry lower rates (3.5–5%)",
    x = "Monthly interest rate (%)", y = "Density"
  ) +
  theme_minimal(base_size = 11) +
  theme(legend.position = "bottom")
library(scales)  # percent()
library(glue)    # glue()

# Arrears rate and loan count for each installment structure
arrears_install <- df |>
  filter(num_installments %in% c(1, 2, 3, 6, 9, 12)) |>
  group_by(num_installments) |>
  summarise(n = n(), rate = mean(in_arrears), .groups = "drop")

portfolio_avg <- mean(df$in_arrears)

# Bars above the portfolio average are drawn in red, the rest in green
ggplot(arrears_install, aes(x = factor(num_installments), y = rate,
                            fill = rate > portfolio_avg)) +
  geom_col(width = 0.6, alpha = 0.85) +
  geom_hline(yintercept = portfolio_avg,
             linetype = "dashed", colour = "#888780", linewidth = 0.8) +
  geom_text(aes(label = paste0(round(rate * 100, 1), "%\nn=", n)),
            vjust = -0.25, size = 3.2) +
  scale_fill_manual(values = c("FALSE" = "#1D9E75", "TRUE" = "#E24B4A"), guide = "none") +
  scale_y_continuous(labels = percent, limits = c(0, 0.42)) +
  annotate("text", x = 0.65, y = portfolio_avg + 0.02,
           label = glue("Portfolio avg {percent(portfolio_avg, 0.1)}"),
           size = 3, colour = "#888780") +
  labs(
    title = "Figure 6: Arrears rate by number of installments",
    subtitle = "12-installment loans carry a 31.3% arrears rate, about 2.7× the portfolio average",
    x = "Number of installments", y = "Arrears rate"
  ) +
  theme_minimal(base_size = 11)
# Scatter of rate against amount (log x-axis); trend line fitted to performing loans only.
# comma comes from the scales package.
ggplot(df, aes(x = loan_amount / 1e6, y = interest_rate,
               colour = factor(in_arrears))) +
  geom_point(alpha = 0.55, size = 1.8) +
  geom_smooth(data = filter(df, in_arrears == 0), method = "lm", se = TRUE,
              colour = "#378ADD", fill = "#B5D4F4", linewidth = 0.8) +
  scale_colour_manual(
    values = c("0" = "#378ADD", "1" = "#E24B4A"),
    labels = c("0" = "Performing", "1" = "In Arrears"),
    name = "Status"
  ) +
  scale_x_log10(labels = comma) +
  labs(
    title = "Figure 7: Loan amount vs interest rate, performing vs arrears",
    subtitle = "Log scale. Arrears loans (red) cluster at smaller amounts and varied rates. Trend = performing loans.",
    x = "Loan amount (₦M, log scale)", y = "Monthly rate (%)"
  ) +
  theme_minimal(base_size = 11) +
  theme(legend.position = "bottom")
import seaborn as sns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(13, 4))

# Left panel: arrears rate by installment count, coloured against the portfolio average
arr_install = (
    df_py[df_py["num_installments"].isin([1, 2, 3, 6, 9, 12])]
    .groupby("num_installments")["in_arrears"]
    .agg(["mean", "count"])
    .reset_index()
)
colours = [
    "#E24B4A" if v > df_py["in_arrears"].mean() else "#1D9E75"
    for v in arr_install["mean"]
]
axes[0].bar(arr_install["num_installments"].astype(str),
            arr_install["mean"] * 100, color=colours, alpha=0.85)
axes[0].axhline(df_py["in_arrears"].mean() * 100, linestyle="--",
                color="#888780", linewidth=1, label="Portfolio avg")
axes[0].set_title("Arrears rate by number of installments (%)", fontsize=11)
axes[0].set_xlabel("Installments")
axes[0].legend(fontsize=9)

# Right panel: arrears rate by loan amount band
arr_ab = (
    df_py.groupby("amount_bin", observed=True)["in_arrears"]
    .agg(["mean", "count"])
    .reset_index()
)
axes[1].bar(arr_ab["amount_bin"].astype(str), arr_ab["mean"] * 100,
            color="#E24B4A", alpha=0.75)
axes[1].set_title("Arrears rate by loan amount band (%)", fontsize=11)
axes[1].set_xlabel("Amount band")
axes[1].set_ylabel("Arrears rate (%)")

plt.tight_layout()
plt.savefig("viz_py.png", dpi=150, bbox_inches="tight")
plt.show()
6.4 Business Interpretation
The five plots build a single strategic argument. Figure 3 establishes rapid, sustained portfolio growth: disbursements rose from under ₦3B per month in mid-2025 to over ₦6B in April 2026, with loan counts growing from 46 to 93 in the same period. Figure 4 reveals a striking product-level pattern — BL Deferred Principal loans (all 32 of them) show zero arrears, while Business Loan Std (n=500) carries the bulk of portfolio risk. Figure 5 confirms a clear tiered pricing structure, with rates separating cleanly by product. Figure 6 delivers the most operationally actionable finding: 12-installment loans carry a 31.3% arrears rate, and 9-installment loans carry 20% — both well above the portfolio average of 11.65%. Figure 7 shows that arrears loans cluster in the lower-left quadrant (smaller facilities), consistent with the loan amount coefficient found in the regression.
7. Technique 3 — Hypothesis Testing
7.1 Theory
Hypothesis testing provides a formal framework for distinguishing real population-level effects from random sample variation. The analyst specifies H₀ (no effect) and H₁ (an effect exists), selects a test appropriate to data type and distributional assumptions, and evaluates the result using a p-value and effect size. Crucially, failing to reject H₀ is also an informative result — it prevents the analyst from acting on patterns that could be noise. Both significant and null results are reported here, as both carry direct policy implications (Adi, 2026, Ch. 6).
7.2 Business Justification
Two questions are most operationally relevant: (1) Do interest rates genuinely differ across loan product types, or are the observed differences random? (2) Is loan lifecycle stage (New/Renewal/Top Up) associated with arrears? The answers determine whether to maintain the current tiered pricing policy and whether to use lifecycle stage as a risk filter in loan approval.
7.3 Hypothesis 1 — Do Interest Rates Differ Across Loan Product Types?
H₀: The distribution of monthly interest rates is identical across all four loan product categories.
H₁: At least one loan product type has a significantly different rate distribution.
Test: Kruskal-Wallis (the non-parametric alternative to one-way ANOVA), chosen because normality is violated (Shapiro-Wilk: amount W = 0.773, p < 0.001; rate W = 0.913, p = 0.001).
α = 0.05
from scipy import stats as sc

# One array of monthly rates per product category
groups = [g["interest_rate"].values for _, g in df_py.groupby("loan_name_short")]
h, p = sc.kruskal(*groups)
print(f"Kruskal-Wallis (Rate ~ Product): H={h:.3f}, p={p:.2e}")
count mean median std
loan_name_short
BL Deferred Principal 32 4.087 3.500 1.241
BL Quarterly Deferred 47 5.115 4.934 0.688
Business Loan Std 500 6.612 6.360 0.775
LPO Financing 82 4.261 4.200 0.690
Result: Kruskal-Wallis χ²(3) = 300.23, p < 0.001. Reject H₀.
Business interpretation: The interest rate differences across loan products are statistically confirmed — Business Loan Std averages 6.61%/month versus 4.26% for LPO Financing and 4.09% for Deferred Principal. For the credit committee: “Our tiered pricing is working exactly as designed. Each product carries a genuinely distinct rate band, and these differences are far too consistent to be accidental.”
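Because the theory section pairs the p-value with an effect size, a rough effect-size figure can be derived from the reported test statistic. The sketch below assumes the commonly used eta-squared based on H, (H − k + 1) / (n − k), with H = 300.23, k = 4 products and n = 661 loans taken from the text. The resulting value is a derived illustration, not a figure reported by the study.

```python
def eta_squared_h(h_statistic: float, n_groups: int, n_obs: int) -> float:
    """Eta-squared based on the Kruskal-Wallis H statistic.

    Approximate share of rank variance in the outcome explained by group.
    """
    return (h_statistic - n_groups + 1) / (n_obs - n_groups)


# Inputs taken from the reported result: H = 300.23, 4 products, 661 loans
effect = eta_squared_h(300.23, n_groups=4, n_obs=661)
print(round(effect, 3))
```

Under this assumption, product category would account for a large share of the rank variation in rates, which is in line with the cleanly separated rate bands in the summary table.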
7.4 Hypothesis 2 — Is Loan Lifecycle Stage Associated with Arrears?
H₀: Arrears status is independent of loan lifecycle stage (New / Renewal / Top Up).
H₁: Arrears status is associated with loan lifecycle stage.
Test: Pearson Chi-squared test of independence.
α = 0.05
count sum mean
loan_type
New 309 41 0.133
Renewal 223 22 0.099
Top Up 129 14 0.109
Result: χ²(2) = 1.556, p = 0.459, Cramér’s V = 0.049. Fail to reject H₀.
Business interpretation: This null result is one of the most practically important findings in the study. Loan lifecycle stage — whether a loan is New, Renewal, or Top Up — is not significantly associated with arrears. The arrears rates are 13.3%, 9.9%, and 10.9% respectively — modest differences that are entirely consistent with random sampling variation. For the credit committee: “Do not use Renewal or Top Up status as a credit screening criterion. The data provides no evidence that these categories predict default differently from new loans. Doing so would add process friction without improving risk detection.”
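The reported statistic can be re-derived by hand from the counts in the table above. The sketch below uses only the standard library. The closed form p = exp(−χ²/2) is specific to 2 degrees of freedom, so it applies to this 3 × 2 table but not in general (scipy.stats.chi2_contingency would be the usual general-purpose route). Variable names are illustrative.

```python
import math

# (in arrears, total loans) per lifecycle stage, from the table above
counts = {"New": (41, 309), "Renewal": (22, 223), "Top Up": (14, 129)}

n_total = sum(total for _, total in counts.values())
n_arrears = sum(arrears for arrears, _ in counts.values())
n_performing = n_total - n_arrears

chi2 = 0.0
for arrears, total in counts.values():
    performing = total - arrears
    # Expected counts under independence: row total × column total / N
    exp_arrears = total * n_arrears / n_total
    exp_performing = total * n_performing / n_total
    chi2 += (arrears - exp_arrears) ** 2 / exp_arrears
    chi2 += (performing - exp_performing) ** 2 / exp_performing

# With df = (3 - 1) × (2 - 1) = 2, the chi-squared survival function is exp(-x / 2)
p_value = math.exp(-chi2 / 2)

# Cramér's V; min(rows - 1, cols - 1) = 1 for a 3 × 2 table
cramers_v = math.sqrt(chi2 / n_total)

print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}, V = {cramers_v:.3f}")
```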
8. Technique 4 — Correlation Analysis
8.1 Theory
Correlation analysis measures the strength and direction of association between pairs of numeric variables. Spearman’s ρ is used throughout this study as the non-parametric rank-based alternative appropriate when data are skewed. Partial correlation can isolate a relationship while controlling for a third variable. The foundational caveat: correlation is not causation — a significant correlation may reflect a common underlying driver rather than a direct causal relationship (Adi, 2026, Ch. 8).
8.2 Business Justification
The pricing and structural architecture of the portfolio — who gets how many installments, at what rate, for what loan size — drives every credit strategy discussion I participate in. The discovery that number of installments and interest rate are positively correlated (ρ = +0.354) raises an important strategic question: is our installment structure acting as an informal proxy for credit risk in our pricing model, and if so, is that proxy calibrated correctly?
1. Loan Amount ↔︎ Interest Rate (ρ = −0.565, moderate-strong negative, p < 0.001): The strongest correlation in the dataset: larger loans attract lower monthly rates. With a ρ stronger than in earlier datasets (−0.46 in Dataset 2), the tiered pricing signal is confirmed robustly. The business implication: our pricing model appropriately differentiates by size, and Deferred Principal products (larger loans, lowest rates) are driving much of this relationship.
2. Interest Rate ↔︎ Number of Installments (ρ = +0.354, moderate positive, p < 0.001): A new and analytically important finding. Higher-rate loans tend to have more installments, suggesting either that longer repayment structures are associated with riskier credit profiles or that the organisation uses installment count as part of its risk-differentiated product design. Either way, this correlation partly explains why installment count predicts arrears in the regression.
3. Interest Rate ↔︎ In Arrears (ρ = +0.148, small positive, p < 0.001): Higher-rate loans are modestly more likely to be in arrears. Rate captures some credit risk but is not a sufficient standalone predictor, and it is not among the statistically significant predictors in the regression.
4. Number of Installments ↔︎ In Arrears (ρ = +0.123, small positive, p < 0.01): More installments are associated with higher arrears probability. This is likely mediated in part by the rate–installment relationship (larger installment counts accompany higher rates, which in turn are riskier), but the installment coefficient remains significant in the regression even after controlling for rate.
Causation note: The rate–amount and installment–arrears correlations both have plausible causal mechanisms, but also plausible confounders (borrower credit quality simultaneously affecting loan size, rate, and structure). A controlled experiment — holding credit quality constant while varying installment count — would be required to establish causation. As Adi (2026) emphasises, correlation identifies patterns worth investigating, not mechanisms to act on in isolation.
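The significance levels attached to the two smaller correlations can be sanity-checked from ρ and n alone. The sketch below converts ρ to a t statistic and, because n = 661 is large, uses a normal approximation for the two-sided p-value. This is an approximation chosen to keep the example dependency-free, not necessarily the method the statistical software applied.

```python
import math


def spearman_p_value(rho: float, n: int) -> float:
    """Approximate two-sided p-value for a Spearman correlation.

    Uses t = rho * sqrt((n - 2) / (1 - rho^2)) and treats t as standard
    normal, which is reasonable only for large n.
    """
    t_stat = rho * math.sqrt((n - 2) / (1 - rho ** 2))
    return math.erfc(abs(t_stat) / math.sqrt(2))


N_LOANS = 661
reported = [("rate vs arrears", 0.148), ("installments vs arrears", 0.123)]
for label, rho in reported:
    print(f"{label}: rho = {rho:+.3f}, p ≈ {spearman_p_value(rho, N_LOANS):.4f}")
```

With 661 observations even a correlation near 0.12 clears conventional significance thresholds, which is why the text treats these as statistically reliable but practically small effects.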
9. Technique 5 — Logistic Regression
9.1 Theory
Logistic regression models the log-odds of a binary outcome as a linear function of predictor variables. Unlike ordinary least squares, it is appropriate when the outcome is categorical (here: arrears = 1, performing = 0). Coefficients are exponentiated to yield odds ratios (ORs): OR > 1 increases arrears odds; OR < 1 decreases them. Model performance is evaluated using the confusion matrix, sensitivity, specificity, and AUC. An AUC > 0.65 indicates useful discriminative power for a credit risk triage tool (Adi, 2026, Ch. 13).
Note on perfect separation: BL Deferred Principal loans have a 0% arrears rate (0 of 32 loans in arrears). Including this product as a dummy predictor causes quasi-complete separation in the logistic regression: because the product perfectly predicts the performing outcome, the maximum-likelihood estimate of its coefficient diverges towards negative infinity and its standard error becomes unreliable. This product is therefore excluded from the regression model but discussed in the EDA and visualisation sections. Its 0% arrears rate is a finding in its own right: Deferred Principal loan structures appear to be systematically safer in this portfolio, although a sample of 32 loans is small.
9.2 Business Justification
The operational goal is a pre-disbursement risk score: given the observable characteristics of a loan at approval time, what is the probability it falls into arrears? With an AUC of 0.685, the model has useful discriminative power as a triage tool — it correctly ranks a randomly chosen arrears loan above a randomly chosen performing loan approximately 68.5% of the time, using only information available before disbursement.
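The ranking interpretation of AUC given above can be made concrete with a brute-force pairwise comparison. The scores below are hypothetical model outputs chosen for illustration. They are not predictions from the study's model, and the function name is an assumption.

```python
def auc_by_pairs(scores_arrears, scores_performing):
    """AUC as the share of (arrears, performing) pairs ranked correctly.

    A pair counts as 1 when the arrears loan has the higher score and as
    0.5 when the two scores tie.
    """
    wins = 0.0
    for s_bad in scores_arrears:
        for s_good in scores_performing:
            if s_bad > s_good:
                wins += 1.0
            elif s_bad == s_good:
                wins += 0.5
    return wins / (len(scores_arrears) * len(scores_performing))


# Hypothetical predicted arrears probabilities (illustrative only)
arrears_scores = [0.30, 0.22, 0.09]
performing_scores = [0.25, 0.10, 0.08, 0.05]

print(auc_by_pairs(arrears_scores, performing_scores))  # 9 of 12 pairs correct -> 0.75
```

An AUC of 0.685 therefore means that roughly two in three such pairs are ordered correctly, which is useful for triage but leaves a substantial share of misrankings.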
9.4 Business Interpretation of Significant Coefficients
The model achieves AUC = 0.685, indicating useful discriminative power using only information available at disbursement. Two predictors are statistically significant at α = 0.05:
Number of Installments (OR = 1.33, p < 0.001): Each additional installment multiplies the odds of arrears by 1.33, holding all other factors constant. For a non-technical manager: “A loan structured with 9 monthly installments has approximately 1.33³ = 2.35× the odds of falling into arrears of an equivalent 6-installment loan, and a 12-installment loan has approximately 1.33⁶ = 5.5× the odds. Installment count is our most reliable arrears signal at the point of approval.”
Log Loan Amount (OR = 0.64, p = 0.016): Each unit increase in log loan amount (approximately a 2.7× increase in naira terms) reduces arrears odds by 36%. In plain terms: larger loans are less likely to default. This is consistent with the amount-band finding in visualisation — loans below ₦10M have a 50% arrears rate, while loans above ₦100M have only 4.7%. For the credit committee: “Facility size is a genuine risk discriminator. Small loans carry disproportionate default risk — likely because smaller borrowers have less financial resilience and receive less rigorous underwriting.”
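The way these two odds ratios compound can be worked through numerically. The sketch below takes OR = 1.33 and OR = 0.64 from the text and assumes the amount was transformed with the natural log, as the "2.7×" remark implies. The 12.1% baseline is the observed 6-installment arrears rate quoted earlier, used here only as an illustrative starting point rather than a model-based baseline.

```python
import math

OR_PER_INSTALLMENT = 1.33   # reported odds ratio per additional installment
OR_PER_LOG_AMOUNT = 0.64    # reported odds ratio per unit of ln(loan amount)


def installment_multiplier(extra_installments: int) -> float:
    """Odds multiplier for a given number of additional installments."""
    return OR_PER_INSTALLMENT ** extra_installments


def amount_multiplier(amount_ratio: float) -> float:
    """Odds multiplier when the loan amount is scaled by amount_ratio."""
    return OR_PER_LOG_AMOUNT ** math.log(amount_ratio)


def apply_multiplier(baseline_prob: float, multiplier: float) -> float:
    """Convert probability to odds, scale the odds, convert back."""
    odds = baseline_prob / (1 - baseline_prob) * multiplier
    return odds / (1 + odds)


print(f"6 -> 9 installments:  odds x {installment_multiplier(3):.2f}")
print(f"6 -> 12 installments: odds x {installment_multiplier(6):.2f}")
print(f"10x larger loan:      odds x {amount_multiplier(10):.2f}")

# Illustrative baseline: observed arrears rate for 6-installment loans
baseline = 0.121
print(f"Implied probability at 9 installments: "
      f"{apply_multiplier(baseline, installment_multiplier(3)):.1%}")
```

The last line shows why odds multipliers should not be read as probability multipliers: scaling the odds by about 2.35 roughly doubles a 12% probability rather than multiplying it by 2.35.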
Deployment recommendation: Implement two pre-disbursement flags: (1) Number of installments ≥ 9 → mandatory secondary credit committee review; (2) Loan amount below ₦10M → require enhanced financial statement documentation. Both flags are derivable from information already captured in the LMS at the point of application and require no new data infrastructure.
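The two flags lend themselves to a very small rule function. The sketch below is one possible shape for such a check, with thresholds taken from the recommendation above. The class, field, and flag names are hypothetical and do not correspond to any actual LMS schema.

```python
from dataclasses import dataclass

INSTALLMENT_REVIEW_THRESHOLD = 9          # flag at 9 or more installments
SMALL_FACILITY_THRESHOLD_NGN = 10_000_000  # flag below ₦10M


@dataclass
class LoanApplication:
    """Hypothetical minimal view of an application at approval time."""
    loan_amount: float       # principal requested, in naira
    num_installments: int    # scheduled repayment installments


def pre_disbursement_flags(app: LoanApplication) -> list[str]:
    """Return the review actions triggered by an application."""
    flags = []
    if app.num_installments >= INSTALLMENT_REVIEW_THRESHOLD:
        flags.append("secondary_credit_committee_review")
    if app.loan_amount < SMALL_FACILITY_THRESHOLD_NGN:
        flags.append("enhanced_financial_statement_documentation")
    return flags


print(pre_disbursement_flags(LoanApplication(8_000_000, 12)))
print(pre_disbursement_flags(LoanApplication(50_000_000, 6)))
```

Keeping the thresholds as named constants makes them easy to recalibrate if later portfolio data shifts the risk concentration.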
10. Integrated Findings
The five analytical techniques build a mutually reinforcing chain of evidence.
EDA established the data landscape — 661 loans over 15 months — and identified two quality issues: time-of-day timestamps requiring truncation, and missing closure dates structurally expected for 358 open loans. It also profiled Number of Installments for the first time, revealing a concentration at 6 (74%) with meaningful long-tail risk at 9 and 12. Visualisation transformed these patterns into strategic insight: five plots collectively showed that portfolio growth has been sustained and strong, that BL Deferred Principal loans are performing perfectly (0% arrears), and that the installment-count risk gradient is steep and operationally visible.
Hypothesis testing added rigour — confirming that tiered pricing across products is statistically real (Kruskal-Wallis p < 0.001), and delivering an equally important null result: loan lifecycle stage does not predict arrears (p = 0.459), preventing a misguided policy intervention. Correlation analysis revealed a new structural relationship — rate and installment count are positively correlated (ρ = +0.354) — suggesting the organisation’s product design already partially accounts for repayment risk, but has not translated this into an explicit approval-stage risk flag. Logistic regression unified all variables into a single framework, identifying Number of Installments (OR = 1.33) and Loan Amount (OR = 0.64) as the two independently significant predictors, with an AUC of 0.685.
Single integrated recommendation: Introduce a Pre-Disbursement Installment and Size Checklist: any loan application with 9 or more installments, or a facility below ₦10M, should trigger mandatory secondary credit committee review before disbursement. These two criteria together identify the highest-risk segments in the portfolio, are supported by the EDA, visualisation, correlation, and regression results, and can be implemented within the existing approval workflow without new technology.
11. Limitations & Further Work
1. BL Deferred Principal excluded from regression. The 0% arrears rate for this product causes perfect separation in the logistic model, forcing its exclusion from the regression. With a larger sample (n=32 is small) and more time, a penalised regression (Firth’s logistic regression) would handle this without excluding the product.
2. Class imbalance (88% performing, 12% arrears). The logistic model trained on imbalanced data underperforms at identifying true arrears cases (low sensitivity). SMOTE oversampling or cost-sensitive learning would improve recall for the arrears class — particularly important since the cost of a missed arrears case (credit loss) substantially exceeds the cost of a false alarm (unnecessary review).
3. Number of Installments not fully explained. The installment variable is new to this dataset and its determinants are not captured here. Understanding why some borrowers receive 9–12 installments (is it negotiated? product-driven? credit committee discretion?) would clarify whether the arrears association is causal or mediated by an unobserved credit quality variable.
4. No borrower-level covariates. Industry sector, years in operation, prior credit history, and collateral type, all standard credit application fields, are absent. Incorporating these from credit files would be expected to improve the model's discrimination, although the size of the gain can only be established by refitting the model.
5. Fifteen-month window crosses a calendar year. The data spans March 2025 to May 2026, which covers little more than one seasonal demand cycle and is not long enough to model seasonality reliably. A 24–36 month panel with seasonal dummy variables or a time series component would separate cyclical from structural arrears patterns.
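The asymmetric-cost point in limitation 2 can be illustrated with the standard expected-cost rule for choosing a review threshold: flag a loan when the predicted arrears probability p satisfies p × cost of a missed arrears case ≥ (1 − p) × cost of a false alarm. The cost figures below are hypothetical placeholders, since the study does not estimate them.

```python
def review_threshold(cost_false_alarm: float, cost_missed_arrears: float) -> float:
    """Probability cut-off that minimises expected cost.

    Derived from p * cost_missed_arrears >= (1 - p) * cost_false_alarm.
    """
    return cost_false_alarm / (cost_false_alarm + cost_missed_arrears)


def should_review(predicted_prob: float, threshold: float) -> bool:
    """Flag a loan for review when its predicted arrears risk meets the cut-off."""
    return predicted_prob >= threshold


# Hypothetical relative costs: a missed arrears case is 10x a false alarm
threshold = review_threshold(cost_false_alarm=1.0, cost_missed_arrears=10.0)
print(f"Review threshold: {threshold:.3f}")
print(should_review(0.12, threshold), should_review(0.05, threshold))
```

With costs this lopsided the cut-off falls well below 0.5, which raises sensitivity for the arrears class without resampling the training data.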
References
Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online
Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048
[Your Name]. (2026). Business loan portfolio dataset — March 2025 to May 2026 [Dataset]. Collected from [Organisation Name / Department], Lagos, Nigeria. Data available on request from the author.
McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/
Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.
Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Wilkinson, L. (2005). The grammar of graphics (2nd ed.). Springer.
Appendix: AI Usage Statement
Claude (Anthropic) was used during the preparation of this assignment to assist with structuring the Quarto document template, generating R and Python code scaffolding for the five analytical sections, and advising on appropriate handling of the perfect separation issue in the logistic regression (BL Deferred Principal exclusion). All analytical decisions — the identification of Number of Installments as the primary new predictor variable, the choice and justification of each hypothesis test, the interpretation of the null result for Loan Type, the interpretation of regression coefficients, the integrated recommendation regarding installment count and loan size thresholds, and all limitations identified — were made independently by the author based on domain knowledge and direct review of every model output. The dataset was collected, extracted, and verified by the author from the organisation’s internal loan management system. The author takes full responsibility for all conclusions presented and is prepared to explain and defend every result in the viva voce examination.
Data Analytics 1: Capstone Case Study | Lagos Business School | April 2026
Submitted to: Prof Bongo Adi (badi@lbs.edu.ng)