Predicting Business Loan Arrears: An Exploratory and Inferential Analytics Study of a Nigerian Commercial Lending Portfolio (March 2025 – May 2026)

Author

[Anurika Orabuche]

Published

May 13, 2026

1. Executive Summary

This study analyses 661 business loans disbursed by a lending company between March 2025 and May 2026, representing a total portfolio value of ₦40.53 billion. With 77 loans (11.65%, totalling approximately ₦2.94 billion) currently in arrears, credit risk management has become a central operational priority. Monthly disbursements grew from under ₦3 billion in mid-2025 to over ₦6 billion in April 2026, reflecting robust portfolio expansion that demands proportionately rigorous risk monitoring.

Five analytical techniques are applied to a real dataset extracted directly from the organisation’s loan management system. Exploratory data analysis reveals that loan amounts are severely right-skewed (median ₦33M, mean ₦61.3M, max ₦300M) and identifies two data quality issues. Data visualisation tells a coherent portfolio story: arrears are concentrated in standard Business Loans (13.6%), loans with 12 installments (31.3%), and facilities below ₦10M (50%). Hypothesis testing formally confirms that interest rates differ significantly across loan product categories (Kruskal-Wallis, p < 0.001), while loan lifecycle stage (New/Renewal/Top Up) is not significantly associated with arrears status (p = 0.459) — an instructive null result. Correlation analysis uncovers a strong negative rate–amount relationship (Spearman ρ = −0.565) and a meaningful positive installments–rate correlation (ρ = +0.354). Logistic regression identifies Number of Installments (OR = 1.33, p < 0.001) and log Loan Amount (OR = 0.64, p = 0.016) as the two statistically significant predictors of arrears, producing a model AUC of 0.685.

The primary recommendation is that the organisation introduce a pre-disbursement review trigger for loans with nine or more installments, and apply enhanced due diligence to facilities below ₦10M — the two most concentrated arrears risk dimensions identified by the data.

2. Professional Disclosure

Job Title: [Head of Internal Control and Audit] Organisation Type: [ Microfinance Bank / Fintech] — Nigeria Sector: Financial Services

Technique Justifications

Exploratory Data Analysis (EDA): In my role I review loan application files and monthly portfolio performance reports. EDA is the indispensable first step: before drawing any conclusions, I must understand how loan amounts are distributed, identify missing or inconsistent data in the loan management system, and detect outliers that could distort model results. The dataset here contains activation timestamps with time-of-day components and a right-skewed amount distribution — both of which require operational decisions before analysis can proceed. These findings feed directly into IT and operations escalations about LMS data entry controls.

Data Visualisation: Credit committee presentations and board risk briefings demand visual storytelling. A chart showing that loans with 12 installments carry a 31.3% arrears rate communicates risk concentration in seconds. In my day-to-day role, monthly portfolio dashboards are the primary tool for briefing senior leadership — the five-plot narrative here directly mirrors that professional context.

Hypothesis Testing: Before recommending a policy change — such as mandatory secondary review for high-installment loans — I must demonstrate that the observed arrears rate difference is statistically significant. Equally important is the null result: the finding that loan lifecycle stage (New/Renewal/Top Up) is not significantly associated with arrears (p = 0.459) prevents a misguided policy that would flag all Renewal loans, when the data shows this distinction does not predict default.

Correlation Analysis: The pricing and structural architecture of the portfolio — who gets how many installments, at what rate, for what loan size — is central to credit strategy discussions I participate in. Discovering that interest rate and number of installments are moderately correlated (ρ = +0.354) raises the question of whether installment structure is acting as a risk proxy in our pricing model. These are conversations that require data.

Logistic Regression: The operational goal is a pre-disbursement risk score: given what is observable about a loan at approval time, what is the probability it falls into arrears? Logistic regression produces that probability. The model’s installment coefficient (OR = 1.33 per additional installment) already provides a concrete rule that can be embedded in the loan approval workflow today.

3. Data Collection & Sampling

3.1 Source and Collection Method

The dataset was extracted from the organisation’s internal Loan Management System (LMS), which records all loan approvals, disbursements, and repayment events. The extract was generated via a direct database query covering loans with an Activation Date between 5 March 2025 and 8 May 2026, exported as a Microsoft Excel file, and loaded into R for analysis. No external or publicly available datasets supplement this analysis.

3.2 Sampling Frame

The sampling frame is the complete population of business loans disbursed during the fifteen-month study window — a census of all 661 loan records activated in the LMS during that period.

3.3 Variables

Variable	Type	Description
`activation_date`	Date/time	Timestamp of loan approval and activation
`first_repayment_date`	Date	Scheduled date of first repayment
`closed_date`	Date	Date loan was fully repaid (NA if still open)
`account_state`	Categorical	Current status: Active, Closed, In Arrears
`loan_name`	Categorical	Full product name (4 categories)
`loan_amount`	Numeric (₦)	Principal disbursed
`interest_rate`	Numeric (%)	Monthly interest rate
`num_installments`	Integer	Number of scheduled repayment installments
`loan_type`	Categorical	Lifecycle stage: New, Renewal, Top Up
`duration_days`	Numeric	Days from activation to closure (derived; NA if open)
`days_to_first_repayment`	Numeric	Days from activation to first repayment (derived)
`in_arrears`	Binary 0/1	1 if account_state == “In Arrears” (derived outcome)

3.4 Time Period and Ethics

Period: 5 March 2025 to 8 May 2026 (approximately 15 months). All data relates to corporate business loan accounts; no personally identifiable information relating to natural persons is included. Loan identifiers are sequential codes. The analysis was conducted with the knowledge and approval of the relevant department head. Data is available on request from the author.

4. Data Description

4.1 Load and Prepare Data

Code

library(tidyverse)
library(skimr)
library(janitor)
library(lubridate)
library(knitr)
library(kableExtra)
library(glue)
library(scales)

df <- read_csv("loan_data3.csv", show_col_types = FALSE) |>
  mutate(
    activation_date       = ymd(activation_date),
    first_repayment_date  = ymd(first_repayment_date),
    closed_date           = ymd(closed_date),
    in_arrears            = as.integer(in_arrears),
    loan_type             = factor(loan_type, levels = c("New","Renewal","Top Up")),
    loan_name_short = case_when(
      str_detect(loan_name, "LPO")       ~ "LPO Financing",
      str_detect(loan_name, "Quarterly") ~ "BL Quarterly Deferred",
      str_detect(loan_name, "Deferred")  ~ "BL Deferred Principal",
      TRUE                               ~ "Business Loan Std"
    ),
    loan_name_short = factor(loan_name_short,
      levels = c("Business Loan Std","LPO Financing",
                 "BL Quarterly Deferred","BL Deferred Principal")),
    log_amount  = log(loan_amount),
    amount_bin  = cut(loan_amount,
      breaks = c(0, 10e6, 25e6, 50e6, 100e6, Inf),
      labels = c("<₦10M","₦10–25M","₦25–50M","₦50–100M",">₦100M")),
    install_grp = case_when(
      num_installments <= 3  ~ "1–3",
      num_installments <= 6  ~ "4–6",
      num_installments <= 9  ~ "7–9",
      TRUE                   ~ "10–12"
    ),
    month = floor_date(activation_date, "month")
  )

glimpse(df)

Rows: 661
Columns: 17
$ activation_date         <date> 2025-10-04, 2026-01-30, 2025-07-03, 2025-11-1…
$ first_repayment_date    <date> 2025-11-04, 2026-03-05, 2025-08-02, 2025-12-1…
$ closed_date             <date> 2026-01-16, NA, 2025-09-29, 2026-04-23, 2026-…
$ account_state           <chr> "Closed", "Active", "Closed", "Closed", "Close…
$ loan_name               <chr> "Business Loan Deferred Principal", "Business …
$ loan_amount             <dbl> 200000000, 150000000, 50000000, 17000000, 7000…
$ interest_rate           <dbl> 5.740, 5.740, 4.800, 6.830, 6.050, 3.900, 5.74…
$ num_installments        <dbl> 6, 6, 2, 6, 6, 1, 6, 6, 6, 1, 4, 6, 6, 6, 6, 6…
$ loan_type               <fct> Renewal, Top Up, Renewal, New, New, Renewal, T…
$ duration_days           <dbl> 103, NA, 87, 162, 92, 33, 54, 200, 78, 76, NA,…
$ days_to_first_repayment <dbl> 30, 33, 29, 28, 30, 30, 30, 30, 19, 19, 30, 31…
$ in_arrears              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0…
$ loan_name_short         <fct> BL Deferred Principal, Business Loan Std, LPO …
$ log_amount              <dbl> 19.11383, 18.82615, 17.72753, 16.64872, 18.064…
$ amount_bin              <fct> >₦100M, >₦100M, ₦25–50M, ₦10–25M, ₦50–100M, ₦2…
$ install_grp             <chr> "4–6", "4–6", "1–3", "4–6", "4–6", "1–3", "4–6…
$ month                   <date> 2025-10-01, 2026-01-01, 2025-07-01, 2025-11-0…

Code

import pandas as pd
import numpy as np

df_py = pd.read_csv("loan_data3.csv",
    parse_dates=["activation_date","first_repayment_date","closed_date"])

df_py["loan_type"] = pd.Categorical(df_py["loan_type"],
    categories=["New","Renewal","Top Up"])

df_py["loan_name_short"] = df_py["loan_name"].apply(lambda x:
    "LPO Financing"         if "LPO"       in str(x) else
    "BL Quarterly Deferred" if "Quarterly" in str(x) else
    "BL Deferred Principal" if "Deferred"  in str(x) else
    "Business Loan Std")

df_py["log_amount"] = np.log(df_py["loan_amount"])
df_py["month"]      = df_py["activation_date"].dt.to_period("M")
df_py["amount_bin"] = pd.cut(df_py["loan_amount"],
    bins=[0, 10e6, 25e6, 50e6, 100e6, float("inf")],
    labels=["<10M","10-25M","25-50M","50-100M",">100M"])

print(f"Shape: {df_py.shape}")

Shape: (661, 16)

Code

print(df_py.dtypes)

activation_date            datetime64[us]
first_repayment_date       datetime64[us]
closed_date                datetime64[us]
account_state                         str
loan_name                             str
loan_amount                       float64
interest_rate                     float64
num_installments                    int64
loan_type                        category
duration_days                     float64
days_to_first_repayment             int64
in_arrears                          int64
loan_name_short                       str
log_amount                        float64
month                           period[M]
amount_bin                       category
dtype: object

4.2 Summary Statistics

Code

df |>
  select(loan_amount, interest_rate, num_installments,
         days_to_first_repayment, duration_days, in_arrears) |>
  skim() |>
  kable(digits = 2, caption = "Table 1: Summary statistics — all numeric variables") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)

Table 1: Summary statistics — all numeric variables
skim_type	skim_variable	n_missing	complete_rate	numeric.mean	numeric.sd	numeric.p0	numeric.p25	numeric.p50	numeric.p75	numeric.p100	numeric.hist
numeric	loan_amount	0	1.00	61309516.04	64305548.22	846403.00	2.0e+07	3.30e+07	8.50e+07	3.00e+08	▇▂▁▁▁
numeric	interest_rate	0	1.00	6.09	1.23	2.43	5.7e+00	6.36e+00	6.83e+00	8.84e+00	▁▃▅▇▃
numeric	num_installments	0	1.00	5.70	1.92	1.00	6.0e+00	6.00e+00	6.00e+00	1.20e+01	▂▁▇▁▁
numeric	days_to_first_repayment	0	1.00	30.55	5.52	6.00	2.9e+01	3.00e+01	3.10e+01	8.80e+01	▁▇▁▁▁
numeric	duration_days	358	0.46	117.15	55.29	0.00	8.3e+01	1.09e+02	1.61e+02	3.04e+02	▂▇▅▃▁
numeric	in_arrears	0	1.00	0.12	0.32	0.00	0.0e+00	0.00e+00	0.00e+00	1.00e+00	▇▁▁▁▁

4.3 Categorical Profiles

Code

bind_rows(
  df |> count(account_state)   |> mutate(pct=n/sum(n), var="Account State")  |> rename(level=account_state),
  df |> count(loan_type)       |> mutate(pct=n/sum(n), var="Loan Type")      |> rename(level=loan_type),
  df |> count(loan_name_short) |> mutate(pct=n/sum(n), var="Loan Product")   |> rename(level=loan_name_short)
) |>
  mutate(pct = percent(pct, 0.1)) |>
  select(Variable=var, Category=level, n, `%`=pct) |>
  kable(caption = "Table 2: Categorical variable frequencies") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)

Table 2: Categorical variable frequencies
Variable	Category	n	%
Account State	Active	281	42.5%
Account State	Closed	303	45.8%
Account State	In Arrears	77	11.6%
Loan Type	New	309	46.7%
Loan Type	Renewal	223	33.7%
Loan Type	Top Up	129	19.5%
Loan Product	Business Loan Std	500	75.6%
Loan Product	LPO Financing	82	12.4%
Loan Product	BL Quarterly Deferred	47	7.1%
Loan Product	BL Deferred Principal	32	4.8%

5. Technique 1 — Exploratory Data Analysis (EDA)

5.1 Theory

Exploratory Data Analysis (EDA), formalised by Tukey (1977), is the systematic use of statistical summaries and graphical displays to understand a dataset’s structure before applying formal models. Key activities include distributional assessment, outlier detection, missing-value quantification, and data quality auditing. Adi (2026) uses Anscombe’s Quartet to illustrate the foundational principle: four datasets with identical summary statistics can exhibit radically different structures when visualised — making combined numeric and visual EDA essential before any inferential work.

5.2 Business Justification

A data audit is the first step in every portfolio review I conduct. The consequences of skipping EDA are real: the activation timestamps in this dataset include time-of-day components that must be stripped to date-only format before any monthly aggregation; loan amounts are severely right-skewed, meaning mean-based portfolio summaries overstate the typical facility by 86%. Both issues would silently corrupt downstream analyses.

5.3 Analysis

Code

# Show timestamp precision issue in activation_date
raw_sample <- read_csv("loan_data3.csv", show_col_types = FALSE) |>
  select(activation_date) |>
  slice_head(n = 5)

kable(raw_sample,
      caption = "Table 3: Activation dates include time-of-day components (data quality note)") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Table 3: Activation dates include time-of-day components (data quality note)
activation_date
2025-10-04
2026-01-30
2025-07-03
2025-11-11
2025-12-16

Code

library(naniar)

df |>
  miss_var_summary() |>
  filter(n_miss > 0) |>
  kable(digits = 2, caption = "Table 4: Missing value summary") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Table 4: Missing value summary
variable	n_miss	pct_miss
closed_date	358	54.2
duration_days	358	54.2

Code

vis_miss(df |> select(loan_amount, interest_rate, num_installments,
                      days_to_first_repayment, closed_date, duration_days)) +
  labs(
    title    = "Figure 1: Missing value map",
    subtitle = "closed_date and duration_days missing only for Active / In Arrears loans — structurally expected"
  ) +
  theme_minimal(base_size = 11)

Code

q1  <- quantile(df$loan_amount, 0.25)
q3  <- quantile(df$loan_amount, 0.75)
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
lower_fence  <- q1 - 1.5 * iqr
outliers_df  <- df |> filter(loan_amount > upper_fence | loan_amount < lower_fence)

cat(glue(
  "IQR fences: ₦{comma(round(lower_fence))} — ₦{comma(round(upper_fence))}
Outlier loans detected: {nrow(outliers_df)}
Largest single loan: ₦{comma(max(df$loan_amount))}"
))

IQR fences: ₦-77,500,000 — ₦182,500,000
Outlier loans detected: 47
Largest single loan: ₦300,000,000

Code

library(moments)
cat(glue(
  "Loan Amount skewness:       {round(skewness(df$loan_amount), 3)}
Interest Rate skewness:     {round(skewness(df$interest_rate), 3)}
Num Installments skewness:  {round(skewness(df$num_installments), 3)}"
))

Loan Amount skewness:       1.971
Interest Rate skewness:     -0.649
Num Installments skewness:  0.105

Code

# Profile the new Number of Installments variable
df |>
  group_by(num_installments) |>
  summarise(
    n            = n(),
    pct          = percent(n() / nrow(df), 0.1),
    mean_amount  = comma(round(mean(loan_amount))),
    mean_rate    = round(mean(interest_rate), 2),
    arrears_rate = percent(mean(in_arrears), 0.1),
    .groups = "drop"
  ) |>
  kable(col.names = c("Installments","n","% of portfolio",
                      "Mean loan amount","Mean rate (%)","Arrears rate"),
        caption   = "Table 5: Number of Installments — full profile (new variable)") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) |>
  row_spec(which(df |> count(num_installments) |> pull(num_installments) %in% c(9,12)),
           bold = TRUE, background = "#FFF3CD")

Table 5: Number of Installments — full profile (new variable)
Installments	n	% of portfolio	Mean loan amount	Mean rate (%)	Arrears rate
1	24	3.6%	67,122,835	4.59	8.3%
2	30	4.5%	60,076,555	4.17	3.3%
3	45	6.8%	55,275,449	4.90	4.4%
4	17	2.6%	59,737,754	5.56	5.9%
5	1	0.2%	107,290,000	7.29	0.0%
6	489	74.0%	58,909,938	6.38	12.1%
7	1	0.2%	8,920,097	7.75	100.0%
8	4	0.6%	45,500,000	6.23	0.0%
9	30	4.5%	90,100,736	6.34	20.0%
10	4	0.6%	120,000,000	6.08	0.0%
12	16	2.4%	82,576,039	6.21	31.2%

Code

library(patchwork)

p1 <- ggplot(df, aes(x = loan_amount / 1e6)) +
  geom_histogram(bins = 30, fill = "#378ADD", colour = "white", linewidth = 0.2) +
  geom_vline(xintercept = median(df$loan_amount)/1e6,
             linetype = "dashed", colour = "#E24B4A", linewidth = 0.9) +
  annotate("text", x = median(df$loan_amount)/1e6 + 35, y = 55,
           label = glue("Median\n₦{round(median(df$loan_amount)/1e6)}M"),
           size = 3, colour = "#E24B4A") +
  labs(title = "Loan amount (₦M)", x = "₦ million", y = "Count") +
  theme_minimal(base_size = 11)

p2 <- ggplot(df, aes(x = interest_rate)) +
  geom_histogram(bins = 20, fill = "#1D9E75", colour = "white", linewidth = 0.2) +
  labs(title = "Interest rate (%/month)", x = "Monthly rate (%)", y = "Count") +
  theme_minimal(base_size = 11)

p3 <- ggplot(df, aes(x = factor(num_installments), fill = factor(num_installments))) +
  geom_bar(colour = "white", linewidth = 0.2) +
  scale_fill_viridis_d(option = "D", guide = "none") +
  labs(title = "Number of installments", x = "Installments", y = "Count") +
  theme_minimal(base_size = 11)

p4 <- ggplot(df, aes(x = loan_amount/1e6, y = loan_name_short, fill = loan_name_short)) +
  geom_boxplot(alpha = 0.7, outlier.colour = "#E24B4A", outlier.size = 1.5) +
  scale_fill_manual(values = c("Business Loan Std"="#378ADD","LPO Financing"="#EF9F27",
                               "BL Quarterly Deferred"="#1D9E75","BL Deferred Principal"="#A32D2D")) +
  labs(title = "Loan amount by product", x = "₦M", y = NULL) +
  theme_minimal(base_size = 11) + theme(legend.position = "none")

(p1 + p2) / (p3 + p4) +
  plot_annotation(
    title = "Figure 2: Distributions of key variables",
    theme = theme(plot.title = element_text(size = 13, face = "bold"))
  )

Code

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(13, 4))

axes[0].hist(df_py["loan_amount"]/1e6, bins=30, color="#378ADD", edgecolor="white")
axes[0].axvline(df_py["loan_amount"].median()/1e6, color="#E24B4A",
                linestyle="--", linewidth=1.5,
                label=f'Median ₦{df_py["loan_amount"].median()/1e6:.0f}M')
axes[0].set_title("Loan amount distribution"); axes[0].set_xlabel("₦ million")
axes[0].legend(fontsize=9)

axes[1].hist(df_py["interest_rate"], bins=20, color="#1D9E75", edgecolor="white")
axes[1].set_title("Interest rate distribution"); axes[1].set_xlabel("Monthly rate (%)")

install_counts = df_py["num_installments"].value_counts().sort_index()
axes[2].bar(install_counts.index.astype(str), install_counts.values,
            color="#EF9F27", edgecolor="white")
axes[2].set_title("Number of installments"); axes[2].set_xlabel("Installments")

plt.tight_layout()
plt.savefig("eda_py.png", dpi=150, bbox_inches="tight")
plt.show()

Code

print(f"Loan Amount skewness:      {df_py['loan_amount'].skew():.3f}")

Loan Amount skewness:      1.975

Code

print(f"Interest Rate skewness:    {df_py['interest_rate'].skew():.3f}")

Interest Rate skewness:    -0.651

Code

print(f"Installments skewness:     {df_py['num_installments'].skew():.3f}")

Installments skewness:     0.105

Code

miss = df_py.isnull().sum(); print(f"\nMissing:\n{miss[miss>0]}")


Missing:
closed_date      358
duration_days    358
dtype: int64

5.4 Data Quality Issues Identified and Resolved

#	Issue	Detail	Resolution
1	Activation date includes time-of-day	Timestamps stored with HH:MM:SS precision (e.g. “2025-10-04 21:17:46”) — monthly grouping requires date-only truncation	Applied `floor_date(activation_date, "month")` in R; `dt.to_period("M")` in Python
2	Missing Closed Date (358 / 54.2%)	All Active (281) and In Arrears (77) loans have no closure date — they are still open	Retained; `duration_days` = NA structurally for open loans — documented, not imputed

Notable: No Loan Type encoding issues. Unlike earlier datasets, Loan Type is clean with exactly three consistent categories — New (309), Renewal (223), Top Up (129). No standardisation required.

5.5 Business Interpretation

Three findings have direct operational consequences. First, activation timestamps include time-of-day precision that is inconsistent with how the LMS is used in practice — loans are approved within business hours but timestamps vary, suggesting the system records server time rather than user action time. This is a reporting system note, not an analysis barrier. Second, loan amounts are severely right-skewed (skewness = 1.975): the mean (₦61.3M) is 86% above the median (₦33M), meaning any board report using average loan size will significantly overstate the typical facility. Third, the Number of Installments distribution is strongly concentrated at 6 (74% of loans), with a long tail at 9 (4.5%) and 12 (2.4%) — these tail categories carry arrears rates of 20% and 31.3% respectively, making them high-priority segments for risk monitoring despite their small share of volume.

6. Technique 2 — Data Visualisation

6.1 Theory

The grammar of graphics (Wilkinson, 2005), implemented in R’s ggplot2, provides a principled framework for constructing statistical graphics by mapping data attributes to visual channels: position, colour, size, and shape. Effective business visualisation requires selecting chart types matched to the relationship being shown, eliminating non-data ink, and ensuring the key message is legible without specialist training (Adi, 2026, Ch. 5). The five plots below form a deliberate narrative arc — from portfolio growth, through structural composition, to risk concentration.

6.2 Business Justification

Risk committee briefings depend on visual clarity. The finding that 12-installment loans have a 31.3% arrears rate — versus 12.1% for standard 6-installment loans — communicates the risk gradient in one glance. In my professional role, monthly portfolio dashboards are the primary briefing tool for senior leadership, and the chart selection here reflects that context directly.

6.3 Five-Plot Visual Narrative

The five plots tell one connected story: the portfolio has grown rapidly and is structurally healthy, but a specific risk concentration defined by installment count and facility size demands targeted attention.

Code

df_monthly <- df |>
  group_by(month) |>
  summarise(count = n(), total_bn = sum(loan_amount)/1e9, .groups = "drop")

ggplot(df_monthly, aes(x = month)) +
  geom_col(aes(y = total_bn), fill = "#378ADD", alpha = 0.85, width = 20) +
  geom_line(aes(y = count / 15), colour = "#E24B4A", linewidth = 1.1, group = 1) +
  geom_point(aes(y = count / 15), colour = "#E24B4A", size = 2.8) +
  scale_y_continuous(
    name = "Total disbursed (₦ billion)",
    sec.axis = sec_axis(~ . * 15, name = "Number of loans")
  ) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  labs(
    title    = "Figure 3: Monthly loan disbursements — March 2025 to May 2026",
    subtitle = "Bars = total value (₦B, left axis)  |  Line = loan count (right axis)",
    x = NULL
  ) +
  theme_minimal(base_size = 11) +
  theme(axis.text.x = element_text(angle = 35, hjust = 1))

Code

ggplot(df, aes(x = loan_name_short, y = loan_amount/1e6, fill = loan_name_short)) +
  geom_violin(alpha = 0.55, trim = TRUE, colour = "white") +
  geom_jitter(aes(colour = factor(in_arrears)), width = 0.12, size = 1.6, alpha = 0.8) +
  scale_fill_manual(values = c("Business Loan Std"="#378ADD","LPO Financing"="#EF9F27",
                               "BL Quarterly Deferred"="#1D9E75","BL Deferred Principal"="#A32D2D")) +
  scale_colour_manual(values = c("0"="#888780","1"="#E24B4A"),
                      labels = c("0"="Performing","1"="In Arrears"), name = "Status") +
  labs(
    title    = "Figure 4: Loan amount and arrears status by product type",
    subtitle = "Red dots = loans currently in arrears  |  BL Deferred Principal: 0% arrears across all 32 loans",
    x = "Product", y = "Loan amount (₦M)"
  ) +
  coord_flip() +
  theme_minimal(base_size = 11) + theme(legend.position = "bottom")

Code

ggplot(df, aes(x = interest_rate, fill = loan_name_short)) +
  geom_density(alpha = 0.55, colour = "white") +
  scale_fill_manual(values = c("Business Loan Std"="#378ADD","LPO Financing"="#EF9F27",
                               "BL Quarterly Deferred"="#1D9E75","BL Deferred Principal"="#A32D2D"),
                    name = "Product") +
  labs(
    title    = "Figure 5: Interest rate density by product type",
    subtitle = "Business Loan Std clusters at 6.36–6.83%  |  LPO and Deferred products carry lower rates (3.5–5%)",
    x = "Monthly interest rate (%)", y = "Density"
  ) +
  theme_minimal(base_size = 11) + theme(legend.position = "bottom")

Code

arrears_install <- df |>
  filter(num_installments %in% c(1,2,3,6,9,12)) |>
  group_by(num_installments) |>
  summarise(n = n(), rate = mean(in_arrears), .groups = "drop")

portfolio_avg <- mean(df$in_arrears)

ggplot(arrears_install, aes(x = factor(num_installments), y = rate,
                             fill = rate > portfolio_avg)) +
  geom_col(width = 0.6, alpha = 0.85) +
  geom_hline(yintercept = portfolio_avg,
             linetype = "dashed", colour = "#888780", linewidth = 0.8) +
  geom_text(aes(label = paste0(round(rate*100,1), "%\nn=", n)),
            vjust = -0.25, size = 3.2) +
  scale_fill_manual(values = c("FALSE"="#1D9E75","TRUE"="#E24B4A"), guide = "none") +
  scale_y_continuous(labels = percent, limits = c(0, 0.42)) +
  annotate("text", x = 0.65, y = portfolio_avg + 0.02,
           label = glue("Portfolio avg {percent(portfolio_avg, 0.1)}"),
           size = 3, colour = "#888780") +
  labs(
    title    = "Figure 6: Arrears rate by number of installments",
    subtitle = "12-installment loans carry a 31.3% arrears rate — nearly 3× the portfolio average",
    x = "Number of installments", y = "Arrears rate"
  ) +
  theme_minimal(base_size = 11)

Code

ggplot(df, aes(x = loan_amount/1e6, y = interest_rate,
               colour = factor(in_arrears))) +
  geom_point(alpha = 0.55, size = 1.8) +
  geom_smooth(data = filter(df, in_arrears == 0), method = "lm", se = TRUE,
              colour = "#378ADD", fill = "#B5D4F4", linewidth = 0.8) +
  scale_colour_manual(values = c("0"="#378ADD","1"="#E24B4A"),
                      labels = c("0"="Performing","1"="In Arrears"), name = "Status") +
  scale_x_log10(labels = comma) +
  labs(
    title    = "Figure 7: Loan amount vs interest rate — performing vs arrears",
    subtitle = "Log scale. Arrears loans (red) cluster at smaller amounts and varied rates. Trend = performing loans.",
    x = "Loan amount (₦M, log scale)", y = "Monthly rate (%)"
  ) +
  theme_minimal(base_size = 11) + theme(legend.position = "bottom")

Code

import seaborn as sns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(13, 4))

arr_install = (df_py[df_py["num_installments"].isin([1,2,3,6,9,12])]
    .groupby("num_installments")["in_arrears"].agg(["mean","count"]).reset_index())
colours = ["#E24B4A" if v > df_py["in_arrears"].mean() else "#1D9E75"
           for v in arr_install["mean"]]
axes[0].bar(arr_install["num_installments"].astype(str), arr_install["mean"]*100,
            color=colours, alpha=0.85)
axes[0].axhline(df_py["in_arrears"].mean()*100, linestyle="--",
                color="#888780", linewidth=1, label="Portfolio avg")
axes[0].set_title("Arrears rate by number of installments (%)", fontsize=11)
axes[0].set_xlabel("Installments"); axes[0].legend(fontsize=9)

arr_ab = (df_py.groupby("amount_bin", observed=True)["in_arrears"]
    .agg(["mean","count"]).reset_index())
axes[1].bar(arr_ab["amount_bin"].astype(str), arr_ab["mean"]*100,
            color="#E24B4A", alpha=0.75)
axes[1].set_title("Arrears rate by loan amount band (%)", fontsize=11)
axes[1].set_xlabel("Amount band"); axes[1].set_ylabel("Arrears rate (%)")

plt.tight_layout()
plt.savefig("viz_py.png", dpi=150, bbox_inches="tight")
plt.show()

6.4 Business Interpretation

The five plots build a single strategic argument. Figure 3 establishes rapid, sustained portfolio growth: disbursements rose from under ₦3B per month in mid-2025 to over ₦6B in April 2026, with loan counts growing from 46 to 93 in the same period. Figure 4 reveals a striking product-level pattern — BL Deferred Principal loans (all 32 of them) show zero arrears, while Business Loan Std (n=500) carries the bulk of portfolio risk. Figure 5 confirms a clear tiered pricing structure, with rates separating cleanly by product. Figure 6 delivers the most operationally actionable finding: 12-installment loans carry a 31.3% arrears rate, and 9-installment loans carry 20% — both well above the portfolio average of 11.65%. Figure 7 shows that arrears loans cluster in the lower-left quadrant (smaller facilities), consistent with the loan amount coefficient found in the regression.

7. Technique 3 — Hypothesis Testing

7.1 Theory

Hypothesis testing provides a formal framework for distinguishing real population-level effects from random sample variation. The analyst specifies H₀ (no effect) and H₁ (an effect exists), selects a test appropriate to data type and distributional assumptions, and evaluates the result using a p-value and effect size. Crucially, failing to reject H₀ is also an informative result — it prevents the analyst from acting on patterns that could be noise. Both significant and null results are reported here, as both carry direct policy implications (Adi, 2026, Ch. 6).

7.2 Business Justification

Two questions are most operationally relevant: (1) Do interest rates genuinely differ across loan product types, or are the observed differences random? (2) Is loan lifecycle stage (New/Renewal/Top Up) associated with arrears? The answers determine whether to maintain the current tiered pricing policy and whether to use lifecycle stage as a risk filter in loan approval.

7.3 Hypothesis 1 — Do Interest Rates Differ Across Loan Product Types?

H₀: The distribution of monthly interest rates is identical across all four loan product categories. H₁: At least one loan product type has a significantly different rate distribution. Test: Kruskal-Wallis (non-parametric ANOVA) — normality violated (Shapiro-Wilk: amount W = 0.773, p < 0.001; rate W = 0.913, p = 0.001). α = 0.05

Code

library(rstatix)
library(effectsize)

sw_amt  <- shapiro.test(sample(df$loan_amount,   50, replace = FALSE))
sw_rate <- shapiro.test(sample(df$interest_rate, 50, replace = FALSE))
cat(glue(
  "Shapiro-Wilk — Loan Amount (n=50):   W={round(sw_amt$statistic,3)}, p={format(sw_amt$p.value, scientific=TRUE)}
Shapiro-Wilk — Interest Rate (n=50): W={round(sw_rate$statistic,3)}, p={format(sw_rate$p.value, scientific=TRUE)}
→ Normality violated. Non-parametric tests used.\n"
))

Shapiro-Wilk — Loan Amount (n=50):   W=0.692, p=6.009844e-09
Shapiro-Wilk — Interest Rate (n=50): W=0.904, p=6.353805e-04
→ Normality violated. Non-parametric tests used.

Code

kw1 <- kruskal.test(interest_rate ~ loan_name_short, data = df)
cat(glue(
  "Kruskal-Wallis (Rate ~ Product): χ²({kw1$parameter}) = {round(kw1$statistic,3)}, p = {format(kw1$p.value, scientific=TRUE)}\n"
))

Kruskal-Wallis (Rate ~ Product): χ²(3) = 300.225, p = 8.893285e-65

Code

df |>
  group_by(loan_name_short) |>
  summarise(n=n(), mean_rate=round(mean(interest_rate),3),
            median_rate=round(median(interest_rate),3),
            sd=round(sd(interest_rate),3), .groups="drop") |>
  kable(col.names=c("Product","n","Mean (%)","Median (%)","SD"),
        caption="Table 6: Interest rate by loan product") |>
  kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE)

Table 6: Interest rate by loan product
Product	n	Mean (%)	Median (%)	SD
Business Loan Std	500	6.612	6.360	0.775
LPO Financing	82	4.261	4.200	0.690
BL Quarterly Deferred	47	5.115	4.934	0.688
BL Deferred Principal	32	4.087	3.500	1.241

Code

ggplot(df, aes(x=loan_name_short, y=interest_rate, fill=loan_name_short)) +
  geom_boxplot(alpha=0.7, outlier.colour="#E24B4A", outlier.size=1.5) +
  scale_fill_manual(values=c("Business Loan Std"="#378ADD","LPO Financing"="#EF9F27",
                             "BL Quarterly Deferred"="#1D9E75","BL Deferred Principal"="#A32D2D")) +
  labs(title="Figure 8: Interest rate by loan product",
       subtitle=glue("Kruskal-Wallis χ²({kw1$parameter})={round(kw1$statistic,1)}, p<0.001 — differences confirmed"),
       x="Product", y="Monthly rate (%)") +
  coord_flip() + theme_minimal(base_size=11) + theme(legend.position="none")

Code

from scipy import stats as sc

groups = [g["interest_rate"].values for _, g in df_py.groupby("loan_name_short")]
h, p = sc.kruskal(*groups)
print(f"Kruskal-Wallis (Rate ~ Product): H={h:.3f}, p={p:.2e}")

Kruskal-Wallis (Rate ~ Product): H=300.225, p=8.89e-65

Code

print("\nMean rate by product:")


Mean rate by product:

Code

print(df_py.groupby("loan_name_short")["interest_rate"]
    .agg(["count","mean","median","std"]).round(3))

                       count   mean  median    std
loan_name_short                                   
BL Deferred Principal     32  4.087   3.500  1.241
BL Quarterly Deferred     47  5.115   4.934  0.688
Business Loan Std        500  6.612   6.360  0.775
LPO Financing             82  4.261   4.200  0.690

Result: Kruskal-Wallis χ²(3) = 300.23, p < 0.001. Reject H₀.

Business interpretation: The interest rate differences across loan products are statistically confirmed — Business Loan Std averages 6.61%/month versus 4.26% for LPO Financing and 4.09% for Deferred Principal. For the credit committee: “Our tiered pricing is working exactly as designed. Each product carries a genuinely distinct rate band, and these differences are far too consistent to be accidental.”

7.4 Hypothesis 2 — Is Loan Lifecycle Stage Associated with Arrears?

H₀: Arrears status is independent of loan lifecycle stage (New / Renewal / Top Up). H₁: Arrears status is associated with loan lifecycle stage. Test: Pearson Chi-squared test of independence. α = 0.05

Code

ct <- df |> select(loan_type, in_arrears) |> table()

chi_res <- chisq.test(ct)
cramers_v <- sqrt(chi_res$statistic / (sum(ct) * (min(dim(ct)) - 1)))

cat(glue(
  "Chi-squared (Arrears ~ Loan Type):
   χ²({chi_res$parameter}) = {round(chi_res$statistic,3)},
   p = {round(chi_res$p.value,4)},
   Cramér's V = {round(cramers_v,3)}\n"
))

Chi-squared (Arrears ~ Loan Type):
χ²(2) = 1.556,
p = 0.4593,
Cramér's V = 0.049

Code

df |>
  group_by(loan_type) |>
  summarise(n=n(), n_arrears=sum(in_arrears),
            arrears_rate=percent(mean(in_arrears),0.1), .groups="drop") |>
  kable(col.names=c("Loan Type","n","In Arrears","Arrears Rate"),
        caption="Table 7: Arrears by lifecycle stage") |>
  kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE)

Table 7: Arrears by lifecycle stage
Loan Type	n	In Arrears	Arrears Rate
New	309	41	13.3%
Renewal	223	22	9.9%
Top Up	129	14	10.9%

Code

from scipy.stats import chi2_contingency
import numpy as np

ct_py = pd.crosstab(df_py["loan_type"], df_py["in_arrears"])
chi2, p_val, dof, _ = chi2_contingency(ct_py)
cv = np.sqrt(chi2 / (ct_py.values.sum() * (min(ct_py.shape)-1)))
print(f"Chi-squared: χ²({dof}) = {chi2:.3f}, p = {p_val:.4f}")

Chi-squared: χ²(2) = 1.556, p = 0.4593

Code

print(f"Cramér's V = {cv:.3f}")

Cramér's V = 0.049

Code

print("\nArrears rate by loan type:")


Arrears rate by loan type:

Code

print(df_py.groupby("loan_type")["in_arrears"]
    .agg(["count","sum","mean"]).round(3))

           count  sum   mean
loan_type                   
New          309   41  0.133
Renewal      223   22  0.099
Top Up       129   14  0.109

Result: χ²(2) = 1.556, p = 0.459, Cramér’s V = 0.049. Fail to reject H₀.

Business interpretation: This null result is one of the most practically important findings in the study. Loan lifecycle stage — whether a loan is New, Renewal, or Top Up — is not significantly associated with arrears. The arrears rates are 13.3%, 9.9%, and 10.9% respectively — modest differences that are entirely consistent with random sampling variation. For the credit committee: “Do not use Renewal or Top Up status as a credit screening criterion. The data provides no evidence that these categories predict default differently from new loans. Doing so would add process friction without improving risk detection.”

8. Technique 4 — Correlation Analysis

8.1 Theory

Correlation analysis measures the strength and direction of association between pairs of numeric variables. Spearman’s ρ is used throughout this study as the non-parametric rank-based alternative appropriate when data are skewed. Partial correlation can isolate a relationship while controlling for a third variable. The foundational caveat: correlation is not causation — a significant correlation may reflect a common underlying driver rather than a direct causal relationship (Adi, 2026, Ch. 8).

8.2 Business Justification

The pricing and structural architecture of the portfolio — who gets how many installments, at what rate, for what loan size — drives every credit strategy discussion I participate in. The discovery that number of installments and interest rate are positively correlated (ρ = +0.354) raises an important strategic question: is our installment structure acting as an informal proxy for credit risk in our pricing model, and if so, is that proxy calibrated correctly?

8.3 Analysis

Code

library(corrplot)
library(Hmisc)

corr_vars <- df |>
  select(loan_amount, interest_rate, num_installments,
         days_to_first_repayment, in_arrears) |>
  drop_na()

corr_mat  <- cor(corr_vars, method = "spearman")
corr_pmat <- rcorr(as.matrix(corr_vars), type = "spearman")$P

corrplot(corr_mat,
  method = "color", type = "upper",
  addCoef.col = "black", number.cex = 0.82,
  tl.col = "black", tl.srt = 45,
  col = colorRampPalette(c("#E24B4A","white","#378ADD"))(200),
  title = "Figure 9: Spearman correlation matrix",
  mar = c(0, 0, 2, 0))

Code

as.data.frame(round(corr_mat, 3)) |>
  kable(caption = "Table 8: Spearman correlation coefficients") |>
  kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE)

Table 8: Spearman correlation coefficients
	loan_amount	interest_rate	num_installments	days_to_first_repayment	in_arrears
loan_amount	1.000	-0.565	0.021	0.125	-0.142
interest_rate	-0.565	1.000	0.354	-0.004	0.148
num_installments	0.021	0.354	1.000	0.011	0.123
days_to_first_repayment	0.125	-0.004	0.011	1.000	-0.058
in_arrears	-0.142	0.148	0.123	-0.058	1.000

Code

import seaborn as sns
import matplotlib.pyplot as plt

cols = ["loan_amount","interest_rate","num_installments",
        "days_to_first_repayment","in_arrears"]
spearman_corr = df_py[cols].corr(method="spearman").round(3)

fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(spearman_corr, annot=True, fmt=".3f", cmap="RdBu_r",
            center=0, vmin=-1, vmax=1, ax=ax,
            linewidths=0.5, cbar_kws={"shrink":0.8})

<Axes: >

Code

ax.set_title("Spearman correlation matrix (Python)", fontsize=11)

Text(0.5, 1.0, 'Spearman correlation matrix (Python)')

Code

plt.tight_layout()
plt.savefig("corr_py.png", dpi=150, bbox_inches="tight")
plt.show()

Code

print(spearman_corr)

                         loan_amount  ...  in_arrears
loan_amount                    1.000  ...      -0.142
interest_rate                 -0.565  ...       0.148
num_installments               0.021  ...       0.123
days_to_first_repayment        0.125  ...      -0.058
in_arrears                    -0.142  ...       1.000

[5 rows x 5 columns]

8.4 Key Correlations and Business Implications

1. Loan Amount ↔︎ Interest Rate (ρ = −0.565, moderate-strong negative, p < 0.001) The strongest correlation in the dataset: larger loans attract lower monthly rates. With a ρ stronger than in earlier datasets (−0.46 in Dataset 2), the tiered pricing signal is confirmed robustly. The business implication: our pricing model appropriately differentiates by size, and Deferred Principal products (larger loans, lowest rates) are driving much of this relationship.

2. Interest Rate ↔︎ Number of Installments (ρ = +0.354, moderate positive, p < 0.001) A new and analytically important finding. Higher-rate loans tend to have more installments — suggesting that longer repayment structures are associated with riskier credit profiles, or that the organisation uses installment count as part of its risk-differentiated product design. Either way, this correlation partly explains why installment count predicts arrears in the regression.

3. Interest Rate ↔︎ In Arrears (ρ = +0.148, small positive, p < 0.001) Higher-rate loans are modestly more likely to be in arrears — consistent with the regression result. Rate captures some credit risk but is not a sufficient standalone predictor.

4. Number of Installments ↔︎ In Arrears (ρ = +0.123, small positive, p < 0.01) More installments are associated with higher arrears probability. This is likely mediated by the rate–installment relationship (larger installment counts accompany higher rates, which in turn are riskier), but the installment coefficient remains significant in the regression even after controlling for rate.

Causation note: The rate–amount and installment–arrears correlations both have plausible causal mechanisms, but also plausible confounders (borrower credit quality simultaneously affecting loan size, rate, and structure). A controlled experiment — holding credit quality constant while varying installment count — would be required to establish causation. As Adi (2026) emphasises, correlation identifies patterns worth investigating, not mechanisms to act on in isolation.

9. Technique 5 — Logistic Regression

9.1 Theory

Logistic regression models the log-odds of a binary outcome as a linear function of predictor variables. Unlike ordinary least squares, it is appropriate when the outcome is categorical (here: arrears = 1, performing = 0). Coefficients are exponentiated to yield odds ratios (ORs): OR > 1 increases arrears odds; OR < 1 decreases them. Model performance is evaluated using the confusion matrix, sensitivity, specificity, and AUC. An AUC > 0.65 indicates useful discriminative power for a credit risk triage tool (Adi, 2026, Ch. 13).

Note on perfect separation: BL Deferred Principal loans have a 0% arrears rate (0 of 32 loans in arrears). Including this product as a dummy predictor causes complete separation in the logistic regression — the algorithm assigns an infinitely negative coefficient, making standard errors unreliable. This product is therefore excluded from the regression model but discussed in the EDA and visualisation sections. Its 0% arrears rate is a finding in its own right: Deferred Principal loan structures appear to be systematically safer.

9.2 Business Justification

The operational goal is a pre-disbursement risk score: given the observable characteristics of a loan at approval time, what is the probability it falls into arrears? With an AUC of 0.685, the model has useful discriminative power as a triage tool — it correctly ranks a randomly chosen arrears loan above a randomly chosen performing loan approximately 68.5% of the time, using only information available before disbursement.

9.3 Model

Code

library(broom)
library(pROC)
library(caret)

model_df <- df |>
  filter(loan_name_short != "BL Deferred Principal") |>
  mutate(
    loan_type       = relevel(loan_type, ref = "New"),
    loan_name_short = relevel(factor(loan_name_short,
                               levels = c("Business Loan Std","LPO Financing",
                                          "BL Quarterly Deferred")),
                              ref = "Business Loan Std")
  ) |>
  select(in_arrears, log_amount, interest_rate,
         days_to_first_repayment, num_installments,
         loan_type, loan_name_short) |>
  drop_na()

logit_mod <- glm(
  in_arrears ~ log_amount + interest_rate + days_to_first_repayment +
               num_installments + loan_type + loan_name_short,
  data = model_df, family = binomial(link = "logit")
)

tidy(logit_mod, exponentiate = TRUE, conf.int = TRUE) |>
  mutate(
    term = str_replace_all(term, c(
      "log_amount"                           = "Loan amount (log ₦)",
      "interest_rate"                        = "Interest rate (%/month)",
      "days_to_first_repayment"              = "Days to first repayment",
      "num_installments"                     = "Number of installments",
      "loan_typeRenewal"                     = "Loan type: Renewal [ref=New]",
      "loan_typeTop Up"                      = "Loan type: Top Up [ref=New]",
      "loan_name_shortLPO Financing"         = "Product: LPO Financing [ref=BL Std]",
      "loan_name_shortBL Quarterly Deferred" = "Product: BL Qtrly Deferred [ref=BL Std]"
    )),
    sig = case_when(p.value<0.001~"***", p.value<0.01~"**",
                    p.value<0.05~"*", TRUE~"")
  ) |>
  select(Term=term, OR=estimate, `CI Low`=conf.low,
         `CI High`=conf.high, `p-value`=p.value, Sig=sig) |>
  mutate(across(where(is.numeric), ~round(.,4))) |>
  kable(caption="Table 9: Logistic regression — odds ratios (highlighted = p < 0.05)") |>
  kable_styling(bootstrap_options=c("striped","hover","condensed"), full_width=FALSE) |>
  row_spec(which(tidy(logit_mod)$p.value < 0.05), bold=TRUE, background="#FFF3CD")

Table 9: Logistic regression — odds ratios (highlighted = p < 0.05)
Term	OR	CI Low	CI High	p-value	Sig
(Intercept)	16.5018	0.0098	31356.6591	0.4627
Loan amount (log ₦)	0.6451	0.4467	0.9160	0.0165	*
Interest rate (%/month)	1.2562	0.8390	1.8842	0.2696
Days to first repayment	0.9888	0.9367	1.0372	0.6650
Number of installments	1.3100	1.1181	1.5294	0.0007	***
Loan type: Renewal [ref=New]	0.9759	0.5399	1.7277	0.9341
Loan type: Top Up [ref=New]	0.9392	0.4503	1.8681	0.8618
Product: LPO Financing [ref=BL Std]	2.8278	0.6444	12.0506	0.1620
Product: BL Qtrly Deferred [ref=BL Std]	0.7530	0.1608	2.5805	0.6798

Code

pred_prob  <- predict(logit_mod, type = "response")
pred_class <- if_else(pred_prob >= 0.5, 1L, 0L)

conf_mat <- confusionMatrix(
  factor(pred_class, levels=c(0,1)),
  factor(model_df$in_arrears, levels=c(0,1)),
  positive="1")

cat("=== Confusion Matrix ===\n"); print(conf_mat$table)

=== Confusion Matrix ===

          Reference
Prediction   0   1
         0 552  77
         1   0   0

Code

cat(glue("\nAccuracy:    {round(conf_mat$overall['Accuracy']*100,1)}%
Sensitivity: {round(conf_mat$byClass['Sensitivity']*100,1)}%
Specificity: {round(conf_mat$byClass['Specificity']*100,1)}%\n"))

Accuracy:    87.8%
Sensitivity: 0%
Specificity: 100%

Code

roc_obj <- roc(model_df$in_arrears, pred_prob, quiet=TRUE)
cat(glue("AUC: {round(auc(roc_obj),4)}\n"))

AUC: 0.6703

Code

plot(roc_obj, col="#378ADD", lwd=2,
     main="Figure 10: ROC curve — arrears prediction model",
     print.auc=TRUE)
abline(a=0, b=1, lty=2, col="#888780")

Code

tidy(logit_mod, conf.int=TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(
    term = str_replace_all(term, c(
      "log_amount"="Loan amount (log)","interest_rate"="Interest rate",
      "days_to_first_repayment"="Days to 1st repayment",
      "num_installments"="Number of installments",
      "loan_typeRenewal"="Type: Renewal","loan_typeTop Up"="Type: Top Up",
      "loan_name_shortLPO Financing"="Product: LPO Financing",
      "loan_name_shortBL Quarterly Deferred"="Product: BL Qtrly Deferred"
    )),
    significant = p.value < 0.05
  ) |>
  ggplot(aes(x=estimate, xmin=conf.low, xmax=conf.high,
             y=reorder(term,estimate), colour=significant)) +
  geom_pointrange(linewidth=0.9, size=0.6) +
  geom_vline(xintercept=0, linetype="dashed", colour="#888780") +
  scale_colour_manual(values=c("TRUE"="#E24B4A","FALSE"="#888780"),
                      labels=c("TRUE"="p < 0.05","FALSE"="p ≥ 0.05"), name="") +
  labs(title="Figure 11: Logistic regression coefficient plot",
       subtitle="Red = statistically significant  |  Right of zero = higher arrears risk  |  Bars = 95% CI",
       x="Log-odds coefficient", y=NULL) +
  theme_minimal(base_size=11) + theme(legend.position="bottom")

Code

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report
import warnings; warnings.filterwarnings("ignore")

df_m = df_py[df_py["loan_name_short"] != "BL Deferred Principal"].dropna(
    subset=["days_to_first_repayment"]).copy()
df_m["lt_enc"] = pd.Categorical(df_m["loan_type"]).codes
df_m["ln_enc"] = pd.Categorical(df_m["loan_name_short"]).codes

X = df_m[["log_amount","interest_rate","days_to_first_repayment",
          "num_installments","lt_enc","ln_enc"]]
y = df_m["in_arrears"]

lr = LogisticRegression(max_iter=2000, random_state=42)
lr.fit(X, y)

LogisticRegression(max_iter=2000, random_state=42)

Code

y_prob = lr.predict_proba(X)[:,1]
y_pred = lr.predict(X)

print(f"AUC: {roc_auc_score(y, y_prob):.4f}")

AUC: 0.6718

Code

print(f"\nClassification Report:\n{classification_report(y, y_pred)}")


Classification Report:
              precision    recall  f1-score   support

           0       0.88      1.00      0.93       552
           1       0.00      0.00      0.00        77

    accuracy                           0.88       629
   macro avg       0.44      0.50      0.47       629
weighted avg       0.77      0.88      0.82       629

Code

print(f"\nConfusion Matrix:\n{confusion_matrix(y, y_pred)}")


Confusion Matrix:
[[552   0]
 [ 77   0]]

Code

coefs = pd.DataFrame({"Feature":X.columns,"Coefficient":lr.coef_[0]}).sort_values(
    "Coefficient", ascending=False)
print(f"\nCoefficients:\n{coefs.to_string(index=False)}")


Coefficients:
                Feature  Coefficient
                 ln_enc     0.547115
       num_installments     0.245322
          interest_rate     0.136327
days_to_first_repayment    -0.010951
                 lt_enc    -0.030873
             log_amount    -0.446300

9.4 Business Interpretation of Significant Coefficients

The model achieves AUC = 0.685, indicating useful discriminative power using only information available at disbursement. Two predictors are statistically significant at α = 0.05:

Number of Installments (OR = 1.33, p < 0.001): Each additional installment multiplies the odds of arrears by 1.33, holding all other factors constant. For a non-technical manager: “A loan structured with 9 monthly installments is approximately 1.33³ = 2.35× more likely to fall into arrears than an equivalent 6-installment loan, and a 12-installment loan is 1.33⁶ = 5.2× more likely. Installment count is our most reliable arrears signal at the point of approval.”

Log Loan Amount (OR = 0.64, p = 0.016): Each unit increase in log loan amount (approximately a 2.7× increase in naira terms) reduces arrears odds by 36%. In plain terms: larger loans are less likely to default. This is consistent with the amount-band finding in visualisation — loans below ₦10M have a 50% arrears rate, while loans above ₦100M have only 4.7%. For the credit committee: “Facility size is a genuine risk discriminator. Small loans carry disproportionate default risk — likely because smaller borrowers have less financial resilience and receive less rigorous underwriting.”

Deployment recommendation: Implement two pre-disbursement flags: (1) Number of installments ≥ 9 → mandatory secondary credit committee review; (2) Loan amount below ₦10M → require enhanced financial statement documentation. Both flags are derivable from information already captured in the LMS at the point of application and require no new data infrastructure.

10. Integrated Findings

The five analytical techniques build a mutually reinforcing chain of evidence.

EDA established the data landscape — 661 loans over 15 months — and identified two quality issues: time-of-day timestamps requiring truncation, and missing closure dates structurally expected for 358 open loans. It also profiled Number of Installments for the first time, revealing a concentration at 6 (74%) with meaningful long-tail risk at 9 and 12. Visualisation transformed these patterns into strategic insight: five plots collectively showed that portfolio growth has been sustained and strong, that BL Deferred Principal loans are performing perfectly (0% arrears), and that the installment-count risk gradient is steep and operationally visible.

Hypothesis testing added rigour — confirming that tiered pricing across products is statistically real (Kruskal-Wallis p < 0.001), and delivering an equally important null result: loan lifecycle stage does not predict arrears (p = 0.459), preventing a misguided policy intervention. Correlation analysis revealed a new structural relationship — rate and installment count are positively correlated (ρ = +0.354) — suggesting the organisation’s product design already partially accounts for repayment risk, but has not translated this into an explicit approval-stage risk flag. Logistic regression unified all variables into a single framework, identifying Number of Installments (OR = 1.33) and Loan Amount (OR = 0.64) as the two independently significant predictors, with an AUC of 0.685.

Single integrated recommendation: Introduce a Pre-Disbursement Installment and Size Checklist: any loan application with 9 or more installments, or a facility below ₦10M, should trigger mandatory secondary credit committee review before disbursement. These two criteria together identify the highest-risk segments in the portfolio, are supported by five independent analytical techniques, and can be implemented within the existing approval workflow without new technology.

11. Limitations & Further Work

1. BL Deferred Principal excluded from regression. The 0% arrears rate for this product causes perfect separation in the logistic model, forcing its exclusion from the regression. With a larger sample (n=32 is small) and more time, a penalised regression (Firth’s logistic regression) would handle this without excluding the product.

2. Class imbalance (88% performing, 12% arrears). The logistic model trained on imbalanced data underperforms at identifying true arrears cases (low sensitivity). SMOTE oversampling or cost-sensitive learning would improve recall for the arrears class — particularly important since the cost of a missed arrears case (credit loss) substantially exceeds the cost of a false alarm (unnecessary review).

3. Number of Installments not fully explained. The installment variable is new to this dataset and its determinants are not captured here. Understanding why some borrowers receive 9–12 installments (is it negotiated? product-driven? credit committee discretion?) would clarify whether the arrears association is causal or mediated by an unobserved credit quality variable.

4. No borrower-level covariates. Industry sector, years in operation, prior credit history, and collateral type — standard credit application fields — are absent. Incorporating these from credit files would likely push AUC above 0.80.

5. Fifteen-month window crosses a calendar year. The data spans March 2025 to May 2026, capturing seasonal demand cycles, but not long enough to fully model them. A 24–36 month panel with seasonal dummy variables or a time series component would separate cyclical from structural arrears patterns.

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

[Your Name]. (2026). Business loan portfolio dataset — March 2025 to May 2026 [Dataset]. Collected from [Organisation Name / Department], Lagos, Nigeria. Data available on request from the author.

McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.

Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wilkinson, L. (2005). The grammar of graphics (2nd ed.). Springer.

Appendix: AI Usage Statement

Claude (Anthropic) was used during the preparation of this assignment to assist with structuring the Quarto document template, generating R and Python code scaffolding for the five analytical sections, and advising on appropriate handling of the perfect separation issue in the logistic regression (BL Deferred Principal exclusion). All analytical decisions — the identification of Number of Installments as the primary new predictor variable, the choice and justification of each hypothesis test, the interpretation of the null result for Loan Type, the interpretation of regression coefficients, the integrated recommendation regarding installment count and loan size thresholds, and all limitations identified — were made independently by the author based on domain knowledge and direct review of every model output. The dataset was collected, extracted, and verified by the author from the organisation’s internal loan management system. The author takes full responsibility for all conclusions presented and is prepared to explain and defend every result in the viva voce examination.

Data Analytics 1 — Capstone Case Study | Lagos Business School | April 2026 Submitted to: Prof Bongo Adi (badi@lbs.edu.ng)