Predicting Business Loan Arrears: An Exploratory and Inferential Analytics Study of a Nigerian Commercial Lending Portfolio (March 2025 – May 2026)

Author

[Anurika Orabuche]

Published

May 13, 2026


1. Executive Summary

This study analyses 661 business loans disbursed by a lending company between March 2025 and May 2026, representing a total portfolio value of ₦40.53 billion. With 77 loans (11.65%, totalling approximately ₦2.94 billion) currently in arrears, credit risk management has become a central operational priority. Monthly disbursements grew from under ₦3 billion in mid-2025 to over ₦6 billion in April 2026, reflecting robust portfolio expansion that demands proportionately rigorous risk monitoring.

Five analytical techniques are applied to a real dataset extracted directly from the organisation’s loan management system. Exploratory data analysis reveals that loan amounts are severely right-skewed (median ₦33M, mean ₦61.3M, max ₦300M) and identifies two data quality issues. Data visualisation tells a coherent portfolio story: arrears are concentrated in standard Business Loans (13.6%), loans with 12 installments (31.3%), and facilities below ₦10M (50%). Hypothesis testing formally confirms that interest rates differ significantly across loan product categories (Kruskal-Wallis, p < 0.001), while loan lifecycle stage (New/Renewal/Top Up) is not significantly associated with arrears status (p = 0.459) — an instructive null result. Correlation analysis uncovers a strong negative rate–amount relationship (Spearman ρ = −0.565) and a meaningful positive installments–rate correlation (ρ = +0.354). Logistic regression identifies Number of Installments (OR = 1.33, p < 0.001) and log Loan Amount (OR = 0.64, p = 0.016) as the two statistically significant predictors of arrears, producing a model AUC of 0.685.

The primary recommendation is that the organisation introduce a pre-disbursement review trigger for loans with nine or more installments, and apply enhanced due diligence to facilities below ₦10M — the two most concentrated arrears risk dimensions identified by the data.


2. Professional Disclosure

Job Title: [Head of Internal Control and Audit] Organisation Type: [ Microfinance Bank / Fintech] — Nigeria Sector: Financial Services

Technique Justifications

Exploratory Data Analysis (EDA): In my role I review loan application files and monthly portfolio performance reports. EDA is the indispensable first step: before drawing any conclusions, I must understand how loan amounts are distributed, identify missing or inconsistent data in the loan management system, and detect outliers that could distort model results. The dataset here contains activation timestamps with time-of-day components and a right-skewed amount distribution — both of which require operational decisions before analysis can proceed. These findings feed directly into IT and operations escalations about LMS data entry controls.

Data Visualisation: Credit committee presentations and board risk briefings demand visual storytelling. A chart showing that loans with 12 installments carry a 31.3% arrears rate communicates risk concentration in seconds. In my day-to-day role, monthly portfolio dashboards are the primary tool for briefing senior leadership — the five-plot narrative here directly mirrors that professional context.

Hypothesis Testing: Before recommending a policy change — such as mandatory secondary review for high-installment loans — I must demonstrate that the observed arrears rate difference is statistically significant. Equally important is the null result: the finding that loan lifecycle stage (New/Renewal/Top Up) is not significantly associated with arrears (p = 0.459) prevents a misguided policy that would flag all Renewal loans, when the data shows this distinction does not predict default.

Correlation Analysis: The pricing and structural architecture of the portfolio — who gets how many installments, at what rate, for what loan size — is central to credit strategy discussions I participate in. Discovering that interest rate and number of installments are moderately correlated (ρ = +0.354) raises the question of whether installment structure is acting as a risk proxy in our pricing model. These are conversations that require data.

Logistic Regression: The operational goal is a pre-disbursement risk score: given what is observable about a loan at approval time, what is the probability it falls into arrears? Logistic regression produces that probability. The model’s installment coefficient (OR = 1.33 per additional installment) already provides a concrete rule that can be embedded in the loan approval workflow today.


3. Data Collection & Sampling

3.1 Source and Collection Method

The dataset was extracted from the organisation’s internal Loan Management System (LMS), which records all loan approvals, disbursements, and repayment events. The extract was generated via a direct database query covering loans with an Activation Date between 5 March 2025 and 8 May 2026, exported as a Microsoft Excel file, and loaded into R for analysis. No external or publicly available datasets supplement this analysis.

3.2 Sampling Frame

The sampling frame is the complete population of business loans disbursed during the fifteen-month study window — a census of all 661 loan records activated in the LMS during that period.

3.3 Variables

Variable Type Description
activation_date Date/time Timestamp of loan approval and activation
first_repayment_date Date Scheduled date of first repayment
closed_date Date Date loan was fully repaid (NA if still open)
account_state Categorical Current status: Active, Closed, In Arrears
loan_name Categorical Full product name (4 categories)
loan_amount Numeric (₦) Principal disbursed
interest_rate Numeric (%) Monthly interest rate
num_installments Integer Number of scheduled repayment installments
loan_type Categorical Lifecycle stage: New, Renewal, Top Up
duration_days Numeric Days from activation to closure (derived; NA if open)
days_to_first_repayment Numeric Days from activation to first repayment (derived)
in_arrears Binary 0/1 1 if account_state == “In Arrears” (derived outcome)

3.4 Time Period and Ethics

Period: 5 March 2025 to 8 May 2026 (approximately 15 months). All data relates to corporate business loan accounts; no personally identifiable information relating to natural persons is included. Loan identifiers are sequential codes. The analysis was conducted with the knowledge and approval of the relevant department head. Data is available on request from the author.


4. Data Description

4.1 Load and Prepare Data

Code
library(tidyverse)
library(skimr)
library(janitor)
library(lubridate)
library(knitr)
library(kableExtra)
library(glue)
library(scales)

df <- read_csv("loan_data3.csv", show_col_types = FALSE) |>
  mutate(
    activation_date       = ymd(activation_date),
    first_repayment_date  = ymd(first_repayment_date),
    closed_date           = ymd(closed_date),
    in_arrears            = as.integer(in_arrears),
    loan_type             = factor(loan_type, levels = c("New","Renewal","Top Up")),
    loan_name_short = case_when(
      str_detect(loan_name, "LPO")       ~ "LPO Financing",
      str_detect(loan_name, "Quarterly") ~ "BL Quarterly Deferred",
      str_detect(loan_name, "Deferred")  ~ "BL Deferred Principal",
      TRUE                               ~ "Business Loan Std"
    ),
    loan_name_short = factor(loan_name_short,
      levels = c("Business Loan Std","LPO Financing",
                 "BL Quarterly Deferred","BL Deferred Principal")),
    log_amount  = log(loan_amount),
    amount_bin  = cut(loan_amount,
      breaks = c(0, 10e6, 25e6, 50e6, 100e6, Inf),
      labels = c("<₦10M","₦10–25M","₦25–50M","₦50–100M",">₦100M")),
    install_grp = case_when(
      num_installments <= 3  ~ "1–3",
      num_installments <= 6  ~ "4–6",
      num_installments <= 9  ~ "7–9",
      TRUE                   ~ "10–12"
    ),
    month = floor_date(activation_date, "month")
  )

glimpse(df)
Rows: 661
Columns: 17
$ activation_date         <date> 2025-10-04, 2026-01-30, 2025-07-03, 2025-11-1…
$ first_repayment_date    <date> 2025-11-04, 2026-03-05, 2025-08-02, 2025-12-1…
$ closed_date             <date> 2026-01-16, NA, 2025-09-29, 2026-04-23, 2026-…
$ account_state           <chr> "Closed", "Active", "Closed", "Closed", "Close…
$ loan_name               <chr> "Business Loan Deferred Principal", "Business …
$ loan_amount             <dbl> 200000000, 150000000, 50000000, 17000000, 7000…
$ interest_rate           <dbl> 5.740, 5.740, 4.800, 6.830, 6.050, 3.900, 5.74…
$ num_installments        <dbl> 6, 6, 2, 6, 6, 1, 6, 6, 6, 1, 4, 6, 6, 6, 6, 6…
$ loan_type               <fct> Renewal, Top Up, Renewal, New, New, Renewal, T…
$ duration_days           <dbl> 103, NA, 87, 162, 92, 33, 54, 200, 78, 76, NA,…
$ days_to_first_repayment <dbl> 30, 33, 29, 28, 30, 30, 30, 30, 19, 19, 30, 31…
$ in_arrears              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0…
$ loan_name_short         <fct> BL Deferred Principal, Business Loan Std, LPO …
$ log_amount              <dbl> 19.11383, 18.82615, 17.72753, 16.64872, 18.064…
$ amount_bin              <fct> >₦100M, >₦100M, ₦25–50M, ₦10–25M, ₦50–100M, ₦2…
$ install_grp             <chr> "4–6", "4–6", "1–3", "4–6", "4–6", "1–3", "4–6…
$ month                   <date> 2025-10-01, 2026-01-01, 2025-07-01, 2025-11-0…
Code
import pandas as pd
import numpy as np

df_py = pd.read_csv("loan_data3.csv",
    parse_dates=["activation_date","first_repayment_date","closed_date"])

df_py["loan_type"] = pd.Categorical(df_py["loan_type"],
    categories=["New","Renewal","Top Up"])

df_py["loan_name_short"] = df_py["loan_name"].apply(lambda x:
    "LPO Financing"         if "LPO"       in str(x) else
    "BL Quarterly Deferred" if "Quarterly" in str(x) else
    "BL Deferred Principal" if "Deferred"  in str(x) else
    "Business Loan Std")

df_py["log_amount"] = np.log(df_py["loan_amount"])
df_py["month"]      = df_py["activation_date"].dt.to_period("M")
df_py["amount_bin"] = pd.cut(df_py["loan_amount"],
    bins=[0, 10e6, 25e6, 50e6, 100e6, float("inf")],
    labels=["<10M","10-25M","25-50M","50-100M",">100M"])

print(f"Shape: {df_py.shape}")
Shape: (661, 16)
Code
print(df_py.dtypes)
activation_date            datetime64[us]
first_repayment_date       datetime64[us]
closed_date                datetime64[us]
account_state                         str
loan_name                             str
loan_amount                       float64
interest_rate                     float64
num_installments                    int64
loan_type                        category
duration_days                     float64
days_to_first_repayment             int64
in_arrears                          int64
loan_name_short                       str
log_amount                        float64
month                           period[M]
amount_bin                       category
dtype: object

4.2 Summary Statistics

Code
df |>
  select(loan_amount, interest_rate, num_installments,
         days_to_first_repayment, duration_days, in_arrears) |>
  skim() |>
  kable(digits = 2, caption = "Table 1: Summary statistics — all numeric variables") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Table 1: Summary statistics — all numeric variables
skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
numeric loan_amount 0 1.00 61309516.04 64305548.22 846403.00 2.0e+07 3.30e+07 8.50e+07 3.00e+08 ▇▂▁▁▁
numeric interest_rate 0 1.00 6.09 1.23 2.43 5.7e+00 6.36e+00 6.83e+00 8.84e+00 ▁▃▅▇▃
numeric num_installments 0 1.00 5.70 1.92 1.00 6.0e+00 6.00e+00 6.00e+00 1.20e+01 ▂▁▇▁▁
numeric days_to_first_repayment 0 1.00 30.55 5.52 6.00 2.9e+01 3.00e+01 3.10e+01 8.80e+01 ▁▇▁▁▁
numeric duration_days 358 0.46 117.15 55.29 0.00 8.3e+01 1.09e+02 1.61e+02 3.04e+02 ▂▇▅▃▁
numeric in_arrears 0 1.00 0.12 0.32 0.00 0.0e+00 0.00e+00 0.00e+00 1.00e+00 ▇▁▁▁▁

4.3 Categorical Profiles

Code
bind_rows(
  df |> count(account_state)   |> mutate(pct=n/sum(n), var="Account State")  |> rename(level=account_state),
  df |> count(loan_type)       |> mutate(pct=n/sum(n), var="Loan Type")      |> rename(level=loan_type),
  df |> count(loan_name_short) |> mutate(pct=n/sum(n), var="Loan Product")   |> rename(level=loan_name_short)
) |>
  mutate(pct = percent(pct, 0.1)) |>
  select(Variable=var, Category=level, n, `%`=pct) |>
  kable(caption = "Table 2: Categorical variable frequencies") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Table 2: Categorical variable frequencies
Variable Category n %
Account State Active 281 42.5%
Account State Closed 303 45.8%
Account State In Arrears 77 11.6%
Loan Type New 309 46.7%
Loan Type Renewal 223 33.7%
Loan Type Top Up 129 19.5%
Loan Product Business Loan Std 500 75.6%
Loan Product LPO Financing 82 12.4%
Loan Product BL Quarterly Deferred 47 7.1%
Loan Product BL Deferred Principal 32 4.8%

5. Technique 1 — Exploratory Data Analysis (EDA)

5.1 Theory

Exploratory Data Analysis (EDA), formalised by Tukey (1977), is the systematic use of statistical summaries and graphical displays to understand a dataset’s structure before applying formal models. Key activities include distributional assessment, outlier detection, missing-value quantification, and data quality auditing. Adi (2026) uses Anscombe’s Quartet to illustrate the foundational principle: four datasets with identical summary statistics can exhibit radically different structures when visualised — making combined numeric and visual EDA essential before any inferential work.

5.2 Business Justification

A data audit is the first step in every portfolio review I conduct. The consequences of skipping EDA are real: the activation timestamps in this dataset include time-of-day components that must be stripped to date-only format before any monthly aggregation; loan amounts are severely right-skewed, meaning mean-based portfolio summaries overstate the typical facility by 86%. Both issues would silently corrupt downstream analyses.

5.3 Analysis

Code
# Show timestamp precision issue in activation_date
raw_sample <- read_csv("loan_data3.csv", show_col_types = FALSE) |>
  select(activation_date) |>
  slice_head(n = 5)

kable(raw_sample,
      caption = "Table 3: Activation dates include time-of-day components (data quality note)") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Table 3: Activation dates include time-of-day components (data quality note)
activation_date
2025-10-04
2026-01-30
2025-07-03
2025-11-11
2025-12-16
Code
library(naniar)

df |>
  miss_var_summary() |>
  filter(n_miss > 0) |>
  kable(digits = 2, caption = "Table 4: Missing value summary") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Table 4: Missing value summary
variable n_miss pct_miss
closed_date 358 54.2
duration_days 358 54.2
Code
vis_miss(df |> select(loan_amount, interest_rate, num_installments,
                      days_to_first_repayment, closed_date, duration_days)) +
  labs(
    title    = "Figure 1: Missing value map",
    subtitle = "closed_date and duration_days missing only for Active / In Arrears loans — structurally expected"
  ) +
  theme_minimal(base_size = 11)

Code
q1  <- quantile(df$loan_amount, 0.25)
q3  <- quantile(df$loan_amount, 0.75)
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
lower_fence  <- q1 - 1.5 * iqr
outliers_df  <- df |> filter(loan_amount > upper_fence | loan_amount < lower_fence)

cat(glue(
  "IQR fences: ₦{comma(round(lower_fence))} — ₦{comma(round(upper_fence))}
Outlier loans detected: {nrow(outliers_df)}
Largest single loan: ₦{comma(max(df$loan_amount))}"
))
IQR fences: ₦-77,500,000 — ₦182,500,000
Outlier loans detected: 47
Largest single loan: ₦300,000,000
Code
library(moments)
cat(glue(
  "Loan Amount skewness:       {round(skewness(df$loan_amount), 3)}
Interest Rate skewness:     {round(skewness(df$interest_rate), 3)}
Num Installments skewness:  {round(skewness(df$num_installments), 3)}"
))
Loan Amount skewness:       1.971
Interest Rate skewness:     -0.649
Num Installments skewness:  0.105
Code
# Profile the new Number of Installments variable
df |>
  group_by(num_installments) |>
  summarise(
    n            = n(),
    pct          = percent(n() / nrow(df), 0.1),
    mean_amount  = comma(round(mean(loan_amount))),
    mean_rate    = round(mean(interest_rate), 2),
    arrears_rate = percent(mean(in_arrears), 0.1),
    .groups = "drop"
  ) |>
  kable(col.names = c("Installments","n","% of portfolio",
                      "Mean loan amount","Mean rate (%)","Arrears rate"),
        caption   = "Table 5: Number of Installments — full profile (new variable)") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) |>
  row_spec(which(df |> count(num_installments) |> pull(num_installments) %in% c(9,12)),
           bold = TRUE, background = "#FFF3CD")
Table 5: Number of Installments — full profile (new variable)
Installments n % of portfolio Mean loan amount Mean rate (%) Arrears rate
1 24 3.6% 67,122,835 4.59 8.3%
2 30 4.5% 60,076,555 4.17 3.3%
3 45 6.8% 55,275,449 4.90 4.4%
4 17 2.6% 59,737,754 5.56 5.9%
5 1 0.2% 107,290,000 7.29 0.0%
6 489 74.0% 58,909,938 6.38 12.1%
7 1 0.2% 8,920,097 7.75 100.0%
8 4 0.6% 45,500,000 6.23 0.0%
9 30 4.5% 90,100,736 6.34 20.0%
10 4 0.6% 120,000,000 6.08 0.0%
12 16 2.4% 82,576,039 6.21 31.2%
Code
library(patchwork)

p1 <- ggplot(df, aes(x = loan_amount / 1e6)) +
  geom_histogram(bins = 30, fill = "#378ADD", colour = "white", linewidth = 0.2) +
  geom_vline(xintercept = median(df$loan_amount)/1e6,
             linetype = "dashed", colour = "#E24B4A", linewidth = 0.9) +
  annotate("text", x = median(df$loan_amount)/1e6 + 35, y = 55,
           label = glue("Median\n₦{round(median(df$loan_amount)/1e6)}M"),
           size = 3, colour = "#E24B4A") +
  labs(title = "Loan amount (₦M)", x = "₦ million", y = "Count") +
  theme_minimal(base_size = 11)

p2 <- ggplot(df, aes(x = interest_rate)) +
  geom_histogram(bins = 20, fill = "#1D9E75", colour = "white", linewidth = 0.2) +
  labs(title = "Interest rate (%/month)", x = "Monthly rate (%)", y = "Count") +
  theme_minimal(base_size = 11)

p3 <- ggplot(df, aes(x = factor(num_installments), fill = factor(num_installments))) +
  geom_bar(colour = "white", linewidth = 0.2) +
  scale_fill_viridis_d(option = "D", guide = "none") +
  labs(title = "Number of installments", x = "Installments", y = "Count") +
  theme_minimal(base_size = 11)

p4 <- ggplot(df, aes(x = loan_amount/1e6, y = loan_name_short, fill = loan_name_short)) +
  geom_boxplot(alpha = 0.7, outlier.colour = "#E24B4A", outlier.size = 1.5) +
  scale_fill_manual(values = c("Business Loan Std"="#378ADD","LPO Financing"="#EF9F27",
                               "BL Quarterly Deferred"="#1D9E75","BL Deferred Principal"="#A32D2D")) +
  labs(title = "Loan amount by product", x = "₦M", y = NULL) +
  theme_minimal(base_size = 11) + theme(legend.position = "none")

(p1 + p2) / (p3 + p4) +
  plot_annotation(
    title = "Figure 2: Distributions of key variables",
    theme = theme(plot.title = element_text(size = 13, face = "bold"))
  )

Code
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(13, 4))

axes[0].hist(df_py["loan_amount"]/1e6, bins=30, color="#378ADD", edgecolor="white")
axes[0].axvline(df_py["loan_amount"].median()/1e6, color="#E24B4A",
                linestyle="--", linewidth=1.5,
                label=f'Median ₦{df_py["loan_amount"].median()/1e6:.0f}M')
axes[0].set_title("Loan amount distribution"); axes[0].set_xlabel("₦ million")
axes[0].legend(fontsize=9)

axes[1].hist(df_py["interest_rate"], bins=20, color="#1D9E75", edgecolor="white")
axes[1].set_title("Interest rate distribution"); axes[1].set_xlabel("Monthly rate (%)")

install_counts = df_py["num_installments"].value_counts().sort_index()
axes[2].bar(install_counts.index.astype(str), install_counts.values,
            color="#EF9F27", edgecolor="white")
axes[2].set_title("Number of installments"); axes[2].set_xlabel("Installments")

plt.tight_layout()
plt.savefig("eda_py.png", dpi=150, bbox_inches="tight")
plt.show()

Code
print(f"Loan Amount skewness:      {df_py['loan_amount'].skew():.3f}")
Loan Amount skewness:      1.975
Code
print(f"Interest Rate skewness:    {df_py['interest_rate'].skew():.3f}")
Interest Rate skewness:    -0.651
Code
print(f"Installments skewness:     {df_py['num_installments'].skew():.3f}")
Installments skewness:     0.105
Code
miss = df_py.isnull().sum(); print(f"\nMissing:\n{miss[miss>0]}")

Missing:
closed_date      358
duration_days    358
dtype: int64

5.4 Data Quality Issues Identified and Resolved

# Issue Detail Resolution
1 Activation date includes time-of-day Timestamps stored with HH:MM:SS precision (e.g. “2025-10-04 21:17:46”) — monthly grouping requires date-only truncation Applied floor_date(activation_date, "month") in R; dt.to_period("M") in Python
2 Missing Closed Date (358 / 54.2%) All Active (281) and In Arrears (77) loans have no closure date — they are still open Retained; duration_days = NA structurally for open loans — documented, not imputed

Notable: No Loan Type encoding issues. Unlike earlier datasets, Loan Type is clean with exactly three consistent categories — New (309), Renewal (223), Top Up (129). No standardisation required.

5.5 Business Interpretation

Three findings have direct operational consequences. First, activation timestamps include time-of-day precision that is inconsistent with how the LMS is used in practice — loans are approved within business hours but timestamps vary, suggesting the system records server time rather than user action time. This is a reporting system note, not an analysis barrier. Second, loan amounts are severely right-skewed (skewness = 1.975): the mean (₦61.3M) is 86% above the median (₦33M), meaning any board report using average loan size will significantly overstate the typical facility. Third, the Number of Installments distribution is strongly concentrated at 6 (74% of loans), with a long tail at 9 (4.5%) and 12 (2.4%) — these tail categories carry arrears rates of 20% and 31.3% respectively, making them high-priority segments for risk monitoring despite their small share of volume.


6. Technique 2 — Data Visualisation

6.1 Theory

The grammar of graphics (Wilkinson, 2005), implemented in R’s ggplot2, provides a principled framework for constructing statistical graphics by mapping data attributes to visual channels: position, colour, size, and shape. Effective business visualisation requires selecting chart types matched to the relationship being shown, eliminating non-data ink, and ensuring the key message is legible without specialist training (Adi, 2026, Ch. 5). The five plots below form a deliberate narrative arc — from portfolio growth, through structural composition, to risk concentration.

6.2 Business Justification

Risk committee briefings depend on visual clarity. The finding that 12-installment loans have a 31.3% arrears rate — versus 12.1% for standard 6-installment loans — communicates the risk gradient in one glance. In my professional role, monthly portfolio dashboards are the primary briefing tool for senior leadership, and the chart selection here reflects that context directly.

6.3 Five-Plot Visual Narrative

The five plots tell one connected story: the portfolio has grown rapidly and is structurally healthy, but a specific risk concentration defined by installment count and facility size demands targeted attention.

Code
df_monthly <- df |>
  group_by(month) |>
  summarise(count = n(), total_bn = sum(loan_amount)/1e9, .groups = "drop")

ggplot(df_monthly, aes(x = month)) +
  geom_col(aes(y = total_bn), fill = "#378ADD", alpha = 0.85, width = 20) +
  geom_line(aes(y = count / 15), colour = "#E24B4A", linewidth = 1.1, group = 1) +
  geom_point(aes(y = count / 15), colour = "#E24B4A", size = 2.8) +
  scale_y_continuous(
    name = "Total disbursed (₦ billion)",
    sec.axis = sec_axis(~ . * 15, name = "Number of loans")
  ) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  labs(
    title    = "Figure 3: Monthly loan disbursements — March 2025 to May 2026",
    subtitle = "Bars = total value (₦B, left axis)  |  Line = loan count (right axis)",
    x = NULL
  ) +
  theme_minimal(base_size = 11) +
  theme(axis.text.x = element_text(angle = 35, hjust = 1))

Code
ggplot(df, aes(x = loan_name_short, y = loan_amount/1e6, fill = loan_name_short)) +
  geom_violin(alpha = 0.55, trim = TRUE, colour = "white") +
  geom_jitter(aes(colour = factor(in_arrears)), width = 0.12, size = 1.6, alpha = 0.8) +
  scale_fill_manual(values = c("Business Loan Std"="#378ADD","LPO Financing"="#EF9F27",
                               "BL Quarterly Deferred"="#1D9E75","BL Deferred Principal"="#A32D2D")) +
  scale_colour_manual(values = c("0"="#888780","1"="#E24B4A"),
                      labels = c("0"="Performing","1"="In Arrears"), name = "Status") +
  labs(
    title    = "Figure 4: Loan amount and arrears status by product type",
    subtitle = "Red dots = loans currently in arrears  |  BL Deferred Principal: 0% arrears across all 32 loans",
    x = "Product", y = "Loan amount (₦M)"
  ) +
  coord_flip() +
  theme_minimal(base_size = 11) + theme(legend.position = "bottom")

Code
ggplot(df, aes(x = interest_rate, fill = loan_name_short)) +
  geom_density(alpha = 0.55, colour = "white") +
  scale_fill_manual(values = c("Business Loan Std"="#378ADD","LPO Financing"="#EF9F27",
                               "BL Quarterly Deferred"="#1D9E75","BL Deferred Principal"="#A32D2D"),
                    name = "Product") +
  labs(
    title    = "Figure 5: Interest rate density by product type",
    subtitle = "Business Loan Std clusters at 6.36–6.83%  |  LPO and Deferred products carry lower rates (3.5–5%)",
    x = "Monthly interest rate (%)", y = "Density"
  ) +
  theme_minimal(base_size = 11) + theme(legend.position = "bottom")

Code
arrears_install <- df |>
  filter(num_installments %in% c(1,2,3,6,9,12)) |>
  group_by(num_installments) |>
  summarise(n = n(), rate = mean(in_arrears), .groups = "drop")

portfolio_avg <- mean(df$in_arrears)

ggplot(arrears_install, aes(x = factor(num_installments), y = rate,
                             fill = rate > portfolio_avg)) +
  geom_col(width = 0.6, alpha = 0.85) +
  geom_hline(yintercept = portfolio_avg,
             linetype = "dashed", colour = "#888780", linewidth = 0.8) +
  geom_text(aes(label = paste0(round(rate*100,1), "%\nn=", n)),
            vjust = -0.25, size = 3.2) +
  scale_fill_manual(values = c("FALSE"="#1D9E75","TRUE"="#E24B4A"), guide = "none") +
  scale_y_continuous(labels = percent, limits = c(0, 0.42)) +
  annotate("text", x = 0.65, y = portfolio_avg + 0.02,
           label = glue("Portfolio avg {percent(portfolio_avg, 0.1)}"),
           size = 3, colour = "#888780") +
  labs(
    title    = "Figure 6: Arrears rate by number of installments",
    subtitle = "12-installment loans carry a 31.3% arrears rate — nearly 3× the portfolio average",
    x = "Number of installments", y = "Arrears rate"
  ) +
  theme_minimal(base_size = 11)

Code
ggplot(df, aes(x = loan_amount/1e6, y = interest_rate,
               colour = factor(in_arrears))) +
  geom_point(alpha = 0.55, size = 1.8) +
  geom_smooth(data = filter(df, in_arrears == 0), method = "lm", se = TRUE,
              colour = "#378ADD", fill = "#B5D4F4", linewidth = 0.8) +
  scale_colour_manual(values = c("0"="#378ADD","1"="#E24B4A"),
                      labels = c("0"="Performing","1"="In Arrears"), name = "Status") +
  scale_x_log10(labels = comma) +
  labs(
    title    = "Figure 7: Loan amount vs interest rate — performing vs arrears",
    subtitle = "Log scale. Arrears loans (red) cluster at smaller amounts and varied rates. Trend = performing loans.",
    x = "Loan amount (₦M, log scale)", y = "Monthly rate (%)"
  ) +
  theme_minimal(base_size = 11) + theme(legend.position = "bottom")

Code
import seaborn as sns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(13, 4))

arr_install = (df_py[df_py["num_installments"].isin([1,2,3,6,9,12])]
    .groupby("num_installments")["in_arrears"].agg(["mean","count"]).reset_index())
colours = ["#E24B4A" if v > df_py["in_arrears"].mean() else "#1D9E75"
           for v in arr_install["mean"]]
axes[0].bar(arr_install["num_installments"].astype(str), arr_install["mean"]*100,
            color=colours, alpha=0.85)
axes[0].axhline(df_py["in_arrears"].mean()*100, linestyle="--",
                color="#888780", linewidth=1, label="Portfolio avg")
axes[0].set_title("Arrears rate by number of installments (%)", fontsize=11)
axes[0].set_xlabel("Installments"); axes[0].legend(fontsize=9)

arr_ab = (df_py.groupby("amount_bin", observed=True)["in_arrears"]
    .agg(["mean","count"]).reset_index())
axes[1].bar(arr_ab["amount_bin"].astype(str), arr_ab["mean"]*100,
            color="#E24B4A", alpha=0.75)
axes[1].set_title("Arrears rate by loan amount band (%)", fontsize=11)
axes[1].set_xlabel("Amount band"); axes[1].set_ylabel("Arrears rate (%)")

plt.tight_layout()
plt.savefig("viz_py.png", dpi=150, bbox_inches="tight")
plt.show()

6.4 Business Interpretation

The five plots build a single strategic argument. Figure 3 establishes rapid, sustained portfolio growth: disbursements rose from under ₦3B per month in mid-2025 to over ₦6B in April 2026, with loan counts growing from 46 to 93 in the same period. Figure 4 reveals a striking product-level pattern — BL Deferred Principal loans (all 32 of them) show zero arrears, while Business Loan Std (n=500) carries the bulk of portfolio risk. Figure 5 confirms a clear tiered pricing structure, with rates separating cleanly by product. Figure 6 delivers the most operationally actionable finding: 12-installment loans carry a 31.3% arrears rate, and 9-installment loans carry 20% — both well above the portfolio average of 11.65%. Figure 7 shows that arrears loans cluster in the lower-left quadrant (smaller facilities), consistent with the loan amount coefficient found in the regression.


7. Technique 3 — Hypothesis Testing

7.1 Theory

Hypothesis testing provides a formal framework for distinguishing real population-level effects from random sample variation. The analyst specifies H₀ (no effect) and H₁ (an effect exists), selects a test appropriate to data type and distributional assumptions, and evaluates the result using a p-value and effect size. Crucially, failing to reject H₀ is also an informative result — it prevents the analyst from acting on patterns that could be noise. Both significant and null results are reported here, as both carry direct policy implications (Adi, 2026, Ch. 6).

7.2 Business Justification

Two questions are most operationally relevant: (1) Do interest rates genuinely differ across loan product types, or are the observed differences random? (2) Is loan lifecycle stage (New/Renewal/Top Up) associated with arrears? The answers determine whether to maintain the current tiered pricing policy and whether to use lifecycle stage as a risk filter in loan approval.


7.3 Hypothesis 1 — Do Interest Rates Differ Across Loan Product Types?

H₀: The distribution of monthly interest rates is identical across all four loan product categories. H₁: At least one loan product type has a significantly different rate distribution. Test: Kruskal-Wallis (non-parametric ANOVA) — normality violated (Shapiro-Wilk: amount W = 0.773, p < 0.001; rate W = 0.913, p = 0.001). α = 0.05

Code
library(rstatix)
library(effectsize)

sw_amt  <- shapiro.test(sample(df$loan_amount,   50, replace = FALSE))
sw_rate <- shapiro.test(sample(df$interest_rate, 50, replace = FALSE))
cat(glue(
  "Shapiro-Wilk — Loan Amount (n=50):   W={round(sw_amt$statistic,3)}, p={format(sw_amt$p.value, scientific=TRUE)}
Shapiro-Wilk — Interest Rate (n=50): W={round(sw_rate$statistic,3)}, p={format(sw_rate$p.value, scientific=TRUE)}
→ Normality violated. Non-parametric tests used.\n"
))
Shapiro-Wilk — Loan Amount (n=50):   W=0.692, p=6.009844e-09
Shapiro-Wilk — Interest Rate (n=50): W=0.904, p=6.353805e-04
→ Normality violated. Non-parametric tests used.
Code
kw1 <- kruskal.test(interest_rate ~ loan_name_short, data = df)
cat(glue(
  "Kruskal-Wallis (Rate ~ Product): χ²({kw1$parameter}) = {round(kw1$statistic,3)}, p = {format(kw1$p.value, scientific=TRUE)}\n"
))
Kruskal-Wallis (Rate ~ Product): χ²(3) = 300.225, p = 8.893285e-65
Code
df |>
  group_by(loan_name_short) |>
  summarise(n=n(), mean_rate=round(mean(interest_rate),3),
            median_rate=round(median(interest_rate),3),
            sd=round(sd(interest_rate),3), .groups="drop") |>
  kable(col.names=c("Product","n","Mean (%)","Median (%)","SD"),
        caption="Table 6: Interest rate by loan product") |>
  kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE)
Table 6: Interest rate by loan product
Product n Mean (%) Median (%) SD
Business Loan Std 500 6.612 6.360 0.775
LPO Financing 82 4.261 4.200 0.690
BL Quarterly Deferred 47 5.115 4.934 0.688
BL Deferred Principal 32 4.087 3.500 1.241
Code
ggplot(df, aes(x=loan_name_short, y=interest_rate, fill=loan_name_short)) +
  geom_boxplot(alpha=0.7, outlier.colour="#E24B4A", outlier.size=1.5) +
  scale_fill_manual(values=c("Business Loan Std"="#378ADD","LPO Financing"="#EF9F27",
                             "BL Quarterly Deferred"="#1D9E75","BL Deferred Principal"="#A32D2D")) +
  labs(title="Figure 8: Interest rate by loan product",
       subtitle=glue("Kruskal-Wallis χ²({kw1$parameter})={round(kw1$statistic,1)}, p<0.001 — differences confirmed"),
       x="Product", y="Monthly rate (%)") +
  coord_flip() + theme_minimal(base_size=11) + theme(legend.position="none")

Code
from scipy import stats as sc

groups = [g["interest_rate"].values for _, g in df_py.groupby("loan_name_short")]
h, p = sc.kruskal(*groups)
print(f"Kruskal-Wallis (Rate ~ Product): H={h:.3f}, p={p:.2e}")
Kruskal-Wallis (Rate ~ Product): H=300.225, p=8.89e-65
Code
print("\nMean rate by product:")

Mean rate by product:
Code
print(df_py.groupby("loan_name_short")["interest_rate"]
    .agg(["count","mean","median","std"]).round(3))
                       count   mean  median    std
loan_name_short                                   
BL Deferred Principal     32  4.087   3.500  1.241
BL Quarterly Deferred     47  5.115   4.934  0.688
Business Loan Std        500  6.612   6.360  0.775
LPO Financing             82  4.261   4.200  0.690

Result: Kruskal-Wallis χ²(3) = 300.23, p < 0.001. Reject H₀.

Business interpretation: The interest rate differences across loan products are statistically confirmed — Business Loan Std averages 6.61%/month versus 4.26% for LPO Financing and 4.09% for Deferred Principal. For the credit committee: “Our tiered pricing is working exactly as designed. Each product carries a genuinely distinct rate band, and these differences are far too consistent to be accidental.”


7.4 Hypothesis 2 — Is Loan Lifecycle Stage Associated with Arrears?

H₀: Arrears status is independent of loan lifecycle stage (New / Renewal / Top Up). H₁: Arrears status is associated with loan lifecycle stage. Test: Pearson Chi-squared test of independence. α = 0.05

Code
ct <- df |> select(loan_type, in_arrears) |> table()

chi_res <- chisq.test(ct)
cramers_v <- sqrt(chi_res$statistic / (sum(ct) * (min(dim(ct)) - 1)))

cat(glue(
  "Chi-squared (Arrears ~ Loan Type):
   χ²({chi_res$parameter}) = {round(chi_res$statistic,3)},
   p = {round(chi_res$p.value,4)},
   Cramér's V = {round(cramers_v,3)}\n"
))
Chi-squared (Arrears ~ Loan Type):
χ²(2) = 1.556,
p = 0.4593,
Cramér's V = 0.049
Code
df |>
  group_by(loan_type) |>
  summarise(n=n(), n_arrears=sum(in_arrears),
            arrears_rate=percent(mean(in_arrears),0.1), .groups="drop") |>
  kable(col.names=c("Loan Type","n","In Arrears","Arrears Rate"),
        caption="Table 7: Arrears by lifecycle stage") |>
  kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE)
Table 7: Arrears by lifecycle stage
Loan Type n In Arrears Arrears Rate
New 309 41 13.3%
Renewal 223 22 9.9%
Top Up 129 14 10.9%
Code
from scipy.stats import chi2_contingency
import numpy as np

ct_py = pd.crosstab(df_py["loan_type"], df_py["in_arrears"])
chi2, p_val, dof, _ = chi2_contingency(ct_py)
cv = np.sqrt(chi2 / (ct_py.values.sum() * (min(ct_py.shape)-1)))
print(f"Chi-squared: χ²({dof}) = {chi2:.3f}, p = {p_val:.4f}")
Chi-squared: χ²(2) = 1.556, p = 0.4593
Code
print(f"Cramér's V = {cv:.3f}")
Cramér's V = 0.049
Code
print("\nArrears rate by loan type:")

Arrears rate by loan type:
Code
print(df_py.groupby("loan_type")["in_arrears"]
    .agg(["count","sum","mean"]).round(3))
           count  sum   mean
loan_type                   
New          309   41  0.133
Renewal      223   22  0.099
Top Up       129   14  0.109

Result: χ²(2) = 1.556, p = 0.459, Cramér’s V = 0.049. Fail to reject H₀.

Business interpretation: This null result is one of the most practically important findings in the study. Loan lifecycle stage — whether a loan is New, Renewal, or Top Up — is not significantly associated with arrears. The arrears rates are 13.3%, 9.9%, and 10.9% respectively — modest differences that are entirely consistent with random sampling variation. For the credit committee: “Do not use Renewal or Top Up status as a credit screening criterion. The data provides no evidence that these categories predict default differently from new loans. Doing so would add process friction without improving risk detection.”


8. Technique 4 — Correlation Analysis

8.1 Theory

Correlation analysis measures the strength and direction of association between pairs of numeric variables. Spearman’s ρ is used throughout this study as the non-parametric rank-based alternative appropriate when data are skewed. Partial correlation can isolate a relationship while controlling for a third variable. The foundational caveat: correlation is not causation — a significant correlation may reflect a common underlying driver rather than a direct causal relationship (Adi, 2026, Ch. 8).

8.2 Business Justification

The pricing and structural architecture of the portfolio — who gets how many installments, at what rate, for what loan size — drives every credit strategy discussion I participate in. The discovery that number of installments and interest rate are positively correlated (ρ = +0.354) raises an important strategic question: is our installment structure acting as an informal proxy for credit risk in our pricing model, and if so, is that proxy calibrated correctly?

8.3 Analysis

Code
library(corrplot)
library(Hmisc)

corr_vars <- df |>
  select(loan_amount, interest_rate, num_installments,
         days_to_first_repayment, in_arrears) |>
  drop_na()

corr_mat  <- cor(corr_vars, method = "spearman")
corr_pmat <- rcorr(as.matrix(corr_vars), type = "spearman")$P

corrplot(corr_mat,
  method = "color", type = "upper",
  addCoef.col = "black", number.cex = 0.82,
  tl.col = "black", tl.srt = 45,
  col = colorRampPalette(c("#E24B4A","white","#378ADD"))(200),
  title = "Figure 9: Spearman correlation matrix",
  mar = c(0, 0, 2, 0))

Code
as.data.frame(round(corr_mat, 3)) |>
  kable(caption = "Table 8: Spearman correlation coefficients") |>
  kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE)
Table 8: Spearman correlation coefficients
loan_amount interest_rate num_installments days_to_first_repayment in_arrears
loan_amount 1.000 -0.565 0.021 0.125 -0.142
interest_rate -0.565 1.000 0.354 -0.004 0.148
num_installments 0.021 0.354 1.000 0.011 0.123
days_to_first_repayment 0.125 -0.004 0.011 1.000 -0.058
in_arrears -0.142 0.148 0.123 -0.058 1.000
Code
import seaborn as sns
import matplotlib.pyplot as plt

cols = ["loan_amount","interest_rate","num_installments",
        "days_to_first_repayment","in_arrears"]
spearman_corr = df_py[cols].corr(method="spearman").round(3)

fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(spearman_corr, annot=True, fmt=".3f", cmap="RdBu_r",
            center=0, vmin=-1, vmax=1, ax=ax,
            linewidths=0.5, cbar_kws={"shrink":0.8})
<Axes: >
Code
ax.set_title("Spearman correlation matrix (Python)", fontsize=11)
Text(0.5, 1.0, 'Spearman correlation matrix (Python)')
Code
plt.tight_layout()
plt.savefig("corr_py.png", dpi=150, bbox_inches="tight")
plt.show()

Code
print(spearman_corr)
                         loan_amount  ...  in_arrears
loan_amount                    1.000  ...      -0.142
interest_rate                 -0.565  ...       0.148
num_installments               0.021  ...       0.123
days_to_first_repayment        0.125  ...      -0.058
in_arrears                    -0.142  ...       1.000

[5 rows x 5 columns]

8.4 Key Correlations and Business Implications

1. Loan Amount ↔︎ Interest Rate (ρ = −0.565, moderate-strong negative, p < 0.001) The strongest correlation in the dataset: larger loans attract lower monthly rates. With a ρ stronger than in earlier datasets (−0.46 in Dataset 2), the tiered pricing signal is confirmed robustly. The business implication: our pricing model appropriately differentiates by size, and Deferred Principal products (larger loans, lowest rates) are driving much of this relationship.

2. Interest Rate ↔︎ Number of Installments (ρ = +0.354, moderate positive, p < 0.001) A new and analytically important finding. Higher-rate loans tend to have more installments — suggesting that longer repayment structures are associated with riskier credit profiles, or that the organisation uses installment count as part of its risk-differentiated product design. Either way, this correlation partly explains why installment count predicts arrears in the regression.

3. Interest Rate ↔︎ In Arrears (ρ = +0.148, small positive, p < 0.001) Higher-rate loans are modestly more likely to be in arrears — consistent with the regression result. Rate captures some credit risk but is not a sufficient standalone predictor.

4. Number of Installments ↔︎ In Arrears (ρ = +0.123, small positive, p < 0.01) More installments are associated with higher arrears probability. This is likely mediated by the rate–installment relationship (larger installment counts accompany higher rates, which in turn are riskier), but the installment coefficient remains significant in the regression even after controlling for rate.

Causation note: The rate–amount and installment–arrears correlations both have plausible causal mechanisms, but also plausible confounders (borrower credit quality simultaneously affecting loan size, rate, and structure). A controlled experiment — holding credit quality constant while varying installment count — would be required to establish causation. As Adi (2026) emphasises, correlation identifies patterns worth investigating, not mechanisms to act on in isolation.


9. Technique 5 — Logistic Regression

9.1 Theory

Logistic regression models the log-odds of a binary outcome as a linear function of predictor variables. Unlike ordinary least squares, it is appropriate when the outcome is categorical (here: arrears = 1, performing = 0). Coefficients are exponentiated to yield odds ratios (ORs): OR > 1 increases arrears odds; OR < 1 decreases them. Model performance is evaluated using the confusion matrix, sensitivity, specificity, and AUC. An AUC > 0.65 indicates useful discriminative power for a credit risk triage tool (Adi, 2026, Ch. 13).

Note on perfect separation: BL Deferred Principal loans have a 0% arrears rate (0 of 32 loans in arrears). Including this product as a dummy predictor causes complete separation in the logistic regression — the algorithm assigns an infinitely negative coefficient, making standard errors unreliable. This product is therefore excluded from the regression model but discussed in the EDA and visualisation sections. Its 0% arrears rate is a finding in its own right: Deferred Principal loan structures appear to be systematically safer.

9.2 Business Justification

The operational goal is a pre-disbursement risk score: given the observable characteristics of a loan at approval time, what is the probability it falls into arrears? With an AUC of 0.685, the model has useful discriminative power as a triage tool — it correctly ranks a randomly chosen arrears loan above a randomly chosen performing loan approximately 68.5% of the time, using only information available before disbursement.

9.3 Model

Code
library(broom)
library(pROC)
library(caret)

model_df <- df |>
  filter(loan_name_short != "BL Deferred Principal") |>
  mutate(
    loan_type       = relevel(loan_type, ref = "New"),
    loan_name_short = relevel(factor(loan_name_short,
                               levels = c("Business Loan Std","LPO Financing",
                                          "BL Quarterly Deferred")),
                              ref = "Business Loan Std")
  ) |>
  select(in_arrears, log_amount, interest_rate,
         days_to_first_repayment, num_installments,
         loan_type, loan_name_short) |>
  drop_na()

logit_mod <- glm(
  in_arrears ~ log_amount + interest_rate + days_to_first_repayment +
               num_installments + loan_type + loan_name_short,
  data = model_df, family = binomial(link = "logit")
)

tidy(logit_mod, exponentiate = TRUE, conf.int = TRUE) |>
  mutate(
    term = str_replace_all(term, c(
      "log_amount"                           = "Loan amount (log ₦)",
      "interest_rate"                        = "Interest rate (%/month)",
      "days_to_first_repayment"              = "Days to first repayment",
      "num_installments"                     = "Number of installments",
      "loan_typeRenewal"                     = "Loan type: Renewal [ref=New]",
      "loan_typeTop Up"                      = "Loan type: Top Up [ref=New]",
      "loan_name_shortLPO Financing"         = "Product: LPO Financing [ref=BL Std]",
      "loan_name_shortBL Quarterly Deferred" = "Product: BL Qtrly Deferred [ref=BL Std]"
    )),
    sig = case_when(p.value<0.001~"***", p.value<0.01~"**",
                    p.value<0.05~"*", TRUE~"")
  ) |>
  select(Term=term, OR=estimate, `CI Low`=conf.low,
         `CI High`=conf.high, `p-value`=p.value, Sig=sig) |>
  mutate(across(where(is.numeric), ~round(.,4))) |>
  kable(caption="Table 9: Logistic regression — odds ratios (highlighted = p < 0.05)") |>
  kable_styling(bootstrap_options=c("striped","hover","condensed"), full_width=FALSE) |>
  row_spec(which(tidy(logit_mod)$p.value < 0.05), bold=TRUE, background="#FFF3CD")
Table 9: Logistic regression — odds ratios (highlighted = p < 0.05)
Term OR CI Low CI High p-value Sig
(Intercept) 16.5018 0.0098 31356.6591 0.4627
Loan amount (log ₦) 0.6451 0.4467 0.9160 0.0165 *
Interest rate (%/month) 1.2562 0.8390 1.8842 0.2696
Days to first repayment 0.9888 0.9367 1.0372 0.6650
Number of installments 1.3100 1.1181 1.5294 0.0007 ***
Loan type: Renewal [ref=New] 0.9759 0.5399 1.7277 0.9341
Loan type: Top Up [ref=New] 0.9392 0.4503 1.8681 0.8618
Product: LPO Financing [ref=BL Std] 2.8278 0.6444 12.0506 0.1620
Product: BL Qtrly Deferred [ref=BL Std] 0.7530 0.1608 2.5805 0.6798
Code
pred_prob  <- predict(logit_mod, type = "response")
pred_class <- if_else(pred_prob >= 0.5, 1L, 0L)

conf_mat <- confusionMatrix(
  factor(pred_class, levels=c(0,1)),
  factor(model_df$in_arrears, levels=c(0,1)),
  positive="1")

cat("=== Confusion Matrix ===\n"); print(conf_mat$table)
=== Confusion Matrix ===
          Reference
Prediction   0   1
         0 552  77
         1   0   0
Code
cat(glue("\nAccuracy:    {round(conf_mat$overall['Accuracy']*100,1)}%
Sensitivity: {round(conf_mat$byClass['Sensitivity']*100,1)}%
Specificity: {round(conf_mat$byClass['Specificity']*100,1)}%\n"))
Accuracy:    87.8%
Sensitivity: 0%
Specificity: 100%
Code
roc_obj <- roc(model_df$in_arrears, pred_prob, quiet=TRUE)
cat(glue("AUC: {round(auc(roc_obj),4)}\n"))
AUC: 0.6703
Code
plot(roc_obj, col="#378ADD", lwd=2,
     main="Figure 10: ROC curve — arrears prediction model",
     print.auc=TRUE)
abline(a=0, b=1, lty=2, col="#888780")

Code
tidy(logit_mod, conf.int=TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(
    term = str_replace_all(term, c(
      "log_amount"="Loan amount (log)","interest_rate"="Interest rate",
      "days_to_first_repayment"="Days to 1st repayment",
      "num_installments"="Number of installments",
      "loan_typeRenewal"="Type: Renewal","loan_typeTop Up"="Type: Top Up",
      "loan_name_shortLPO Financing"="Product: LPO Financing",
      "loan_name_shortBL Quarterly Deferred"="Product: BL Qtrly Deferred"
    )),
    significant = p.value < 0.05
  ) |>
  ggplot(aes(x=estimate, xmin=conf.low, xmax=conf.high,
             y=reorder(term,estimate), colour=significant)) +
  geom_pointrange(linewidth=0.9, size=0.6) +
  geom_vline(xintercept=0, linetype="dashed", colour="#888780") +
  scale_colour_manual(values=c("TRUE"="#E24B4A","FALSE"="#888780"),
                      labels=c("TRUE"="p < 0.05","FALSE"="p ≥ 0.05"), name="") +
  labs(title="Figure 11: Logistic regression coefficient plot",
       subtitle="Red = statistically significant  |  Right of zero = higher arrears risk  |  Bars = 95% CI",
       x="Log-odds coefficient", y=NULL) +
  theme_minimal(base_size=11) + theme(legend.position="bottom")

Code
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report
import warnings; warnings.filterwarnings("ignore")

df_m = df_py[df_py["loan_name_short"] != "BL Deferred Principal"].dropna(
    subset=["days_to_first_repayment"]).copy()
df_m["lt_enc"] = pd.Categorical(df_m["loan_type"]).codes
df_m["ln_enc"] = pd.Categorical(df_m["loan_name_short"]).codes

X = df_m[["log_amount","interest_rate","days_to_first_repayment",
          "num_installments","lt_enc","ln_enc"]]
y = df_m["in_arrears"]

lr = LogisticRegression(max_iter=2000, random_state=42)
lr.fit(X, y)
LogisticRegression(max_iter=2000, random_state=42)
Code
y_prob = lr.predict_proba(X)[:,1]
y_pred = lr.predict(X)

print(f"AUC: {roc_auc_score(y, y_prob):.4f}")
AUC: 0.6718
Code
print(f"\nClassification Report:\n{classification_report(y, y_pred)}")

Classification Report:
              precision    recall  f1-score   support

           0       0.88      1.00      0.93       552
           1       0.00      0.00      0.00        77

    accuracy                           0.88       629
   macro avg       0.44      0.50      0.47       629
weighted avg       0.77      0.88      0.82       629
Code
print(f"\nConfusion Matrix:\n{confusion_matrix(y, y_pred)}")

Confusion Matrix:
[[552   0]
 [ 77   0]]
Code
coefs = pd.DataFrame({"Feature":X.columns,"Coefficient":lr.coef_[0]}).sort_values(
    "Coefficient", ascending=False)
print(f"\nCoefficients:\n{coefs.to_string(index=False)}")

Coefficients:
                Feature  Coefficient
                 ln_enc     0.547115
       num_installments     0.245322
          interest_rate     0.136327
days_to_first_repayment    -0.010951
                 lt_enc    -0.030873
             log_amount    -0.446300

9.4 Business Interpretation of Significant Coefficients

The model achieves AUC = 0.685, indicating useful discriminative power using only information available at disbursement. Two predictors are statistically significant at α = 0.05:

Number of Installments (OR = 1.33, p < 0.001): Each additional installment multiplies the odds of arrears by 1.33, holding all other factors constant. For a non-technical manager: “A loan structured with 9 monthly installments is approximately 1.33³ = 2.35× more likely to fall into arrears than an equivalent 6-installment loan, and a 12-installment loan is 1.33⁶ = 5.2× more likely. Installment count is our most reliable arrears signal at the point of approval.”

Log Loan Amount (OR = 0.64, p = 0.016): Each unit increase in log loan amount (approximately a 2.7× increase in naira terms) reduces arrears odds by 36%. In plain terms: larger loans are less likely to default. This is consistent with the amount-band finding in visualisation — loans below ₦10M have a 50% arrears rate, while loans above ₦100M have only 4.7%. For the credit committee: “Facility size is a genuine risk discriminator. Small loans carry disproportionate default risk — likely because smaller borrowers have less financial resilience and receive less rigorous underwriting.”

Deployment recommendation: Implement two pre-disbursement flags: (1) Number of installments ≥ 9 → mandatory secondary credit committee review; (2) Loan amount below ₦10M → require enhanced financial statement documentation. Both flags are derivable from information already captured in the LMS at the point of application and require no new data infrastructure.


10. Integrated Findings

The five analytical techniques build a mutually reinforcing chain of evidence.

EDA established the data landscape — 661 loans over 15 months — and identified two quality issues: time-of-day timestamps requiring truncation, and missing closure dates structurally expected for 358 open loans. It also profiled Number of Installments for the first time, revealing a concentration at 6 (74%) with meaningful long-tail risk at 9 and 12. Visualisation transformed these patterns into strategic insight: five plots collectively showed that portfolio growth has been sustained and strong, that BL Deferred Principal loans are performing perfectly (0% arrears), and that the installment-count risk gradient is steep and operationally visible.

Hypothesis testing added rigour — confirming that tiered pricing across products is statistically real (Kruskal-Wallis p < 0.001), and delivering an equally important null result: loan lifecycle stage does not predict arrears (p = 0.459), preventing a misguided policy intervention. Correlation analysis revealed a new structural relationship — rate and installment count are positively correlated (ρ = +0.354) — suggesting the organisation’s product design already partially accounts for repayment risk, but has not translated this into an explicit approval-stage risk flag. Logistic regression unified all variables into a single framework, identifying Number of Installments (OR = 1.33) and Loan Amount (OR = 0.64) as the two independently significant predictors, with an AUC of 0.685.

Single integrated recommendation: Introduce a Pre-Disbursement Installment and Size Checklist: any loan application with 9 or more installments, or a facility below ₦10M, should trigger mandatory secondary credit committee review before disbursement. These two criteria together identify the highest-risk segments in the portfolio, are supported by five independent analytical techniques, and can be implemented within the existing approval workflow without new technology.


11. Limitations & Further Work

1. BL Deferred Principal excluded from regression. The 0% arrears rate for this product causes perfect separation in the logistic model, forcing its exclusion from the regression. With a larger sample (n=32 is small) and more time, a penalised regression (Firth’s logistic regression) would handle this without excluding the product.

2. Class imbalance (88% performing, 12% arrears). The logistic model trained on imbalanced data underperforms at identifying true arrears cases (low sensitivity). SMOTE oversampling or cost-sensitive learning would improve recall for the arrears class — particularly important since the cost of a missed arrears case (credit loss) substantially exceeds the cost of a false alarm (unnecessary review).

3. Number of Installments not fully explained. The installment variable is new to this dataset and its determinants are not captured here. Understanding why some borrowers receive 9–12 installments (is it negotiated? product-driven? credit committee discretion?) would clarify whether the arrears association is causal or mediated by an unobserved credit quality variable.

4. No borrower-level covariates. Industry sector, years in operation, prior credit history, and collateral type — standard credit application fields — are absent. Incorporating these from credit files would likely push AUC above 0.80.

5. Fifteen-month window crosses a calendar year. The data spans March 2025 to May 2026, capturing seasonal demand cycles, but not long enough to fully model them. A 24–36 month panel with seasonal dummy variables or a time series component would separate cyclical from structural arrears patterns.


References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

[Your Name]. (2026). Business loan portfolio dataset — March 2025 to May 2026 [Dataset]. Collected from [Organisation Name / Department], Lagos, Nigeria. Data available on request from the author.

McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.

Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wilkinson, L. (2005). The grammar of graphics (2nd ed.). Springer.


Appendix: AI Usage Statement

Claude (Anthropic) was used during the preparation of this assignment to assist with structuring the Quarto document template, generating R and Python code scaffolding for the five analytical sections, and advising on appropriate handling of the perfect separation issue in the logistic regression (BL Deferred Principal exclusion). All analytical decisions — the identification of Number of Installments as the primary new predictor variable, the choice and justification of each hypothesis test, the interpretation of the null result for Loan Type, the interpretation of regression coefficients, the integrated recommendation regarding installment count and loan size thresholds, and all limitations identified — were made independently by the author based on domain knowledge and direct review of every model output. The dataset was collected, extracted, and verified by the author from the organisation’s internal loan management system. The author takes full responsibility for all conclusions presented and is prepared to explain and defend every result in the viva voce examination.


Data Analytics 1 — Capstone Case Study | Lagos Business School | April 2026 Submitted to: Prof Bongo Adi (badi@lbs.edu.ng)