Predicting Business Loan Arrears: An Exploratory and Inferential Analytics Study of a Nigerian Commercial Lending Portfolio (May 2025 – April 2026)

Author

[Anurika Orabuche]

Published

May 18, 2026

1. Executive Summary

This study analyses 629 business loans disbursed by a Nigerian commercial lending institution between May 2025 and April 2026, representing a total portfolio value of ₦38.93 billion over a twelve-month period. Of these, 331 loans remain open (Active or In Arrears) at the time of data extraction — their absence of a closure date reflects current open status, not a data entry error. With 76 loans (12.08%, totalling approximately ₦2.94 billion) currently in arrears, credit risk management has become a central operational priority as monthly disbursements grew from ₦2.07 billion in August 2025 to ₦6.03 billion in April 2026.

Five analytical techniques are applied to this real dataset, which introduces Number of Installments as a key new variable. Exploratory data analysis reveals that loan amounts are severely right-skewed (median ₦35M vs mean ₦61.9M) and identifies data quality issues including activation timestamps with time-of-day components. Data visualisation tells a coherent portfolio story: arrears concentrate in standard Business Loans (14.2%), loans with 12 installments (33.3%), and facilities below ₦10M (42.9%). Hypothesis testing confirms that interest rates differ significantly across loan products (Kruskal-Wallis p < 0.001), and delivers an important null result — loan lifecycle stage is not associated with arrears (p = 0.547) — preventing a misguided policy intervention. Correlation analysis uncovers a strong negative rate–amount relationship (ρ = −0.563) and a meaningful installments–rate correlation (ρ = +0.372). Logistic regression identifies Number of Installments (OR = 1.32, p < 0.001) and log Loan Amount (OR = 0.64, p = 0.015) as the two statistically significant pre-disbursement predictors of arrears, producing a model AUC of 0.668.

The primary recommendation is that the organisation introduce a pre-disbursement review trigger for loans with nine or more installments and loans below ₦10M — the two most concentrated arrears risk dimensions identified across all five techniques.

2. Professional Disclosure

Job Title: [Head of Internal Control and Audit] Organisation Type: Nigerian Commercial Lending Institution Sector: Financial Services / Business Credit

Technique Justifications

Exploratory Data Analysis (EDA): In my role I review loan application files and monthly portfolio performance reports. Before drawing any conclusions I must audit the quality and shape of the data. This dataset contains activation timestamps with time-of-day precision and a severely right-skewed loan amount distribution — both of which require deliberate pre-processing decisions. Critically, the dataset’s missing closure dates reflect open loans rather than data entry errors; documenting this correctly prevents misinterpretation of 331 still-active facilities as missing data.

Data Visualisation: Credit committee presentations and board risk briefings demand visual storytelling. A chart showing that 12-installment loans carry a 33.3% arrears rate — nearly three times the portfolio average — communicates risk concentration in one glance. In my day-to-day role, monthly portfolio dashboards are the primary briefing tool for senior leadership, and the five-plot narrative here directly mirrors that professional context.

Hypothesis Testing: Before recommending policy changes I must establish statistical rigour. The null result — that loan lifecycle stage does not predict arrears (p = 0.547) — is as valuable as a significant result; it prevents the organisation from introducing an approval friction point (screening by New/Renewal/Top Up status) that has no evidential basis in this data.

Correlation Analysis: Understanding how loan size, pricing, and installment structure relate to one another is central to every credit strategy discussion I participate in. The positive correlation between installment count and interest rate (ρ = +0.372) reveals that our product design already partially embeds installment risk — but this has not yet been translated into an explicit pre-disbursement flag.

Logistic Regression: The operational goal is a pre-disbursement risk score: given the observable characteristics of a loan at approval time, what is the probability it falls into arrears? The model’s installment coefficient (OR = 1.32 per additional installment) and amount coefficient (OR = 0.64 per log-unit increase) provide two concrete rules that can be embedded in the loan approval workflow today.

3. Data Collection & Sampling

3.1 Source and Collection Method

The dataset was extracted from the organisation’s internal Loan Management System (LMS), which records all loan approvals, disbursements, and repayment events. The extract was generated via a direct database query covering all loans with an Activation Date between 1 May 2025 and 30 April 2026, providing a clean twelve-month analytical window. Data was exported as a Microsoft Excel file and loaded into R for analysis. Loan identifiers have been replaced with neutral sequential codes (Loan_001, Loan_002, …) and the organisation name is withheld; no other data has been modified. No external or publicly available datasets supplement this analysis.

3.2 Sampling Frame

The sampling frame is the complete population of business loans disbursed during the twelve-month study window — a census of all 629 loan records activated in the LMS between May 2025 and April 2026. No random sampling was required; all disbursed loans within the date window are included, making the analysis free of sampling bias at the data collection stage.

3.3 Variables

Variable	Type	Description
`loan_id`	Character	Anonymised sequential identifier (Loan_001 … Loan_629)
`activation_date`	Date	Date the loan was approved and activated
`first_repayment_date`	Date	Scheduled date of first repayment
`closed_date`	Date	Date loan was fully repaid — NA means the loan is still open
`account_state`	Categorical	Closed, Active, or In Arrears
`loan_name`	Categorical	Product name (4 types)
`loan_amount`	Numeric (₦)	Principal disbursed
`interest_rate`	Numeric (%)	Monthly interest rate
`num_installments`	Integer	Number of scheduled repayment installments
`loan_type`	Categorical	Lifecycle stage: New, Renewal, Top Up
`duration_days`	Numeric	Days from activation to closure (NA = still open)
`days_to_first_repayment`	Numeric	Days from activation to first repayment (derived)
`in_arrears`	Binary 0/1	1 if account_state == “In Arrears” (derived outcome)
`loan_open`	Binary 0/1	1 if the loan is still open — Active or In Arrears (derived)

3.4 Note on Missing Closed Dates

A missing closed_date does not indicate a data quality problem. It means the loan is still open at the time of extraction — either Active (repayments ongoing) or In Arrears (repayments overdue). Of the 629 loans, 298 are Closed (have a recorded closure date) and 331 are still open (255 Active + 76 In Arrears). This is correctly reflected in the loan_open variable and in all analyses throughout this document.

3.5 Time Period and Ethics

Period: 1 May 2025 to 30 April 2026 (12 months). All data relates to corporate business loan accounts; no personally identifiable information (PII) relating to natural persons is included. Loan identifiers have been anonymised. The organisation name is withheld to protect commercial confidentiality. Analysis was conducted with the knowledge and approval of the relevant department head. Data is available on request from the author.

4. Data Description

4.1 Load and Prepare Data

Code

library(tidyverse)
library(skimr)
library(janitor)
library(lubridate)
library(knitr)
library(kableExtra)
library(glue)
library(scales)

df <- read_csv("loan_data_final.csv", show_col_types = FALSE) |>
  mutate(
    activation_date      = ymd(activation_date),
    first_repayment_date = ymd(first_repayment_date),
    closed_date          = ymd(closed_date),
    in_arrears           = as.integer(in_arrears),
    loan_open            = as.integer(loan_open),
    loan_type            = factor(loan_type, levels = c("New","Renewal","Top Up")),
    loan_name_short = case_when(
      str_detect(loan_name, "LPO")       ~ "LPO Financing",
      str_detect(loan_name, "Quarterly") ~ "BL Quarterly Deferred",
      str_detect(loan_name, "Deferred")  ~ "BL Deferred Principal",
      TRUE                               ~ "Business Loan Std"
    ),
    loan_name_short = factor(loan_name_short,
      levels = c("Business Loan Std","LPO Financing",
                 "BL Quarterly Deferred","BL Deferred Principal")),
    log_amount  = log(loan_amount),
    amount_bin  = cut(loan_amount,
      breaks = c(0, 10e6, 25e6, 50e6, 100e6, Inf),
      labels = c("<₦10M","₦10–25M","₦25–50M","₦50–100M",">₦100M")),
    install_grp = case_when(
      num_installments <= 3  ~ "1–3",
      num_installments <= 6  ~ "4–6",
      num_installments <= 9  ~ "7–9",
      TRUE                   ~ "10–12"
    ),
    month = floor_date(activation_date, "month")
  )

glimpse(df)

Rows: 629
Columns: 19
$ loan_id                 <chr> "Loan_001", "Loan_002", "Loan_003", "Loan_004"…
$ activation_date         <date> 2025-10-04, 2026-01-30, 2025-07-03, 2025-11-1…
$ first_repayment_date    <date> 2025-11-04, 2026-03-05, 2025-08-02, 2025-12-1…
$ closed_date             <date> 2026-01-16, NA, 2025-09-29, 2026-04-23, 2026-…
$ account_state           <chr> "Closed", "Active", "Closed", "Closed", "Close…
$ loan_name               <chr> "Business Loan Deferred Principal", "Business …
$ loan_amount             <dbl> 200000000, 150000000, 50000000, 17000000, 7000…
$ interest_rate           <dbl> 5.740, 5.740, 4.800, 6.830, 6.050, 3.900, 5.74…
$ num_installments        <dbl> 6, 6, 2, 6, 6, 1, 6, 6, 6, 1, 4, 6, 6, 6, 6, 6…
$ loan_type               <fct> Renewal, Top Up, Renewal, New, New, Renewal, T…
$ duration_days           <dbl> 103, NA, 87, 162, 92, 33, 54, 200, 78, 76, NA,…
$ days_to_first_repayment <dbl> 30, 33, 29, 28, 30, 30, 30, 30, 19, 19, 30, 31…
$ in_arrears              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0…
$ loan_open               <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1…
$ loan_name_short         <fct> BL Deferred Principal, Business Loan Std, LPO …
$ log_amount              <dbl> 19.11383, 18.82615, 17.72753, 16.64872, 18.064…
$ amount_bin              <fct> >₦100M, >₦100M, ₦25–50M, ₦10–25M, ₦50–100M, ₦2…
$ install_grp             <chr> "4–6", "4–6", "1–3", "4–6", "4–6", "1–3", "4–6…
$ month                   <date> 2025-10-01, 2026-01-01, 2025-07-01, 2025-11-0…

Code

import pandas as pd
import numpy as np

df_py = pd.read_csv("loan_data_final.csv",
    parse_dates=["activation_date","first_repayment_date","closed_date"])

df_py["loan_type"] = pd.Categorical(df_py["loan_type"],
    categories=["New","Renewal","Top Up"])
df_py["loan_name_short"] = df_py["loan_name"].apply(lambda x:
    "LPO Financing"         if "LPO"       in str(x) else
    "BL Quarterly Deferred" if "Quarterly" in str(x) else
    "BL Deferred Principal" if "Deferred"  in str(x) else
    "Business Loan Std")
df_py["log_amount"] = np.log(df_py["loan_amount"])
df_py["month"]      = df_py["activation_date"].dt.to_period("M")
df_py["amount_bin"] = pd.cut(df_py["loan_amount"],
    bins=[0, 10e6, 25e6, 50e6, 100e6, float("inf")],
    labels=["<10M","10-25M","25-50M","50-100M",">100M"])

# Confirm open vs closed
print(f"Total loans: {len(df_py)}")

Total loans: 629

Code

print(f"Closed (have closure date): {df_py['closed_date'].notna().sum()}")

Closed (have closure date): 298

Code

print(f"Still open (no closure date): {df_py['closed_date'].isna().sum()}")

Still open (no closure date): 331

Code

print(f"  of which Active:     {(df_py['account_state']=='Active').sum()}")

  of which Active:     255

Code

print(f"  of which In Arrears: {(df_py['account_state']=='In Arrears').sum()}")

  of which In Arrears: 76

4.2 Loan Status Overview

Code

status_tbl <- df |>
  group_by(account_state) |>
  summarise(
    n            = n(),
    pct          = percent(n()/nrow(df), 0.1),
    total_amount = comma(round(sum(loan_amount)/1e9, 2)),
    has_closed_date = sum(!is.na(closed_date)),
    missing_closed  = sum(is.na(closed_date)),
    .groups = "drop"
  )

kable(status_tbl,
      col.names = c("Account State","n","% Portfolio","Total (₦B)",
                    "Has Closure Date","No Closure Date (= Still Open)"),
      caption   = "Table 1: Loan status — missing closed_date = loan still open") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) |>
  row_spec(which(status_tbl$account_state == "In Arrears"), bold = TRUE,
           background = "#FDECEA")

Table 1: Loan status — missing closed_date = loan still open
Account State	n	% Portfolio	Total (₦B)	Has Closure Date	No Closure Date (= Still Open)
Active	255	40.5%	17	0	255
Closed	298	47.4%	19	298	0
In Arrears	76	12.1%	3	0	76

4.3 Summary Statistics

Code

df |>
  select(loan_amount, interest_rate, num_installments,
         days_to_first_repayment, duration_days, in_arrears) |>
  skim() |>
  kable(digits = 2,
        caption = "Table 2: Summary statistics — numeric variables") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)

Table 2: Summary statistics — numeric variables
skim_type	skim_variable	n_missing	complete_rate	numeric.mean	numeric.sd	numeric.p0	numeric.p25	numeric.p50	numeric.p75	numeric.p100	numeric.hist
numeric	loan_amount	0	1.00	61889469.45	64609498.03	846403.00	2.000e+07	3.50e+07	9.0000e+07	3.00e+08	▇▂▁▁▁
numeric	interest_rate	0	1.00	6.08	1.23	2.43	5.600e+00	6.36e+00	6.8300e+00	8.84e+00	▂▃▅▇▃
numeric	num_installments	0	1.00	5.68	1.92	1.00	6.000e+00	6.00e+00	6.0000e+00	1.20e+01	▂▁▇▁▁
numeric	days_to_first_repayment	0	1.00	30.60	5.61	6.00	2.900e+01	3.00e+01	3.1000e+01	8.80e+01	▁▇▁▁▁
numeric	duration_days	331	0.47	117.17	55.71	0.00	8.225e+01	1.09e+02	1.6175e+02	3.04e+02	▂▇▅▃▁
numeric	in_arrears	0	1.00	0.12	0.33	0.00	0.000e+00	0.00e+00	0.0000e+00	1.00e+00	▇▁▁▁▁

4.4 Categorical Profiles

Code

bind_rows(
  df |> count(account_state)   |> mutate(pct=n/sum(n), var="Account State")  |> rename(level=account_state),
  df |> count(loan_type)       |> mutate(pct=n/sum(n), var="Loan Type")      |> rename(level=loan_type),
  df |> count(loan_name_short) |> mutate(pct=n/sum(n), var="Loan Product")   |> rename(level=loan_name_short)
) |>
  mutate(pct = percent(pct, 0.1)) |>
  select(Variable=var, Category=level, n, `%`=pct) |>
  kable(caption = "Table 3: Categorical variable frequencies") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)

Table 3: Categorical variable frequencies
Variable	Category	n	%
Account State	Active	255	40.5%
Account State	Closed	298	47.4%
Account State	In Arrears	76	12.1%
Loan Type	New	296	47.1%
Loan Type	Renewal	213	33.9%
Loan Type	Top Up	120	19.1%
Loan Product	Business Loan Std	471	74.9%
Loan Product	LPO Financing	81	12.9%
Loan Product	BL Quarterly Deferred	45	7.2%
Loan Product	BL Deferred Principal	32	5.1%

5. Technique 1 — Exploratory Data Analysis (EDA)

5.1 Theory

Exploratory Data Analysis (EDA), formalised by Tukey (1977), uses statistical summaries and graphical displays to understand a dataset’s structure before applying formal models. Key activities include distributional assessment, outlier detection, missing-value classification, and data quality auditing. Adi (2026) illustrates with Anscombe’s Quartet that datasets with identical summary statistics can exhibit radically different structures when visualised — making combined numeric and visual EDA indispensable before any inferential work.

5.2 Business Justification

A data audit is the first step in every portfolio review I conduct. Two issues in this dataset require deliberate decisions: activation timestamps include time-of-day components that must be truncated for monthly aggregation, and missing closure dates must be correctly interpreted as open loans rather than data entry errors — a distinction with significant implications for portfolio monitoring.

5.3 Analysis

Code

# Demonstrate that missing closed_date = still open
open_status <- df |>
  mutate(closed_date_present = !is.na(closed_date)) |>
  count(account_state, closed_date_present) |>
  arrange(account_state)

kable(open_status,
      col.names = c("Account State","Has Closed Date","Count"),
      caption   = "Table 4: Closed date presence by account state — confirms missing = still open") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
  row_spec(which(!open_status$closed_date_present), background = "#FFF3CD")

Table 4: Closed date presence by account state — confirms missing = still open
Account State	Has Closed Date	Count
Active	FALSE	255
Closed	TRUE	298
In Arrears	FALSE	76

Code

library(naniar)
df |>
  miss_var_summary() |>
  filter(n_miss > 0) |>
  mutate(interpretation = case_when(
    variable == "closed_date"   ~ "Expected — loan still open (Active or In Arrears)",
    variable == "duration_days" ~ "Expected — derived from closed_date; NA for open loans",
    TRUE ~ "Review required"
  )) |>
  kable(digits = 2,
        caption = "Table 5: Missing value summary with interpretation") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Table 5: Missing value summary with interpretation
variable	n_miss	pct_miss	interpretation
closed_date	331	52.6	Expected — loan still open (Active or In Arrears)
duration_days	331	52.6	Expected — derived from closed_date; NA for open loans

Code

vis_miss(df |> select(loan_amount, interest_rate, num_installments,
                      days_to_first_repayment, closed_date, duration_days)) +
  labs(
    title    = "Figure 1: Missing value map",
    subtitle = "closed_date and duration_days are missing for all 331 still-open loans — not a data quality problem"
  ) +
  theme_minimal(base_size = 11)

Code

q1  <- quantile(df$loan_amount, 0.25)
q3  <- quantile(df$loan_amount, 0.75)
iqr_val <- q3 - q1
upper_fence <- q3 + 1.5 * iqr_val
lower_fence  <- q1 - 1.5 * iqr_val
outliers_df  <- df |> filter(loan_amount > upper_fence | loan_amount < lower_fence)

cat(glue(
  "IQR fences: ₦{comma(round(lower_fence))} — ₦{comma(round(upper_fence))}
Outlier loans (IQR method): {nrow(outliers_df)}
Largest single loan:        ₦{comma(max(df$loan_amount))}
Smallest loan:              ₦{comma(min(df$loan_amount))}"
))

IQR fences: ₦-85,000,000 — ₦195,000,000
Outlier loans (IQR method): 43
Largest single loan:        ₦300,000,000
Smallest loan:              ₦846,403

Code

library(moments)
cat(glue(
  "Loan Amount skewness:      {round(skewness(df$loan_amount), 3)}
Interest Rate skewness:    {round(skewness(df$interest_rate), 3)}
Num Installments skewness: {round(skewness(df$num_installments), 3)}"
))

Loan Amount skewness:      1.938
Interest Rate skewness:    -0.639
Num Installments skewness: 0.066

Code

df |>
  group_by(num_installments) |>
  summarise(
    n            = n(),
    pct          = percent(n()/nrow(df), 0.1),
    mean_amount  = comma(round(mean(loan_amount))),
    mean_rate    = round(mean(interest_rate), 2),
    n_open       = sum(loan_open),
    n_arrears    = sum(in_arrears),
    arrears_rate = percent(mean(in_arrears), 0.1),
    .groups = "drop"
  ) |>
  kable(col.names = c("Installments","n","% of portfolio","Mean amount",
                      "Mean rate (%)","Still open","In Arrears","Arrears rate"),
        caption   = "Table 6: Number of Installments — full profile") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE) |>
  row_spec(which(c(1,2,3,4,5,6,7,8,9,10,11) %in% c(9,11)), # rows for 9 and 12 installments
           bold = TRUE, background = "#FFF3CD")

Table 6: Number of Installments — full profile
Installments	n	% of portfolio	Mean amount	Mean rate (%)	Still open	In Arrears	Arrears rate
1	24	3.8%	67,122,835	4.59	3	2	8.3%
2	30	4.8%	60,076,555	4.17	6	1	3.3%
3	43	6.8%	57,262,679	4.87	20	2	4.7%
4	14	2.2%	70,238,982	5.42	7	0	0.0%
5	1	0.2%	107,290,000	7.29	0	0	0.0%
6	465	73.9%	59,016,150	6.38	250	59	12.7%
7	1	0.2%	8,920,097	7.75	1	1	100.0%
8	4	0.6%	45,500,000	6.23	3	0	0.0%
9	29	4.6%	91,470,835	6.44	26	6	20.7%
10	3	0.5%	146,666,667	5.95	3	0	0.0%
12	15	2.4%	82,414,441	6.24	12	5	33.3%

Code

library(patchwork)

p1 <- ggplot(df, aes(x = loan_amount / 1e6)) +
  geom_histogram(bins = 30, fill = "#378ADD", colour = "white", linewidth = 0.2) +
  geom_vline(xintercept = median(df$loan_amount)/1e6,
             linetype = "dashed", colour = "#E24B4A", linewidth = 0.9) +
  annotate("text", x = median(df$loan_amount)/1e6 + 40, y = 55,
           label = glue("Median ₦{round(median(df$loan_amount)/1e6)}M"),
           size = 3, colour = "#E24B4A") +
  labs(title = "Loan amount distribution",
       x = "₦ million", y = "Count") +
  theme_minimal(base_size = 11)

p2 <- ggplot(df, aes(x = interest_rate)) +
  geom_histogram(bins = 20, fill = "#1D9E75", colour = "white", linewidth = 0.2) +
  labs(title = "Interest rate (%/month)",
       x = "Monthly rate (%)", y = "Count") +
  theme_minimal(base_size = 11)

p3 <- ggplot(df, aes(x = factor(num_installments), fill = install_grp)) +
  geom_bar(colour = "white", linewidth = 0.2) +
  scale_fill_manual(values = c("1–3"="#1D9E75","4–6"="#378ADD",
                               "7–9"="#EF9F27","10–12"="#E24B4A"),
                    name = "Group") +
  labs(title = "Number of installments",
       x = "Installments", y = "Count") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "bottom", legend.text = element_text(size = 8))

p4 <- ggplot(df, aes(x = loan_amount/1e6, y = loan_name_short,
                     fill = loan_name_short)) +
  geom_boxplot(alpha = 0.7, outlier.colour = "#E24B4A", outlier.size = 1.5) +
  scale_fill_manual(values = c("Business Loan Std"    = "#378ADD",
                               "LPO Financing"         = "#EF9F27",
                               "BL Quarterly Deferred" = "#1D9E75",
                               "BL Deferred Principal" = "#A32D2D")) +
  labs(title = "Loan amount by product",
       x = "₦M", y = NULL) +
  theme_minimal(base_size = 11) + theme(legend.position = "none")

(p1 + p2) / (p3 + p4) +
  plot_annotation(
    title = "Figure 2: Distributions of key variables",
    theme = theme(plot.title = element_text(size = 13, face = "bold"))
  )

Code

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(13, 4))

axes[0].hist(df_py["loan_amount"]/1e6, bins=30, color="#378ADD", edgecolor="white")
axes[0].axvline(df_py["loan_amount"].median()/1e6, color="#E24B4A",
                linestyle="--", linewidth=1.5,
                label=f'Median ₦{df_py["loan_amount"].median()/1e6:.0f}M')
axes[0].set_title("Loan amount distribution"); axes[0].set_xlabel("₦ million")
axes[0].legend(fontsize=9)

axes[1].hist(df_py["interest_rate"], bins=20, color="#1D9E75", edgecolor="white")
axes[1].set_title("Interest rate distribution"); axes[1].set_xlabel("Monthly rate (%)")

install_counts = df_py["num_installments"].value_counts().sort_index()
colours = ["#E24B4A" if v >= 9 else "#EF9F27" if v == 6 else "#1D9E75"
           for v in install_counts.index]
axes[2].bar(install_counts.index.astype(str), install_counts.values,
            color=colours, edgecolor="white")
axes[2].set_title("Number of installments"); axes[2].set_xlabel("Installments")

plt.tight_layout()
plt.savefig("eda_py.png", dpi=150, bbox_inches="tight")
plt.show()

Code

print(f"Loan Amount skewness:      {df_py['loan_amount'].skew():.3f}")

Loan Amount skewness:      1.943

Code

print(f"Interest Rate skewness:    {df_py['interest_rate'].skew():.3f}")

Interest Rate skewness:    -0.640

Code

print(f"\nMissing values (all = open loans):")


Missing values (all = open loans):

Code

print(df_py.isnull().sum()[df_py.isnull().sum()>0])

closed_date      331
duration_days    331
dtype: int64

5.4 Data Quality Issues Identified and Resolved

#	Issue	Detail	Resolution
1	Activation date includes time-of-day	Timestamps stored with HH:MM:SS precision (e.g. “2025-10-04 21:17:46”)	Applied `floor_date(activation_date, "month")` in R; `dt.to_period("M")` in Python
2	Missing Closed Date (331 / 52.6%)	No closure date recorded for Active (255) and In Arrears (76) loans	Correctly interpreted as still-open loans, not missing data. Confirmed by cross-checking with account_state. Documented via `loan_open` variable.

No Loan Type encoding issues. Loan Type contains exactly three clean, consistent categories — New (296), Renewal (213), Top Up (120). No standardisation required.

5.5 Business Interpretation

Three findings have direct operational consequences. First, the timestamp issue is a reporting system artefact requiring IT attention — loan activation timestamps should record date only, not server time with HH:MM:SS precision, to avoid confusion in monthly reports. Second, the right-skewed loan amount distribution (skewness = 1.943) means the mean (₦61.9M) overstates the typical facility by 77% relative to the median (₦35M) — any board report using average loan size will misrepresent the portfolio’s typical credit exposure. Third, the Number of Installments distribution is concentrated at 6 (73.9% of loans), with risk-significant long tails: 12-installment loans carry a 33.3% arrears rate and 9-installment loans carry 20.7% — both well above the 12.1% portfolio average. These high-installment facilities represent a disproportionate share of credit risk despite their small volume share.

6. Technique 2 — Data Visualisation

6.1 Theory

The grammar of graphics (Wilkinson, 2005), implemented in R’s ggplot2, provides a principled framework for constructing statistical graphics by mapping data attributes to visual channels — position, colour, size, and shape. Effective business visualisation requires matching chart type to the data relationship being shown, minimising non-data ink, and ensuring the key message is legible to non-specialist audiences (Adi, 2026, Ch. 5). The five plots below form a deliberate narrative arc — from portfolio growth, through structural composition, to risk concentration.

6.2 Business Justification

Risk committee briefings depend on visual clarity. A chart showing that 12-installment loans carry a 33.3% arrears rate communicates risk concentration immediately. In my professional role, monthly portfolio dashboards are the primary briefing tool for senior leadership, and these chart choices directly reflect that context.

6.3 Five-Plot Visual Narrative

These five plots tell one connected story: the portfolio has grown consistently and is broadly healthy — but a specific risk concentration defined by installment count and facility size demands targeted underwriting attention.

Code

df_monthly <- df |>
  group_by(month) |>
  summarise(count = n(), total_bn = sum(loan_amount)/1e9, .groups = "drop")

ggplot(df_monthly, aes(x = month)) +
  geom_col(aes(y = total_bn), fill = "#378ADD", alpha = 0.85, width = 20) +
  geom_line(aes(y = count/15), colour = "#E24B4A",
            linewidth = 1.1, group = 1) +
  geom_point(aes(y = count/15), colour = "#E24B4A", size = 2.8) +
  scale_y_continuous(
    name = "Total disbursed (₦ billion)",
    sec.axis = sec_axis(~.*15, name = "Number of loans")
  ) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  labs(
    title    = "Figure 3: Monthly loan disbursements — May 2025 to April 2026",
    subtitle = "Bars = total value (₦B, left axis)  |  Line = loan count (right axis)",
    x = NULL
  ) +
  theme_minimal(base_size = 11) +
  theme(axis.text.x = element_text(angle = 35, hjust = 1))

Code

ggplot(df, aes(x = loan_name_short, y = loan_amount/1e6,
               fill = loan_name_short)) +
  geom_violin(alpha = 0.55, trim = TRUE, colour = "white") +
  geom_jitter(aes(colour = factor(in_arrears)),
              width = 0.12, size = 1.6, alpha = 0.8) +
  scale_fill_manual(values = c("Business Loan Std"    = "#378ADD",
                               "LPO Financing"         = "#EF9F27",
                               "BL Quarterly Deferred" = "#1D9E75",
                               "BL Deferred Principal" = "#A32D2D")) +
  scale_colour_manual(values = c("0"="#888780","1"="#E24B4A"),
                      labels = c("0"="Performing","1"="In Arrears"),
                      name = "Status") +
  labs(
    title    = "Figure 4: Loan amount and arrears by product type",
    subtitle = "Red dots = In Arrears  |  BL Deferred Principal: 0% arrears across all 32 loans",
    x = "Product", y = "Loan amount (₦M)"
  ) +
  coord_flip() +
  theme_minimal(base_size = 11) + theme(legend.position = "bottom")

Code

ggplot(df, aes(x = interest_rate, fill = loan_name_short)) +
  geom_density(alpha = 0.55, colour = "white") +
  scale_fill_manual(values = c("Business Loan Std"    = "#378ADD",
                               "LPO Financing"         = "#EF9F27",
                               "BL Quarterly Deferred" = "#1D9E75",
                               "BL Deferred Principal" = "#A32D2D"),
                    name = "Product") +
  labs(
    title    = "Figure 5: Interest rate density by product type",
    subtitle = "Business Loan Std clusters at 6.38–6.83%  |  LPO and Deferred products: 3.5–5%",
    x = "Monthly interest rate (%)", y = "Density"
  ) +
  theme_minimal(base_size = 11) + theme(legend.position = "bottom")

Code

arrears_install <- df |>
  filter(num_installments %in% c(1,2,3,6,9,12)) |>
  group_by(num_installments) |>
  summarise(n=n(), rate=mean(in_arrears), .groups="drop")

portfolio_avg <- mean(df$in_arrears)

ggplot(arrears_install, aes(x = factor(num_installments), y = rate,
                             fill = rate > portfolio_avg)) +
  geom_col(width = 0.6, alpha = 0.85) +
  geom_hline(yintercept = portfolio_avg,
             linetype = "dashed", colour = "#888780", linewidth = 0.8) +
  geom_text(aes(label = paste0(round(rate*100,1), "%\nn=", n)),
            vjust = -0.25, size = 3.2) +
  scale_fill_manual(values = c("FALSE"="#1D9E75","TRUE"="#E24B4A"),
                    guide = "none") +
  scale_y_continuous(labels = percent, limits = c(0, 0.45)) +
  annotate("text", x = 0.65, y = portfolio_avg + 0.02,
           label = glue("Portfolio avg {percent(portfolio_avg, 0.1)}"),
           size = 3, colour = "#888780") +
  labs(
    title    = "Figure 6: Arrears rate by number of installments",
    subtitle = "12-installment loans: 33.3% arrears — nearly 3× the portfolio average of 12.1%",
    x = "Number of installments", y = "Arrears rate"
  ) +
  theme_minimal(base_size = 11)

Code

ggplot(df, aes(x = loan_amount/1e6, y = interest_rate,
               colour = factor(in_arrears))) +
  geom_point(alpha = 0.55, size = 1.8) +
  geom_smooth(data = filter(df, in_arrears==0), method = "lm", se = TRUE,
              colour = "#378ADD", fill = "#B5D4F4", linewidth = 0.8) +
  scale_colour_manual(values = c("0"="#378ADD","1"="#E24B4A"),
                      labels = c("0"="Performing","1"="In Arrears"),
                      name = "Status") +
  scale_x_log10(labels = comma) +
  labs(
    title    = "Figure 7: Loan amount vs interest rate — performing vs arrears",
    subtitle = "Log scale. Arrears loans cluster at smaller amounts. Trend line = performing loans only.",
    x = "Loan amount (₦M, log scale)", y = "Monthly rate (%)"
  ) +
  theme_minimal(base_size = 11) + theme(legend.position = "bottom")

Code

import seaborn as sns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(13, 4))

arr_i = (df_py[df_py["num_installments"].isin([1,2,3,6,9,12])]
    .groupby("num_installments")["in_arrears"]
    .agg(["mean","count"]).reset_index())
colours = ["#E24B4A" if v > df_py["in_arrears"].mean() else "#1D9E75"
           for v in arr_i["mean"]]
axes[0].bar(arr_i["num_installments"].astype(str), arr_i["mean"]*100,
            color=colours, alpha=0.85)
axes[0].axhline(df_py["in_arrears"].mean()*100, linestyle="--",
                color="#888780", linewidth=1, label="Portfolio avg")
axes[0].set_title("Arrears rate by installments (%)", fontsize=11)
axes[0].set_xlabel("Installments"); axes[0].legend(fontsize=9)

arr_b = (df_py.groupby("amount_bin", observed=True)["in_arrears"]
    .agg(["mean","count"]).reset_index())
axes[1].bar(arr_b["amount_bin"].astype(str), arr_b["mean"]*100,
            color="#E24B4A", alpha=0.75)
axes[1].set_title("Arrears rate by loan amount band (%)", fontsize=11)
axes[1].set_xlabel("Amount band"); axes[1].set_ylabel("Arrears rate (%)")

plt.tight_layout()
plt.savefig("viz_py.png", dpi=150, bbox_inches="tight")
plt.show()

6.4 Business Interpretation

The five plots build one strategic argument. Figure 3 establishes sustained portfolio growth — disbursements rose from ₦2.07B in August 2025 to ₦6.03B in April 2026, with loan counts growing from 35 to 87 in the same window. Figure 4 reveals the most striking product-level finding: all 32 BL Deferred Principal loans are performing (0% arrears), while Business Loan Std (n=471) carries the bulk of portfolio risk. Figure 5 confirms clear tiered pricing, with rates separating cleanly by product type. Figure 6 delivers the most actionable operational finding: 12-installment loans carry a 33.3% arrears rate and 9-installment loans carry 20.7% — both nearly three times the portfolio average. Figure 7 shows that arrears loans cluster in the lower-amount region, consistent with the logistic regression’s amount coefficient.

7. Technique 3 — Hypothesis Testing

7.1 Theory

Hypothesis testing provides a formal framework for distinguishing real effects from random variation. The analyst specifies H₀ (no effect) and H₁ (an effect exists), selects a test appropriate to the data type and distributional assumptions, and evaluates both p-value and effect size. Equally important: a null result is not a failure — it is evidence that a suspected pattern does not exist, and it prevents the analyst from recommending policy changes based on noise (Adi, 2026, Ch. 6).

7.2 Business Justification

Two questions carry direct policy implications: (1) Do interest rates genuinely differ across loan products? (2) Is loan lifecycle stage associated with arrears? Confirming question one validates the organisation’s tiered pricing policy. The null result on question two — that New, Renewal, and Top Up loans have statistically identical arrears rates — prevents an approval process change that would create friction without improving risk detection.

7.3 Hypothesis 1 — Do Interest Rates Differ Across Loan Product Types?

H₀: The distribution of monthly interest rates is identical across all four loan product categories. H₁: At least one product type has a significantly different interest rate distribution. Test: Kruskal-Wallis — normality violated (Shapiro-Wilk: Amount W=0.752, p<0.001; Rate W=0.877, p<0.001). α = 0.05

Code

library(rstatix)
library(effectsize)

sw_amt  <- shapiro.test(sample(df$loan_amount,   50, replace=FALSE))
sw_rate <- shapiro.test(sample(df$interest_rate, 50, replace=FALSE))
cat(glue(
  "Shapiro-Wilk — Loan Amount (n=50):   W={round(sw_amt$statistic,3)},  p={format(sw_amt$p.value, scientific=TRUE)}
Shapiro-Wilk — Interest Rate (n=50): W={round(sw_rate$statistic,3)}, p={format(sw_rate$p.value, scientific=TRUE)}
→ Normality violated. Using Kruskal-Wallis.\n"
))

Shapiro-Wilk — Loan Amount (n=50):   W=0.69,  p=5.460337e-09
Shapiro-Wilk — Interest Rate (n=50): W=0.943, p=1.704095e-02
→ Normality violated. Using Kruskal-Wallis.

Code

kw1 <- kruskal.test(interest_rate ~ loan_name_short, data = df)
cat(glue(
  "Kruskal-Wallis (Rate ~ Product): χ²({kw1$parameter}) = {round(kw1$statistic,3)},
  p = {format(kw1$p.value, scientific=TRUE)}\n"
))

Kruskal-Wallis (Rate ~ Product): χ²(3) = 293.062,
p = 3.156798e-63

Code

df |>
  group_by(loan_name_short) |>
  summarise(n=n(), mean_rate=round(mean(interest_rate),3),
            median_rate=round(median(interest_rate),3),
            sd=round(sd(interest_rate),3), .groups="drop") |>
  kable(col.names=c("Product","n","Mean (%)","Median (%)","SD"),
        caption="Table 7: Interest rate by loan product") |>
  kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE)

Table 7: Interest rate by loan product
Product	n	Mean (%)	Median (%)	SD
Business Loan Std	471	6.621	6.380	0.770
LPO Financing	81	4.254	4.200	0.691
BL Quarterly Deferred	45	5.158	4.935	0.665
BL Deferred Principal	32	4.087	3.500	1.241

Code

ggplot(df, aes(x=loan_name_short, y=interest_rate, fill=loan_name_short)) +
  geom_boxplot(alpha=0.7, outlier.colour="#E24B4A", outlier.size=1.5) +
  scale_fill_manual(values=c("Business Loan Std"="#378ADD","LPO Financing"="#EF9F27",
                             "BL Quarterly Deferred"="#1D9E75","BL Deferred Principal"="#A32D2D")) +
  labs(title="Figure 8: Interest rate by loan product",
       subtitle=glue("Kruskal-Wallis χ²({kw1$parameter})={round(kw1$statistic,1)}, p<0.001 — differences confirmed"),
       x="Product", y="Monthly rate (%)") +
  coord_flip() + theme_minimal(base_size=11) + theme(legend.position="none")

Code

from scipy import stats as sc

groups = [g["interest_rate"].values for _,g in df_py.groupby("loan_name_short")]
h, p = sc.kruskal(*groups)
print(f"Kruskal-Wallis (Rate ~ Product): H={h:.3f}, p={p:.2e}")

Kruskal-Wallis (Rate ~ Product): H=293.062, p=3.16e-63

Code

print("\nMean rate by product:")


Mean rate by product:

Code

print(df_py.groupby("loan_name_short")["interest_rate"]
    .agg(["count","mean","median","std"]).round(3))

                       count   mean  median    std
loan_name_short                                   
BL Deferred Principal     32  4.087   3.500  1.241
BL Quarterly Deferred     45  5.158   4.935  0.665
Business Loan Std        471  6.621   6.380  0.770
LPO Financing             81  4.254   4.200  0.691

Result: χ²(3) = 293.06, p < 0.001. Reject H₀.

Business interpretation: The pricing differential across products is statistically confirmed, not random. Business Loan Std averages 6.62%/month versus 4.26% for LPO Financing and 4.09% for BL Deferred Principal. For the credit committee: “Our tiered pricing is working as designed — each product carries a genuinely distinct rate band, and these differences are highly consistent. This pricing architecture is statistically valid.”

7.4 Hypothesis 2 — Is Loan Lifecycle Stage Associated with Arrears?

H₀: Arrears status is independent of loan lifecycle stage (New / Renewal / Top Up). H₁: Arrears status is associated with loan lifecycle stage. Test: Pearson Chi-squared test of independence. α = 0.05

Code

ct <- df |> select(loan_type, in_arrears) |> table()
chi_res <- chisq.test(ct)
cramers_v <- sqrt(chi_res$statistic / (sum(ct) * (min(dim(ct)) - 1)))

cat(glue(
  "Chi-squared (Arrears ~ Loan Type):
   χ²({chi_res$parameter}) = {round(chi_res$statistic,3)},
   p = {round(chi_res$p.value,4)},
   Cramér's V = {round(cramers_v,3)}
   → FAIL TO REJECT H₀\n"
))

Chi-squared (Arrears ~ Loan Type):
χ²(2) = 1.207,
p = 0.5469,
Cramér's V = 0.044
→ FAIL TO REJECT H₀

Code

df |>
  group_by(loan_type) |>
  summarise(n=n(), n_open=sum(loan_open), n_arrears=sum(in_arrears),
            arrears_rate=percent(mean(in_arrears),0.1), .groups="drop") |>
  kable(col.names=c("Loan Type","n","Still Open","In Arrears","Arrears Rate"),
        caption="Table 8: Arrears by lifecycle stage — null result") |>
  kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE)

Table 8: Arrears by lifecycle stage — null result
Loan Type	n	Still Open	In Arrears	Arrears Rate
New	296	167	40	13.5%
Renewal	213	78	22	10.3%
Top Up	120	86	14	11.7%

Code

from scipy.stats import chi2_contingency
import numpy as np

ct_py = pd.crosstab(df_py["loan_type"], df_py["in_arrears"])
chi2, p_val, dof, _ = chi2_contingency(ct_py)
cv = np.sqrt(chi2 / (ct_py.values.sum() * (min(ct_py.shape)-1)))
print(f"Chi-squared: χ²({dof}) = {chi2:.3f}, p = {p_val:.4f}")

Chi-squared: χ²(2) = 1.207, p = 0.5469

Code

print(f"Cramér's V = {cv:.3f}  (negligible effect size)")

Cramér's V = 0.044  (negligible effect size)

Code

print("\nArrears rate by loan type:")


Arrears rate by loan type:

Code

print(df_py.groupby("loan_type")["in_arrears"]
    .agg(["count","sum","mean"]).round(3))

           count  sum   mean
loan_type                   
New          296   40  0.135
Renewal      213   22  0.103
Top Up       120   14  0.117

Result: χ²(2) = 1.207, p = 0.547, Cramér’s V = 0.044. Fail to reject H₀.

Business interpretation: This null result is among the most practically important findings in the study. Loan lifecycle stage — whether a loan is New (13.5%), Renewal (10.3%), or Top Up (11.7%) — is not significantly associated with arrears. The differences are negligible and entirely consistent with random sampling variation. For the credit committee: “Do not introduce lifecycle stage as a credit screening criterion. The data provides no evidence that Renewals or Top Ups are riskier than new loans. Such a rule would add process friction without improving risk detection — resources are better directed at installment count and facility size.”

8. Technique 4 — Correlation Analysis

8.1 Theory

Correlation analysis measures the strength and direction of association between pairs of numeric variables. Spearman’s ρ is used throughout as the non-parametric alternative appropriate when data are skewed. The foundational caveat: correlation is not causation — a significant correlation may reflect a common underlying driver rather than a direct relationship (Adi, 2026, Ch. 8).

8.2 Business Justification

Understanding how loan size, pricing, and installment structure relate to one another drives every credit strategy conversation I participate in. The positive correlation between installment count and interest rate (ρ = +0.372) suggests the organisation’s product design already partially encodes installment risk — but this has not been made explicit in the pre-disbursement approval process.

8.3 Analysis

Code

library(corrplot)
library(Hmisc)

corr_vars <- df |>
  select(loan_amount, interest_rate, num_installments,
         days_to_first_repayment, in_arrears) |>
  drop_na()

corr_mat  <- cor(corr_vars, method = "spearman")
corr_pmat <- rcorr(as.matrix(corr_vars), type = "spearman")$P

# Set diagonal of p-matrix to 1 to avoid dimension mismatch in corrplot
diag(corr_pmat) <- 1

corrplot(corr_mat,
  method="color", type="upper",
  addCoef.col="black", number.cex=0.82,
  tl.col="black", tl.srt=45,
  p.mat=corr_pmat, sig.level=0.05, insig="blank",
  col=colorRampPalette(c("#E24B4A","white","#378ADD"))(200),
  title="Figure 9: Spearman correlation matrix (blank = p ≥ 0.05)",
  mar=c(0,0,2,0))

Code

as.data.frame(round(corr_mat, 3)) |>
  kable(caption="Table 9: Spearman correlation coefficients") |>
  kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE)

Table 9: Spearman correlation coefficients
	loan_amount	interest_rate	num_installments	days_to_first_repayment	in_arrears
loan_amount	1.000	-0.563	0.000	0.137	-0.141
interest_rate	-0.563	1.000	0.372	-0.011	0.146
num_installments	0.000	0.372	1.000	0.014	0.136
days_to_first_repayment	0.137	-0.011	0.014	1.000	-0.052
in_arrears	-0.141	0.146	0.136	-0.052	1.000

Code

import seaborn as sns
import matplotlib.pyplot as plt

cols = ["loan_amount","interest_rate","num_installments",
        "days_to_first_repayment","in_arrears"]
spearman_corr = df_py[cols].corr(method="spearman").round(3)

fig, ax = plt.subplots(figsize=(7,6))
sns.heatmap(spearman_corr, annot=True, fmt=".3f", cmap="RdBu_r",
            center=0, vmin=-1, vmax=1, ax=ax,
            linewidths=0.5, cbar_kws={"shrink":0.8})

<Axes: >

Code

ax.set_title("Spearman correlation matrix", fontsize=11)

Text(0.5, 1.0, 'Spearman correlation matrix')

Code

plt.tight_layout()
plt.savefig("corr_py.png", dpi=150, bbox_inches="tight")
plt.show()

Code

print(spearman_corr)

                         loan_amount  ...  in_arrears
loan_amount                    1.000  ...      -0.141
interest_rate                 -0.563  ...       0.146
num_installments               0.000  ...       0.136
days_to_first_repayment        0.137  ...      -0.052
in_arrears                    -0.141  ...       1.000

[5 rows x 5 columns]

8.4 Key Correlations and Business Implications

1. Loan Amount ↔︎ Interest Rate (ρ = −0.563, moderate-strong negative, p < 0.001) The strongest correlation: larger loans attract lower monthly rates. This is consistent with tiered pricing where larger, typically better-collateralised clients negotiate lower rates. BL Deferred Principal loans (large amounts, lowest rates, 0% arrears) drive much of this relationship. Business implication: size and pricing are jointly reflecting credit quality, but neither alone is a sufficient risk predictor.

2. Interest Rate ↔︎ Number of Installments (ρ = +0.372, moderate positive, p < 0.001) Higher-rate loans tend to have more installments. This suggests the organisation’s product design already partially encodes repayment risk in its installment structure — longer repayment schedules accompany riskier, higher-rate facilities. This relationship explains why installment count predicts arrears even after controlling for rate in the regression.

3. Interest Rate ↔︎ In Arrears (ρ = +0.146, small positive, p < 0.001) Higher rates are modestly associated with arrears. Rate partially captures credit risk but is insufficient as a standalone screening tool — confirmed by its non-significance in the multivariate regression.

4. Number of Installments ↔︎ In Arrears (ρ = +0.136, small positive, p < 0.001) More installments are associated with higher arrears probability. With ρ = 0.136, the raw correlation is modest, but the installment coefficient in the regression (OR = 1.32, p < 0.001) confirms this relationship holds independently when other variables are controlled.

Causation note: The installments–arrears correlation likely reflects borrower credit quality (weaker borrowers may negotiate longer repayment windows), not a direct effect of installment count on default behaviour. Assigning more installments does not cause arrears — both may be driven by the borrower’s underlying financial strength. Controlling experiments would be needed to establish causation.

9. Technique 5 — Logistic Regression

9.1 Theory

Logistic regression models the log-odds of a binary outcome as a linear function of predictor variables. It is appropriate when the outcome is categorical (arrears = 1, performing = 0). Coefficients are exponentiated to yield odds ratios (ORs): OR > 1 increases arrears odds; OR < 1 decreases them. Performance is assessed via AUC — the probability that the model ranks a randomly chosen arrears loan above a randomly chosen performing loan. An AUC of 0.668 means the model achieves this 66.8% of the time, compared to 50% by random chance (Adi, 2026, Ch. 13).

Note on perfect separation: BL Deferred Principal loans have 0% arrears across all 32 loans. Including this product as a dummy predictor causes complete separation — the algorithm would assign an infinitely negative coefficient with unreliable standard errors. This product is therefore excluded from the regression model but fully documented in Sections 5 and 6. Its 0% arrears rate is an important finding: the Deferred Principal structure is associated with near-zero credit loss.

9.2 Business Justification

The operational goal is a pre-disbursement risk score using only information available at the point of loan approval. With an AUC of 0.668, this model correctly ranks a risky loan above a safe loan about 67% of the time — meaningfully better than guessing, and sufficient to serve as a triage tool flagging applications for secondary review.

9.3 Model

Code

library(broom)
library(pROC)
library(caret)

model_df <- df |>
  filter(loan_name_short != "BL Deferred Principal") |>
  mutate(
    loan_type       = relevel(loan_type, ref = "New"),
    loan_name_short = relevel(
      factor(loan_name_short,
             levels=c("Business Loan Std","LPO Financing","BL Quarterly Deferred")),
      ref = "Business Loan Std")
  ) |>
  select(in_arrears, log_amount, interest_rate, days_to_first_repayment,
         num_installments, loan_type, loan_name_short) |>
  drop_na()

logit_mod <- glm(
  in_arrears ~ log_amount + interest_rate + days_to_first_repayment +
               num_installments + loan_type + loan_name_short,
  data=model_df, family=binomial(link="logit")
)

tidy(logit_mod, exponentiate=TRUE, conf.int=TRUE) |>
  mutate(
    term = str_replace_all(term, c(
      "log_amount"                           = "Loan amount (log ₦)",
      "interest_rate"                        = "Interest rate (%/month)",
      "days_to_first_repayment"              = "Days to first repayment",
      "num_installments"                     = "Number of installments",
      "loan_typeRenewal"                     = "Loan type: Renewal [ref=New]",
      "loan_typeTop Up"                      = "Loan type: Top Up [ref=New]",
      "loan_name_shortLPO Financing"         = "Product: LPO Financing [ref=BL Std]",
      "loan_name_shortBL Quarterly Deferred" = "Product: BL Qtrly Deferred [ref=BL Std]"
    )),
    sig = case_when(p.value<0.001~"***", p.value<0.01~"**",
                    p.value<0.05~"*", TRUE~"")
  ) |>
  select(Term=term, OR=estimate, `CI Low`=conf.low,
         `CI High`=conf.high, `p-value`=p.value, Sig=sig) |>
  mutate(across(where(is.numeric), ~round(.,4))) |>
  kable(caption="Table 10: Logistic regression — odds ratios (highlighted rows: p < 0.05)") |>
  kable_styling(bootstrap_options=c("striped","hover","condensed"), full_width=FALSE) |>
  row_spec(which(tidy(logit_mod)$p.value < 0.05), bold=TRUE, background="#FFF3CD")

Table 10: Logistic regression — odds ratios (highlighted rows: p < 0.05)
Term	OR	CI Low	CI High	p-value	Sig
(Intercept)	24.9642	0.0139	53648.4513	0.4050
Loan amount (log ₦)	0.6394	0.4419	0.9097	0.0150	*
Interest rate (%/month)	1.1790	0.7780	1.7806	0.4355
Days to first repayment	0.9937	0.9424	1.0412	0.8054
Number of installments	1.3196	1.1258	1.5426	0.0005	***
Loan type: Renewal [ref=New]	1.0132	0.5577	1.8048	0.9649
Loan type: Top Up [ref=New]	0.9854	0.4717	1.9665	0.9677
Product: LPO Financing [ref=BL Std]	2.4230	0.5407	10.4983	0.2398
Product: BL Qtrly Deferred [ref=BL Std]	0.6815	0.1450	2.3460	0.5784

Code

pred_prob  <- predict(logit_mod, type="response")
pred_class <- if_else(pred_prob >= 0.5, 1L, 0L)

conf_mat <- confusionMatrix(
  factor(pred_class, levels=c(0,1)),
  factor(model_df$in_arrears, levels=c(0,1)),
  positive="1")

cat("=== Confusion Matrix ===\n"); print(conf_mat$table)

=== Confusion Matrix ===

          Reference
Prediction   0   1
         0 521  75
         1   0   1

Code

cat(glue("\nAccuracy:    {round(conf_mat$overall['Accuracy']*100,1)}%
Sensitivity: {round(conf_mat$byClass['Sensitivity']*100,1)}%
Specificity: {round(conf_mat$byClass['Specificity']*100,1)}%\n"))

Accuracy:    87.4%
Sensitivity: 1.3%
Specificity: 100%

Code

roc_obj <- roc(model_df$in_arrears, pred_prob, quiet=TRUE)
cat(glue("AUC: {round(auc(roc_obj),4)}\n"))

AUC: 0.668

Code

plot(roc_obj, col="#378ADD", lwd=2,
     main="Figure 10: ROC curve — arrears prediction model",
     print.auc=TRUE)
abline(a=0, b=1, lty=2, col="#888780")

Code

tidy(logit_mod, conf.int=TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(
    term = str_replace_all(term, c(
      "log_amount"="Loan amount (log)",
      "interest_rate"="Interest rate",
      "days_to_first_repayment"="Days to 1st repayment",
      "num_installments"="Number of installments",
      "loan_typeRenewal"="Type: Renewal",
      "loan_typeTop Up"="Type: Top Up",
      "loan_name_shortLPO Financing"="Product: LPO Financing",
      "loan_name_shortBL Quarterly Deferred"="Product: BL Qtrly Deferred"
    )),
    significant = p.value < 0.05
  ) |>
  ggplot(aes(x=estimate, xmin=conf.low, xmax=conf.high,
             y=reorder(term,estimate), colour=significant)) +
  geom_pointrange(linewidth=0.9, size=0.6) +
  geom_vline(xintercept=0, linetype="dashed", colour="#888780") +
  scale_colour_manual(values=c("TRUE"="#E24B4A","FALSE"="#888780"),
                      labels=c("TRUE"="p < 0.05","FALSE"="p ≥ 0.05"), name="") +
  labs(title="Figure 11: Logistic regression coefficient plot",
       subtitle="Red = statistically significant  |  Right of zero = higher arrears risk  |  Bars = 95% CI",
       x="Log-odds coefficient", y=NULL) +
  theme_minimal(base_size=11) + theme(legend.position="bottom")

Code

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report
import warnings; warnings.filterwarnings("ignore")

df_m = df_py[df_py["loan_name_short"] != "BL Deferred Principal"].dropna(
    subset=["days_to_first_repayment"]).copy()
df_m["lt_enc"] = pd.Categorical(df_m["loan_type"]).codes
df_m["ln_enc"] = pd.Categorical(df_m["loan_name_short"]).codes

X = df_m[["log_amount","interest_rate","days_to_first_repayment",
          "num_installments","lt_enc","ln_enc"]]
y = df_m["in_arrears"]

lr = LogisticRegression(max_iter=2000, random_state=42)
lr.fit(X, y)

LogisticRegression(max_iter=2000, random_state=42)

Code

y_prob = lr.predict_proba(X)[:,1]
y_pred = lr.predict(X)

print(f"AUC: {roc_auc_score(y, y_prob):.4f}")

AUC: 0.6712

Code

print(f"\nClassification Report:\n{classification_report(y, y_pred)}")


Classification Report:
              precision    recall  f1-score   support

           0       0.87      1.00      0.93       521
           1       0.00      0.00      0.00        76

    accuracy                           0.87       597
   macro avg       0.44      0.50      0.47       597
weighted avg       0.76      0.87      0.81       597

Code

print(f"\nConfusion Matrix:\n{confusion_matrix(y, y_pred)}")


Confusion Matrix:
[[521   0]
 [ 76   0]]

Code

coefs = pd.DataFrame({"Feature":X.columns,"Coefficient":lr.coef_[0]}).sort_values(
    "Coefficient", ascending=False)
print(f"\nCoefficients:\n{coefs.to_string(index=False)}")


Coefficients:
                Feature  Coefficient
                 ln_enc     0.529550
       num_installments     0.257540
          interest_rate     0.101485
                 lt_enc    -0.005979
days_to_first_repayment    -0.006037
             log_amount    -0.448298

9.4 Business Interpretation of Significant Coefficients

The model achieves AUC = 0.668, using only information available at the point of disbursement. Two predictors are statistically significant at α = 0.05:

Number of Installments (OR = 1.32, p < 0.001): Each additional installment multiplies the odds of arrears by 1.32, holding all other factors constant. For a non-technical manager: “A loan structured over 9 monthly installments is approximately 1.32³ = 2.3× more likely to fall into arrears than an equivalent 6-installment loan. A 12-installment loan is 1.32⁶ = 5.3× riskier. Number of installments is the most reliable arrears signal available at approval time.”

Log Loan Amount (OR = 0.64, p = 0.015): Each unit increase in log loan amount reduces arrears odds by 36%. In plain terms: larger loans are less likely to default. Loans below ₦10M carry a 42.9% arrears rate, while loans above ₦100M carry only 4.9%. For the credit committee: “Smaller facilities carry disproportionate credit risk — likely because smaller borrowers have less financial resilience and receive less intensive underwriting. Size is a meaningful pre-disbursement risk signal.”

Deployment recommendation: Implement two pre-disbursement flags in the loan approval checklist: (1) Number of installments ≥ 9 → mandatory secondary credit committee review; (2) Loan amount below ₦10M → enhanced financial documentation required. Both are based on information already captured in the LMS at the point of application. No new data infrastructure is needed.

10. Integrated Findings

The five analytical techniques form a chain of evidence, each reinforcing and extending the others.

EDA established the data landscape — 629 loans over twelve clean calendar months — and correctly classified 331 missing closure dates as open loans, not data errors. It also profiled Number of Installments for the first time, revealing that while most loans use 6 installments (73.9%), the risk-significant tails at 9 and 12 installments carry arrears rates nearly three times the portfolio average. Visualisation converted these patterns into strategic clarity: BL Deferred Principal loans (0% arrears across 32 loans) and 12-installment facilities (33.3% arrears) represent the two ends of the risk spectrum.

Hypothesis testing added statistical rigour — confirming that tiered pricing across products is real (Kruskal-Wallis p < 0.001), and delivering an equally important null result: loan lifecycle stage does not predict arrears (p = 0.547). This null finding prevents a misguided policy intervention and illustrates that hypothesis testing’s value lies as much in ruling things out as in confirming them. Correlation analysis revealed a new structural relationship: installment count and rate are positively correlated (ρ = +0.372), suggesting the organisation’s product design already partially prices installment risk — but without making this explicit as a disbursement-stage gate. Logistic regression unified all variables into a predictive framework, with Number of Installments (OR = 1.32) and Loan Amount (OR = 0.64) as the only two statistically significant independent predictors. AUC = 0.668 confirms this constitutes a useful, deployable triage tool.

Single integrated recommendation: Introduce a Pre-Disbursement Risk Checklist with two mandatory escalation triggers: (1) any loan application with 9 or more installments → secondary credit committee sign-off required; (2) any loan below ₦10M → enhanced financial statement and collateral documentation required. These rules target both statistically confirmed risk predictors, are operationally actionable without new systems, and are defensible to regulators on the basis of the evidence assembled across all five techniques in this study.

11. Limitations & Further Work

1. BL Deferred Principal excluded from regression. The 0% arrears rate causes perfect separation in the logistic model, requiring exclusion of this product. With a larger historical sample, Firth’s penalised logistic regression would handle this without dropping the product.

2. Class imbalance (88% performing, 12% arrears). The model underperforms at identifying true arrears cases (low sensitivity). With more time, SMOTE oversampling or cost-sensitive learning would improve recall for the arrears class — critical when the cost of a missed default exceeds the cost of a false alarm.

3. Open loans may not yet have revealed their final state. Of the 331 still-open loans (51.5% of the portfolio), some Active loans may eventually fall into arrears after the extraction date. The true 12-month arrears rate could therefore be higher than 12.08% once all loans close. A follow-up extraction in 6 months would provide a more complete picture.

4. No borrower-level covariates. Industry sector, years in operation, prior credit history, and collateral type — standard credit application fields — are absent. Incorporating these from credit files would likely push AUC above 0.80 and enable more precise individual loan scoring.

5. Installment count determinants are unobserved. Understanding why some borrowers receive 9–12 installments (negotiated? product-driven? credit committee discretion?) would clarify whether the arrears association is causal or mediated by borrower credit quality. This warrants qualitative investigation alongside the quantitative evidence.

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

[Anurika Orabuche]. (2026). Business loan portfolio dataset — May 2025 to April 2026 [Dataset]. Collected from a Nigerian commercial lending institution. Data available on request from the author.

McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.

Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wilkinson, L. (2005). The grammar of graphics (2nd ed.). Springer.

Appendix: AI Usage Statement

Claude (Anthropic) was used during the preparation of this assignment to assist with structuring the Quarto document template, generating R and Python code scaffolding for the five analytical sections, and advising on the appropriate handling of perfect separation in the logistic regression. All analytical decisions — the choice of a twelve-month study window, the correct interpretation of missing closure dates as open loans, the identification of Number of Installments as the primary new predictor variable, the choice and justification of each hypothesis test, the interpretation of the null result for Loan Type, the interpretation of regression coefficients, the integrated recommendation regarding installment count and loan size thresholds, and all limitations identified — were made independently by the author based on domain knowledge and direct review of all model outputs. The dataset was collected, extracted, and verified by the author. The organisation name has been withheld to protect commercial confidentiality; no other data modifications were made. The author takes full responsibility for all conclusions presented and is prepared to explain and defend every result in the viva voce examination.

Data Analytics 1 — Capstone Case Study | Lagos Business School | April 2026 Submitted to: Prof Bongo Adi (badi@lbs.edu.ng)