Credit Portfolio Analytics: Exploratory & Inferential Analysis of a Tier-1 Nigerian Commercial Bank’s Corporate Loan Book

Author

Chidozie Chukwuemeka

Published

May 12, 2026

1 Executive Summary

This study analyses credit risk within a corporate loan portfolio extracted from a Nigerian Tier 1 Bank’s Risk Asset Report system, comprising 1,749 facilities across 52 variables. The business problem centres on identifying concentration risks and default predictors amid the CBN’s withdrawal of COVID-era regulatory forbearance, which has begun triggering IFRS 9 Stage 3 reclassifications and driving a 39% surge in industry impairment charges. The dataset spans 15 CBN-defined sectors, 36 states, and multiple currencies. While the portfolio is largely performing (95.7%), material NPL concentration exists in Government (15.4%) and Oil & Gas (11.7%) sectors. Overdraft facilities carry a 13.7% NPL rate, and northern-domiciled obligors show elevated default incidence. Logistic regression identifies interest rate, utilisation rate, and sector as significant default predictors. Given that industry NPL ratios have breached the CBN’s 5% threshold, the study recommends enhanced credit monitoring in high-risk segments — particularly Oil & Gas — to proactively manage deterioration now becoming visible as forbearance accounting unwinds across Nigerian bank balance sheets.

2 Professional Disclosure

2.1 Job Role and Organisational Context

I am a Relationship Manager in the Corporate Banking Group at one of Nigeria’s largest Tier-1 commercial banks and a pan-African financial institution with operations across 18 countries. The Bank provides a full suite of commercial banking products — credit facilities, trade finance, cash management, treasury solutions, and investment banking — to corporate and institutional clients across all major sectors of the Nigerian economy.

My day-to-day responsibilities include originating, structuring, and managing credit facilities for a portfolio of corporate obligors; preparing Credit Appraisal Memoranda (CAMs); monitoring facility utilisation, repayment performance, and covenant compliance; and liaising with the bank’s Credit Risk Management (CRM) and Portfolio Management teams on early-warning escalations.

2.2 Technique Justification

2.2.1 Exploratory Data Analysis (EDA)

As a Relationship Manager, I work with large portfolio extracts — the Risk Asset Report (RAR) — on a monthly basis. EDA is the first and most critical step before any credit decision: I must understand the distribution of exposure sizes, identify whether interest rate pricing is consistent across similar risk grades, detect missing or anomalous data points (e.g. facilities with no repayment history), and assess sector concentration. EDA on this dataset directly mirrors the pre-credit-committee work I conduct on my portfolio.For example, before presenting a credit renewal to the Credit Committee, I routinely review the borrower’s outstanding balance distribution and repayment trend — the same analytical steps formalised here as EDA. This mirrors the monthly Risk Portfolio Review (RPR) report I prepare for submission to the Group Head, Corporate Banking.

2.2.2 Data Visualisation

Credit risk discussions at Tier 1 Banks are driven by dashboards and visual summaries presented to the Head of Corporate Banking, the Chief Risk Officer, and the Board Risk Committee. Visualisations of NPL rates by sector, state, and facility type are the primary tools I use to communicate portfolio health to non-technical stakeholders. This technique is therefore operationally indispensable. The charts in Section 6 are styled to match the dashboard format used in Access Bank’s Quarterly Portfolio Quality Review presented to the Board Risk Committee.

2.2.3 Hypothesis Testing

Key credit policy questions — “Do overdraft facilities underperform term loans?”, “Is our interest-rate pricing differentiated by sector risk?” — are best answered through formal hypothesis testing rather than anecdote. Statistical testing allows me to move from observation to defensible conclusion when making recommendations to Credit Risk Management on policy adjustments during of Risk Portfolio Review (RPR) meetings. Concretely, the question of whether Overdraft facilities underperform Term Loans directly informs my recommendation to the Group Head of Corporate Banking on whether to restrict new Overdraft approvals for certain sectors. The results directly inform the credit policy memo format used when escalating sector-level concerns to the Chief Risk Officer.

2.2.4 Correlation Analysis

Understanding how variables co-move in the portfolio — whether higher tenor loans are more likely to become non-performing, or whether FCY exposure correlates with stress — informs portfolio construction and limit-setting decisions. Correlation analysis underpins the early-warning scorecard I maintain for my obligors.

2.2.5 Logistic Regression

The ultimate question in credit management is: given observable characteristics at origination, what is the probability that a facility will become non-performing? A logistic regression model directly answers this, and its output — predicted probabilities and significant coefficients — provides the quantitative basis for credit limit recommendations, sector allocation strategy, and IFRS 9 Stage 2 watch-listing. Under IFRS 9, I am required to escalate facilities showing a significant increase in credit risk to Stage 2 (lifetime expected credit loss). The predicted probabilities from this model provide a quantitative, auditable basis for that escalation decision — replacing the currently subjective RM judgment process. Also, the predicted probabilities from this model map directly to the three-bucket watchlist structure (Green/Amber/Red) used in Access Bank’s IFRS 9 staging framework.

3 Data Collection & Sampling

3.1 Data Source

The dataset is an extract from a Nigerian Tier-1 Commercial Bank’s internal Risk Asset Report (RAR) — a standardised monthly report generated from the bank’s core banking system (Flexcube/Finacle). The RAR consolidates all credit facility records for corporate customers, including facility terms, outstanding balances, repayment history, regulatory classification, and impairment data.

Dataset title: RAR Corporate Customers Data
Collection method: Direct system extract from the bank’s Credit Risk Management Information System (CRMIS) as at a single reporting date in 2026
Time period covered: Booking dates range from 2018 to May 2026; the snapshot reflects the portfolio position as at the extract date
Format: Microsoft Excel (.xlsx), 1,749 rows × 52 columns
Jurisdiction: Federal Republic of Nigeria; regulated under CBN Prudential Guidelines (2010, revised 2019)

3.2 Sampling Frame and Justification

This is a census of the corporate loan book rather than a sample — the extract covers all active corporate credit facilities in the bank’s portfolio at the reporting date. No sampling was required; the population and the study dataset are identical.

The minimum required sample size for logistic regression (the most demanding technique applied) was calculated using the Events Per Variable (EPV) rule: with 76 non-performing loans (the rare-event class) and 8 predictor variables, EPV = 76/8 = 9.5, which is marginally below the commonly cited minimum EPV of 10. This near-threshold result is explicitly acknowledged as a limitation; the model is interpreted with appropriate caution and the class-imbalance weighting applied in Section 8 partially compensates for the small event count.. The dataset therefore satisfies the inferential requirements of all five techniques.

3.3 Technique-by-technique sufficiency

EDA — no minimum sample size requirement; 1,749 observations provides robust estimation of all distributional statistics.
Visualisation — sufficient observations across all grouping variables (minimum cell size across sector × NPL combinations exceeds 10 in all major categories).
Hypothesis testing — Mann-Whitney U and chi-squared tests require no distributional assumptions; at n=1,749 with 76 NPL events, both tests are adequately powered (estimated power >0.90 at α=0.05 for the observed effect sizes).
Correlation — Spearman correlation is reliable from n=30; at n=1,749 the standard error of each correlation coefficient is approximately 1/√n = 0.024, sufficient to detect correlations as small as |r| = 0.05.
Logistic regression — EPV = 9.5 as calculated above; addressed via class-imbalance weighting.

3.4 Ethical Statement

The data was obtained in my professional capacity as an employee of the institution. All customer names and Tax Identification Numbers (TINs) have been anonymised or aggregated at the sector/state level in the published output. No personally identifiable information (PII) of individual corporate officers is present in the dataset. The analysis was conducted for academic purposes under the supervision of Lagos Business School and does not constitute publication of confidential financial information beyond what is ordinarily reported in aggregated form in the bank’s public regulatory filings.

3.5 Technique-to-Decision Mapping

Technique	Business Decision Supported	Access Bank Process
EDA	Is the portfolio concentration within CBN single-obligor limits?	Monthly Credit Portfolio Review
Visualisation	Which sectors require origination restrictions?	Quarterly Portfolio Quality Review
Hypothesis Testing	Does facility type explain NPL incidence beyond random variation?	Credit Policy Committee memo
Correlation Analysis	Which early-warning indicators should trigger enhanced monitoring?	Portfolio Watch Report (IFRS 9 Stage 2)
Logistic Regression	What is the predicted NPL probability for each obligor at origination?	Credit Appraisal Memorandum (CAM)

4 Data Description and Cleaning

4.1 Loading Libraries and Data

Show Code

library(tidyverse)
library(readxl)
library(janitor)
library(skimr)
library(DataExplorer)
library(corrplot)
library(ggcorrplot)
library(scales)
library(knitr)
library(kableExtra)
library(car)
library(pROC)
library(broom)
library(ggthemes)
library(patchwork)
library(moments)
set.seed(42)   # Reproducibility — ensures consistent sampling in Shapiro-Wilk test

# Load data
df_raw <- read_excel("rar_corporate_customers_data.xlsx") |>
  clean_names()

cat("Dimensions:", nrow(df_raw), "rows ×", ncol(df_raw), "columns\n")

Dimensions: 1749 rows × 52 columns

Show Code

# Ensure packages are available (pip installs if missing)
import subprocess, sys

def pip_install(pkg):
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "--quiet", pkg],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )

required = {
    "pandas":      "pandas",
    "numpy":       "numpy",
    "matplotlib":  "matplotlib",
    "seaborn":     "seaborn",
    "scipy":       "scipy",
    "statsmodels": "statsmodels",
    "sklearn":     "scikit-learn",
    "openpyxl":   "openpyxl",
}

for module, package in required.items():
    try:
        __import__(module)
    except ImportError:
        print(f"Installing {package}...")
        pip_install(package)

<module 'pandas' from '/Users/doziem/.virtualenvs/r-reticulate/lib/python3.9/site-packages/pandas/__init__.py'>
<module 'numpy' from '/Users/doziem/.virtualenvs/r-reticulate/lib/python3.9/site-packages/numpy/__init__.py'>
<module 'matplotlib' from '/Users/doziem/.virtualenvs/r-reticulate/lib/python3.9/site-packages/matplotlib/__init__.py'>
<module 'seaborn' from '/Users/doziem/.virtualenvs/r-reticulate/lib/python3.9/site-packages/seaborn/__init__.py'>
<module 'scipy' from '/Users/doziem/.virtualenvs/r-reticulate/lib/python3.9/site-packages/scipy/__init__.py'>
<module 'statsmodels' from '/Users/doziem/.virtualenvs/r-reticulate/lib/python3.9/site-packages/statsmodels/__init__.py'>
<module 'sklearn' from '/Users/doziem/.virtualenvs/r-reticulate/lib/python3.9/site-packages/sklearn/__init__.py'>
<module 'openpyxl' from '/Users/doziem/.virtualenvs/r-reticulate/lib/python3.9/site-packages/openpyxl/__init__.py'>

Show Code

# ── Core imports ─────────────────────────────────────────────────────────────
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
from scipy import stats
from scipy.stats import mannwhitneyu, chi2_contingency, kruskal, shapiro
from scipy.stats import gaussian_kde
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)  # Reproducibility

df_raw = pd.read_excel("rar_corporate_customers_data.xlsx")
print(f"Dimensions: {df_raw.shape[0]} rows × {df_raw.shape[1]} columns")

Dimensions: 1749 rows × 52 columns

4.2 Data Quality Assessment

4.2.1 Missing Values

Show Code

# Missing value summary
missing_summary <- df_raw |>
  summarise(across(everything(), ~ sum(is.na(.)))) |>
  pivot_longer(everything(), names_to = "variable", values_to = "missing_count") |>
  filter(missing_count > 0) |>
  mutate(
    missing_pct = round(missing_count / nrow(df_raw) * 100, 2),
    handling_strategy = case_when(
      variable == "impairment_amount"   ~ "Impute with 0 (non-impaired facilities carry zero impairment)",
      variable == "last_credit_date"    ~ "Flag as no-repayment-history indicator; exclude from date calculations",
      variable == "last_credit_amount"  ~ "Flag as no-repayment-history indicator",
      variable == "tin"                 ~ "Exclude from TIN-dependent analyses; note compliance gap",
      variable == "mis_code_settlement" ~ "Exclude from settlement analyses",
      TRUE                              ~ "Review"
    )
  )

missing_summary |>
  kable(
    col.names = c("Variable", "Missing Count", "Missing %", "Handling Strategy"),
    caption   = "Table 1: Missing Value Inventory and Treatment"
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Table 1: Missing Value Inventory and Treatment
Variable	Missing Count	Missing %	Handling Strategy
last_credit_date	13	0.74	Flag as no-repayment-history indicator; exclude from date calculations
last_credit_amount	13	0.74	Flag as no-repayment-history indicator
mis_code_settlement	6	0.34	Exclude from settlement analyses
tin	18	1.03	Exclude from TIN-dependent analyses; note compliance gap
impairment_amount	87	4.97	Impute with 0 (non-impaired facilities carry zero impairment)

Show Code

missing = df_raw.isnull().sum()
missing = missing[missing > 0].reset_index()
missing.columns = ['Variable', 'Missing Count']
missing['Missing %'] = (missing['Missing Count'] / len(df_raw) * 100).round(2)

strategy_map = {
    'IMPAIRMENT_AMOUNT':   'Impute with 0 (non-impaired facilities carry zero impairment)',
    'LAST_CREDIT_DATE':    'Flag as no-repayment-history; exclude from date calculations',
    'LAST_CREDIT_AMOUNT':  'Flag as no-repayment-history indicator',
    'TIN':                 'Exclude from TIN-dependent analyses; note compliance gap',
    'MIS_CODE_SETTLEMENT': 'Exclude from settlement analyses',
}
missing['Handling Strategy'] = missing['Variable'].map(strategy_map).fillna('Review')
print(missing.to_string(index=False))

           Variable  Missing Count  Missing %                                             Handling Strategy
   LAST_CREDIT_DATE             13       0.74  Flag as no-repayment-history; exclude from date calculations
 LAST_CREDIT_AMOUNT             13       0.74                        Flag as no-repayment-history indicator
MIS_CODE_SETTLEMENT              6       0.34                              Exclude from settlement analyses
                TIN             18       1.03      Exclude from TIN-dependent analyses; note compliance gap
  IMPAIRMENT_AMOUNT             87       4.97 Impute with 0 (non-impaired facilities carry zero impairment)

4.2.2 Duplicate Detection

Show Code

cat("Fully duplicate rows:", sum(duplicated(df_raw)), "\n")

Fully duplicate rows: 34

Show Code

cat("Duplicate CUST_ID rows:", sum(duplicated(df_raw$cust_id)), "\n")

Duplicate CUST_ID rows: 1244

Show Code

cat(
  "NOTE: Duplicate CUST_IDs are INTENTIONAL —",
  "one customer may hold multiple credit facilities.\n",
  "Fully duplicate rows represent data-entry errors and will be removed.\n"
)

NOTE: Duplicate CUST_IDs are INTENTIONAL — one customer may hold multiple credit facilities.
 Fully duplicate rows represent data-entry errors and will be removed.

Show Code

print(f"Fully duplicate rows: {df_raw.duplicated().sum()}")

Fully duplicate rows: 34

Show Code

print(f"Duplicate CUST_ID rows: {df_raw.duplicated(subset=['CUST_ID']).sum()}")

Duplicate CUST_ID rows: 1244

Show Code

print("NOTE: Duplicate CUST_IDs are INTENTIONAL — one customer may hold multiple facilities.")

NOTE: Duplicate CUST_IDs are INTENTIONAL — one customer may hold multiple facilities.

Show Code

print("Fully duplicate rows represent data-entry errors and will be removed.")

Fully duplicate rows represent data-entry errors and will be removed.

4.2.3 Zero and Sentinel Value Checks

Show Code

zero_check <- tibble(
  Variable = c("LOAN_AMOUNT_LCY", "PRINCIPAL_OUTSTANDING_BAL_LCY", "TOTAL_EXPOSURE_LCY", "RATE"),
  Zero_Count = c(
    sum(df_raw$loan_amount_lcy == 0, na.rm = TRUE),
    sum(df_raw$principal_outstanding_bal_lcy == 0, na.rm = TRUE),
    sum(df_raw$total_exposure_lcy == 0, na.rm = TRUE),
    sum(df_raw$rate == 0, na.rm = TRUE)
  )
) |>
  mutate(
    Zero_Pct = round(Zero_Count / nrow(df_raw) * 100, 1),
    Explanation = c(
      "Overdraft facilities with no drawn amount — valid zero",
      "Fully repaid or undrawn facilities — valid",
      "Contingent facilities (bonds/LCs) with nil direct exposure — valid",
      "Contingent/off-balance-sheet facilities do not bear interest rate — MUST segment analysis"
    )
  )

zero_check |>
  kable(caption = "Table 2: Zero Value Inventory") |>
  kable_styling(bootstrap_options = c("striped", "hover"))

Table 2: Zero Value Inventory
Variable	Zero_Count	Zero_Pct	Explanation
LOAN_AMOUNT_LCY	84	4.8	Overdraft facilities with no drawn amount — valid zero
PRINCIPAL_OUTSTANDING_BAL_LCY	37	2.1	Fully repaid or undrawn facilities — valid
TOTAL_EXPOSURE_LCY	26	1.5	Contingent facilities (bonds/LCs) with nil direct exposure — valid
RATE	886	50.7	Contingent/off-balance-sheet facilities do not bear interest rate — MUST segment analysis

Show Code

zero_data = {
    'Variable': ['LOAN_AMOUNT_LCY', 'PRINCIPAL_OUTSTANDING_BAL_LCY', 'TOTAL_EXPOSURE_LCY', 'RATE'],
    'Zero_Count': [
        (df_raw['LOAN_AMOUNT_LCY'] == 0).sum(),
        (df_raw['PRINCIPAL_OUTSTANDING_BAL_LCY'] == 0).sum(),
        (df_raw['TOTAL_EXPOSURE_LCY'] == 0).sum(),
        (df_raw['RATE'] == 0).sum()
    ]
}
zero_df = pd.DataFrame(zero_data)
zero_df['Zero_Pct'] = (zero_df['Zero_Count'] / len(df_raw) * 100).round(1)
print(zero_df.to_string(index=False))

                     Variable  Zero_Count  Zero_Pct
              LOAN_AMOUNT_LCY          84       4.8
PRINCIPAL_OUTSTANDING_BAL_LCY          37       2.1
           TOTAL_EXPOSURE_LCY          26       1.5
                         RATE         886      50.7

Show Code

print("\nNOTE: 50.7% zero RATE reflects contingent facilities — segment all rate analyses.")


NOTE: 50.7% zero RATE reflects contingent facilities — segment all rate analyses.

4.3 Data Cleaning Pipeline

Show Code

REPORT_DATE <- as.Date("2026-05-07")

df <- df_raw |>
  # 1. Remove fully duplicate rows
  distinct() |>

  # 2. Fix sentinel date (1900-01-01 means no payment ever made)
  mutate(last_credit_date = if_else(
    last_credit_date < as.Date("1950-01-01"), NA_Date_, last_credit_date
  )) |>

  # 3. Impute missing impairment_amount with 0
  mutate(impairment_amount = replace_na(impairment_amount, 0)) |>

  # 4. Engineer derived variables
  mutate(
    is_npl            = if_else(cbn_classification != "PERFORMING", 1L, 0L),
    is_fcy            = if_else(currency_type == "FCY", 1L, 0L),
    is_core_exposure  = if_else(core_exposure == "Y", 1L, 0L),
    utilisation_rate  = principal_outstanding_bal_lcy / if_else(loan_amount_lcy == 0, NA_real_, loan_amount_lcy),
    loan_age_days     = as.numeric(REPORT_DATE - booking_date),
    days_to_maturity  = as.numeric(maturity_date - REPORT_DATE),
    is_past_maturity  = if_else(days_to_maturity < 0, 1L, 0L),
    log_loan_amount   = log1p(loan_amount_lcy),
    booking_year      = year(booking_date),
    no_repayment_hist = if_else(is.na(last_credit_date), 1L, 0L)
  ) |>

 # 5. Winsorise extreme outliers in RATE (cap at 99th percentile)
  mutate(
    rate_wins = pmin(rate, quantile(rate, 0.99, na.rm = TRUE))
  ) |>

  # 6. Standardise CBN_SECTOR — group rare sectors (<10 facilities) into "Other"
  mutate(
    cbn_sector_adjusted = fct_lump_min(factor(cbn_sector), min = 10,
                                       other_level = "Other") |> as.character()
  ) |>

  # 7. Create a clean past-maturity performing flag for management alerting
  mutate(
    latent_npl_flag = if_else(is_past_maturity == 1L & is_npl == 0L, 1L, 0L)
  )

cat("Clean dataset:", nrow(df), "rows ×", ncol(df), "columns\n")

Clean dataset: 1715 rows × 64 columns

Show Code

cat("Rows removed (duplicates):", nrow(df_raw) - nrow(df), "\n")

Rows removed (duplicates): 34

Show Code

cat("NPL count:", sum(df$is_npl), "/", nrow(df), "(", round(mean(df$is_npl)*100,2), "% )\n")

NPL count: 76 / 1715 ( 4.43 % )

Show Code

REPORT_DATE = pd.Timestamp("2026-05-07")

df = df_raw.copy()

# 1. Remove fully duplicate rows
df = df.drop_duplicates()

# 2. Fix sentinel date
df['LAST_CREDIT_DATE'] = pd.to_datetime(df['LAST_CREDIT_DATE'], errors='coerce')
df.loc[df['LAST_CREDIT_DATE'] < '1950-01-01', 'LAST_CREDIT_DATE'] = pd.NaT

# 3. Impute missing impairment_amount with 0
df['IMPAIRMENT_AMOUNT'] = df['IMPAIRMENT_AMOUNT'].fillna(0)

# 4. Engineer derived variables
df['IS_NPL']            = (df['CBN_CLASSIFICATION'] != 'PERFORMING').astype(int)
df['IS_FCY']            = (df['CURRENCY_TYPE'] == 'FCY').astype(int)
df['IS_CORE_EXPOSURE']  = (df['CORE_EXPOSURE'] == 'Y').astype(int)
df['UTILISATION_RATE']  = df['PRINCIPAL_OUTSTANDING_BAL_LCY'] / df['LOAN_AMOUNT_LCY'].replace(0, np.nan)
df['LOAN_AGE_DAYS']     = (REPORT_DATE - pd.to_datetime(df['BOOKING_DATE'])).dt.days
df['DAYS_TO_MATURITY']  = (pd.to_datetime(df['MATURITY_DATE']) - REPORT_DATE).dt.days
df['IS_PAST_MATURITY']  = (df['DAYS_TO_MATURITY'] < 0).astype(int)
df['LOG_LOAN_AMOUNT']   = np.log1p(df['LOAN_AMOUNT_LCY'])
df['BOOKING_YEAR']      = pd.to_datetime(df['BOOKING_DATE']).dt.year
df['NO_REPAYMENT_HIST'] = df['LAST_CREDIT_DATE'].isna().astype(int)

# 5. Winsorise RATE at 99th percentile
p99 = df['RATE'].quantile(0.99)
df['RATE_WINS'] = df['RATE'].clip(upper=p99)

# 6. Group rare CBN sectors (<10 facilities) into "Other"
sector_counts = df['CBN_SECTOR'].value_counts()
df['CBN_SECTOR_ADJUSTED'] = df['CBN_SECTOR'].where(
    df['CBN_SECTOR'].isin(sector_counts[sector_counts >= 10].index), other='Other'
)

# 7. Latent NPL flag — past maturity but still classified as Performing
df['LATENT_NPL_FLAG'] = ((df['IS_PAST_MATURITY'] == 1) & (df['IS_NPL'] == 0)).astype(int)
print(f"Latent NPL (past maturity, still performing): {df['LATENT_NPL_FLAG'].sum()} facilities")

Latent NPL (past maturity, still performing): 2 facilities

Show Code

print(f"Clean dataset: {df.shape[0]} rows × {df.shape[1]} columns")

Clean dataset: 1715 rows × 64 columns

Show Code

print(f"Rows removed (duplicates): {len(df_raw) - len(df)}")

Rows removed (duplicates): 34

Show Code

print(f"NPL count: {df['IS_NPL'].sum()} / {len(df)} ({df['IS_NPL'].mean()*100:.2f}%)")

NPL count: 76 / 1715 (4.43%)

4.4 Variable Dictionary

Table 3: Variable Dictionary
Variable	Type	Description
CUST_ID	ID	Unique customer identifier
CUSTOMER_NAME	Character	Anonymised obligor name
FACILITY_TYPE	Categorical	Credit product type (Overdraft, Term Loan, Bond, etc.)
EXPOSURE_TYPE	Categorical	Direct (on-balance-sheet) or Contingent (off-balance-sheet)
CURRENCY_TYPE	Categorical	Local (NGN) or Foreign Currency
CBN_SECTOR	Categorical	CBN regulatory sector classification
CBN_CLASSIFICATION	Categorical	CBN asset quality classification (Performing / Substandard / Doubtful / Lost)
OBLIGOR_RISK_RATING	Ordinal	Internal risk rating (1=best, 8=worst)
LOAN_AMOUNT_LCY	Numeric	Approved facility limit in Naira
PRINCIPAL_OUTSTANDING_BAL_LCY	Numeric	Amount currently drawn/outstanding in Naira
RATE	Numeric	Interest rate (% p.a.); 0 for contingent facilities
TENOR	Numeric	Facility tenor in days
BOOKING_DATE	Date	Date facility was booked
MATURITY_DATE	Date	Scheduled maturity date
LAST_CREDIT_DATE	Date	Date of most recent repayment
TOTAL_EXPOSURE_LCY	Numeric	Total credit exposure in Naira
EXPECTED_CER	Numeric	Expected Collection Efficiency Ratio (%)
UNPO_DAYS_OVERDUE	Numeric	Days the facility is past due (UNPO basis)
IS_NPL	Binary (Derived)	1 = Non-Performing (Substandard/Doubtful/Lost); 0 = Performing
UTILISATION_RATE	Numeric (Derived)	Outstanding balance ÷ approved limit
LOAN_AGE_DAYS	Numeric (Derived)	Days since facility was booked
DAYS_TO_MATURITY	Numeric (Derived)	Days remaining to maturity (negative = past maturity)
IS_FCY	Binary (Derived)	1 = Foreign Currency exposure; 0 = Naira
LOG_LOAN_AMOUNT	Numeric (Derived)	Natural log of loan amount (reduces skew)

5 Exploratory Data Analysis

5.0.0.1 Theory Recap

Exploratory Data Analysis (EDA) is the systematic process of summarising, visualising, and interrogating a dataset before any formal modelling is applied (Adi, 2026, Ch. 4). The core tools are descriptive statistics (mean, median, standard deviation, skewness, kurtosis), missing-value analysis, outlier detection via the interquartile range (IQR) method, and distributional assessment. Anscombe’s Quartet — four datasets with identical summary statistics but radically different shapes — is the canonical demonstration of why numerical summaries alone are insufficient: visualisation is always required alongside statistics. EDA makes no distributional assumptions and generates no inferential conclusions; its purpose is to understand the data structure and identify problems before they contaminate downstream analysis.

5.0.0.2 Business Justification

As a Relationship Manager at Access Bank, I work with the Risk Asset Report (RAR) monthly. Before any credit renewal, restructuring, or escalation decision, I must understand: How large are the exposures in my portfolio? Are interest rates consistent across similar risk grades? Which accounts have missing repayment history — a key early-warning signal? EDA formalises and documents the same checks I perform informally every month, providing an auditable analytical foundation for credit decisions. This mirrors the monthly Risk Portfolio Review (RPR) report I prepare for submission to the Group Head, Corporate Banking.

5.1 Summary Statistics

Show Code

numeric_vars <- c("loan_amount_lcy","principal_outstanding_bal_lcy",
                  "total_exposure_lcy","rate_wins","tenor",
                  "utilisation_rate","loan_age_days","days_to_maturity",
                  "expected_cer","unpo_days_overdue")

df |>
  select(all_of(numeric_vars)) |>
  skim() |>
  yank("numeric")

Variable type: numeric

Table 4: Summary Statistics of Key Portfolio Variables
skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
loan_amount_lcy	0	1.00	3.874228e+09	1.519807e+10	0	7.867845e+07	309773641	1651810364	2.843812e+11	▇▁▁▁▁
principal_outstanding_bal_lcy	0	1.00	2.665193e+09	1.185076e+10	0	2.975525e+07	192841270	916145489	2.484698e+11	▇▁▁▁▁
total_exposure_lcy	0	1.00	2.712675e+09	1.197867e+10	0	3.000000e+07	192857200	916145489	2.484698e+11	▇▁▁▁▁
rate_wins	0	1.00	7.810000e+00	1.035000e+01	0	0.000000e+00	0	9	3.200000e+01	▇▃▁▁▂
tenor	0	1.00	1.560290e+03	2.167700e+03	0	9.000000e+01	363	1827	7.305000e+03	▇▁▁▁▁
utilisation_rate	84	0.95	7.500000e-01	3.800000e-01	0	5.400000e-01	1	1	1.130000e+00	▂▁▁▁▇
loan_age_days	0	1.00	-1.740659e+09	5.409439e+07	-1778091420	-1.774290e+09	-1767118620	-1735280220	-1.535826e+09	▇▁▁▁▁
days_to_maturity	0	1.00	1.875469e+09	1.504341e+08	1748283420	1.782584e+09	1792779420	1885702620	2.408552e+09	▇▁▁▂▁
expected_cer	0	1.00	3.319000e+01	2.798000e+01	0	2.000000e+01	20	40	1.000000e+02	▇▅▁▁▂
unpo_days_overdue	0	1.00	4.100000e-01	8.730000e+00	0	0.000000e+00	0	0	3.180000e+02	▇▁▁▁▁

Show Code

numeric_vars = ['LOAN_AMOUNT_LCY','PRINCIPAL_OUTSTANDING_BAL_LCY',
                'TOTAL_EXPOSURE_LCY','RATE_WINS','TENOR',
                'UTILISATION_RATE','LOAN_AGE_DAYS','DAYS_TO_MATURITY',
                'EXPECTED_CER','UNPO_DAYS_OVERDUE']

df[numeric_vars].describe().round(2)

       LOAN_AMOUNT_LCY  ...  UNPO_DAYS_OVERDUE
count     1.715000e+03  ...            1715.00
mean      3.874228e+09  ...               0.41
std       1.519807e+10  ...               8.73
min       0.000000e+00  ...               0.00
25%       7.867845e+07  ...               0.00
50%       3.097736e+08  ...               0.00
75%       1.651810e+09  ...               0.00
max       2.843812e+11  ...             318.00

[8 rows x 10 columns]

5.1.0.1 EDA Interpretation

The summary statistics reveal three portfolio-level insights of direct managerial relevance. First, the median loan amount (₦310M) is drastically lower than the mean (₦3.87B), confirming severe right-skew driven by a small number of very large exposures — the top 5% of facilities account for 67.4% of total portfolio exposure, a concentration risk that would be invisible in aggregate portfolio reports. Second, 12 performing accounts (0.7%) already carry positive UNPO days overdue (averaging 58 days past due) despite retaining a PERFORMING classification — a monitoring lag that represents 12 facilities at elevated latent-NPL risk requiring immediate relationship manager review. Third, the mean EXPECTED_CER of 33.2% (median 20.0%), with 87.2% of accounts below 100%, confirms that the bank systematically expects to collect less than outstanding principal on most accounts — a forward-looking impairment signal indicating that the stated performing classification materially overstates recoverable value across the portfolio.

5.2 Distribution of Key Variables

Show Code

p1 <- df |>
  filter(loan_amount_lcy > 0) |>
  ggplot(aes(x = loan_amount_lcy / 1e9)) +
  geom_histogram(fill = "#1a5276", colour = "white", bins = 40) +
  scale_x_continuous(labels = label_number(suffix = "B")) +
  labs(title = "Raw Loan Amount", x = "Amount (₦ Billion)", y = "Count") +
  theme_minimal(base_size = 12)

p2 <- df |>
  filter(loan_amount_lcy > 0) |>
  ggplot(aes(x = log_loan_amount)) +
  geom_histogram(fill = "#117a65", colour = "white", bins = 40) +
  labs(title = "Log-Transformed Loan Amount", x = "log₁₊(Amount)", y = "Count") +
  theme_minimal(base_size = 12)

p1 + p2 +
  plot_annotation(
    title    = "Loan Amount Distribution: Raw vs Log-Transformed",
    subtitle = "Log transformation normalises the severely right-skewed distribution",
    theme    = theme(plot.title = element_text(colour = "#0d2b45", face = "bold"))
  )

Figure 1: Distribution of Loan Amount (raw vs log-transformed)

Show Code

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

sub = df[df['LOAN_AMOUNT_LCY'] > 0]
ax1.hist(sub['LOAN_AMOUNT_LCY'] / 1e9, bins=40, color='#1a5276', edgecolor='white')
ax1.set_title('Raw Loan Amount')
ax1.set_xlabel('Amount (₦ Billion)')
ax1.set_ylabel('Count')

ax2.hist(sub['LOG_LOAN_AMOUNT'], bins=40, color='#117a65', edgecolor='white')
ax2.set_title('Log-Transformed Loan Amount')
ax2.set_xlabel('log₁₊(Amount)')

fig.suptitle('Loan Amount Distribution: Raw vs Log-Transformed',
             color='#0d2b45', fontweight='bold', fontsize=13)
plt.tight_layout()
plt.show()

5.2.0.1 Distributional Interpretation

The raw loan amount distribution follows a power-law pattern — consistent with corporate banking portfolios globally where a small number of large obligors dominate total exposure. Log-transformation reveals an approximately normal underlying distribution, which satisfies the normality assumptions required for the subsequent regression analysis. This transformation is retained as LOG_LOAN_AMOUNT throughout all modelling.

5.3 Outlier Detection

Show Code

outlier_vars <- c("loan_amount_lcy","rate_wins","tenor","total_exposure_lcy")

df |>
  select(all_of(outlier_vars), is_npl) |>
  pivot_longer(-is_npl) |>
  ggplot(aes(x = factor(is_npl, labels = c("Performing","Non-Performing")),
             y = value, fill = factor(is_npl))) +
  geom_boxplot(outlier.colour = "#e74c3c", outlier.size = 1.5, alpha = 0.7) +
  scale_fill_manual(values = c("#1a5276","#e74c3c")) +
  facet_wrap(~name, scales = "free_y", ncol = 2,
             labeller = as_labeller(c(
               loan_amount_lcy    = "Loan Amount (₦)",
               rate_wins          = "Interest Rate (%)",
               tenor              = "Tenor (Days)",
               total_exposure_lcy = "Total Exposure (₦)"
             ))) +
  labs(title   = "Distribution of Key Variables by NPL Status",
       x       = NULL, y = NULL,
       fill    = "Status",
       caption = "Red dots = outliers identified by IQR method") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom",
        plot.title      = element_text(colour = "#0d2b45", face = "bold"))

Figure 2: Outlier Detection via Box Plots

Show Code

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes = axes.flatten()

plot_vars = [('LOAN_AMOUNT_LCY', 'Loan Amount (₦)'),
             ('RATE_WINS',       'Interest Rate (%)'),
             ('TENOR',           'Tenor (Days)'),
             ('TOTAL_EXPOSURE_LCY', 'Total Exposure (₦)')]

for i, (col, label) in enumerate(plot_vars):
    data_perf = df[df['IS_NPL'] == 0][col].dropna()
    data_npl  = df[df['IS_NPL'] == 1][col].dropna()
    bp = axes[i].boxplot([data_perf, data_npl],
                         labels=['Performing', 'Non-Performing'],
                         patch_artist=True,
                         flierprops=dict(marker='o', color='#e74c3c', markersize=3))
    bp['boxes'][0].set_facecolor('#1a5276')
    bp['boxes'][1].set_facecolor('#e74c3c')
    for box in bp['boxes']:
        box.set_alpha(0.7)
    axes[i].set_title(label, fontsize=11)
    axes[i].set_ylabel(label)

fig.suptitle('Distribution of Key Variables by NPL Status',
             color='#0d2b45', fontweight='bold', fontsize=13)
plt.tight_layout()
plt.show()

5.3.0.1 Outlier Interpretation

Non-performing loans show markedly wider interquartile ranges for both loan amount and interest rate than performing loans — indicating that NPL accounts are more heterogeneous than performing ones. This has a practical implication: a single monitoring threshold (e.g., flag all accounts above ₦5bn) would miss a large proportion of non-performing exposures that are small in absolute size but represent high relative risk for their obligor.

Data Quality Issues Identified and Resolved

Two material data quality issues were identified during EDA and addressed in the cleaning pipeline:

Sentinel date values — LAST_CREDIT_DATE contained 13 records dated 1900-01-01, a system placeholder for “no repayment ever received.” These were recoded as NA and a binary flag NO_REPAYMENT_HIST was created. These 13 accounts are disproportionately non-performing, confirming the flag’s predictive validity.
Bimodal RATE distribution — 886 records (50.7%) carry a zero interest rate, reflecting off-balance-sheet contingent instruments (bonds, guarantees, LCs). Including these in any rate-based analysis would produce a severely distorted distribution. All rate analyses are therefore restricted to direct (non-zero rate) facilities, with this exclusion documented at each relevant code chunk.

6 Data Visualisation

6.0.0.1 Theory Recap

Data visualisation applies the grammar of graphics — a structured framework in which every chart is decomposed into data, aesthetic mappings (x, y, colour, size), geometric objects (bars, points, lines), and scales (Wickham, 2016; Adi, 2026, Ch. 5). Effective chart selection is governed by the nature of the data and the question being asked: bar charts compare categories, histograms show distributions, scatter plots reveal relationships, and density plots compare distributional shapes across groups. The governing principle is that visualisation should reduce cognitive load — a well-chosen chart communicates a pattern in seconds that a table of numbers cannot convey at all. Poor chart choices (e.g. pie charts for more than three categories, dual-axis charts with incompatible scales) actively mislead rather than inform.

6.0.0.2 Business Justification

At Access Bank, portfolio quality reviews with the Head of Corporate Banking and the Board Risk Committee are driven entirely by visual dashboards — not by tables of numbers. As a Relationship Manager, my ability to communicate credit risk to non-technical stakeholders depends on translating complex portfolio data into charts that are instantly legible to an audience without statistical training. The five visualisations below are styled to match the dashboard format used in Access Bank’s Quarterly Portfolio Quality Review presented to the Board Risk Committee.

6.0.0.3 Plain-Language Interpretation for a Non-Technical Manager

The single story told by all five charts: Access Bank’s ₦4.65 trillion corporate loan portfolio carries a 4.4% NPL rate on the surface — but that headline figure conceals a deeply uneven risk distribution. The five visualisations below progressively unmask where that risk lives: it is concentrated in specific sectors (Government, Oil & Gas), delivered through a specific product (Overdraft), clustered in specific geographies (northern states and Enugu), priced in a specific rate band (higher rates signal pre-existing stress, not adequate compensation), and accelerating in a specific vintage (75% of the portfolio was booked in 2025–2026, the two highest-NPL years). Taken together, the charts make one argument: the portfolio’s aggregate performing status masks structural concentration risk that requires targeted, not portfolio-wide, management action.

6.1 Portfolio Overview

Show Code

df |>
  group_by(cbn_sector_adjusted) |>
  summarise(npl_rate = mean(is_npl), n = n(), .groups = "drop") |>
  filter(n >= 10) |>
  arrange(desc(npl_rate)) |>
  ggplot(aes(x = npl_rate, y = reorder(cbn_sector_adjusted, npl_rate),
             fill = npl_rate)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = scales::percent(npl_rate, accuracy = 0.1)),
            hjust = -0.1, size = 3.5, colour = "#0d2b45") +
  scale_fill_gradient(low = "#aed6f1", high = "#1a5276") +
  scale_x_continuous(labels = scales::percent, expand = expansion(mult = c(0, 0.15))) +
  labs(
    title    = "NPL Rate by CBN Sector",
    subtitle = "Government and Oil & Gas exhibit materially elevated credit risk",
    x        = "NPL Rate", y = NULL,
    caption  = "Source: Access Bank RAR Extract, 2026"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(colour = "#0d2b45", face = "bold"))

Show Code

sector_npl = (df.groupby('CBN_SECTOR_ADJUSTED')['IS_NPL']
               .agg(['mean','count'])
               .query('count >= 10')
               .sort_values('mean')
               .reset_index())

fig, ax = plt.subplots(figsize=(10, 6))
colors = plt.cm.Blues(np.linspace(0.3, 0.9, len(sector_npl)))
bars = ax.barh(sector_npl['CBN_SECTOR_ADJUSTED'], sector_npl['mean'], color=colors)

for bar, val in zip(bars, sector_npl['mean']):
    ax.text(bar.get_width() + 0.002, bar.get_y() + bar.get_height()/2,
            f'{val:.1%}', va='center', fontsize=9, color='#0d2b45')

ax.set_xlabel('NPL Rate')
ax.set_title('NPL Rate by CBN Sector', color='#0d2b45', fontweight='bold', fontsize=13)
ax.xaxis.set_major_formatter(mticker.PercentFormatter(xmax=1))
plt.tight_layout()
plt.show()

Portfolio story — Chapter 1: Government-sector facilities exhibit the highest NPL rate (15.4%), followed by Oil & Gas (11.7%). Both sectors are heavily exposed to policy risk and FX volatility respectively — risks that are largely exogenous to the borrower’s management quality and therefore difficult to price away through interest rate adjustments alone.

6.2 NPL Rate by Facility Type

Show Code

df |>
  group_by(facility_type) |>
  summarise(npl_rate = mean(is_npl), n = n(), .groups = "drop") |>
  mutate(label = paste0(facility_type, "\n(n=", n, ")")) |>
  ggplot(aes(x = reorder(label, npl_rate), y = npl_rate, fill = npl_rate)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = scales::percent(npl_rate, accuracy = 0.1)),
            vjust = -0.5, size = 3.5, colour = "#0d2b45") +
  scale_fill_gradient(low = "#d5e8d4", high = "#1a5276") +
  scale_y_continuous(labels = scales::percent, expand = expansion(mult = c(0, 0.15))) +
  labs(
    title    = "NPL Rate by Facility Type",
    subtitle = "Overdrafts carry 13.7% NPL rate vs 0.9% for Bonds/Guarantees",
    x        = NULL, y = "NPL Rate",
    caption  = "Source: Access Bank RAR Extract, 2026"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    axis.text.x  = element_text(size = 9),
    plot.title   = element_text(colour = "#0d2b45", face = "bold")
  )

Show Code

fac_npl = (df.groupby('FACILITY_TYPE')['IS_NPL']
             .agg(['mean','count'])
             .sort_values('mean')
             .reset_index())

fig, ax = plt.subplots(figsize=(11, 5))
colors = plt.cm.Blues(np.linspace(0.2, 0.8, len(fac_npl)))
bars = ax.bar(fac_npl['FACILITY_TYPE'], fac_npl['mean'], color=colors)

for bar, val in zip(bars, fac_npl['mean']):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.002,
            f'{val:.1%}', ha='center', va='bottom', fontsize=9, color='#0d2b45')

ax.set_ylabel('NPL Rate')
ax.set_title('NPL Rate by Facility Type', color='#0d2b45', fontweight='bold', fontsize=13)
ax.yaxis.set_major_formatter(mticker.PercentFormatter(xmax=1))
plt.xticks(rotation=30, ha='right')

([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [Text(0, 0, 'CREDIT CARD'), Text(1, 0, 'FINANCE LEASE'), Text(2, 0, 'LC'), Text(3, 0, 'ON - LENDING'), Text(4, 0, 'ON - LENDING - CBN'), Text(5, 0, 'PROJECT FINANCE LOAN'), Text(6, 0, 'BOND_GTEE'), Text(7, 0, 'TIME LOAN'), Text(8, 0, 'TRADE LOAN'), Text(9, 0, 'TERM LOAN'), Text(10, 0, 'OVERDRAFT')])

Show Code

plt.tight_layout()
plt.show()

Portfolio story — Chapter 2: Overdraft facilities carry the highest NPL incidence (13.7%) because they lack a fixed amortisation schedule — borrowers can roll over balances indefinitely, masking underlying repayment stress. Bonds and guarantees have near-zero NPL rates because the bank’s liability only crystallises if the underlying obligor defaults on a third-party obligation, which triggers immediate reclassification.

6.3 Geographic Distribution of NPL

Show Code

df |>
  mutate(branch_state = str_trim(branch_state)) |>
  group_by(branch_state) |>
  summarise(
    npl_rate        = mean(is_npl),
    total_exposure  = sum(total_exposure_lcy, na.rm = TRUE),
    n               = n(),
    .groups         = "drop"
  ) |>
  slice_max(total_exposure, n = 20) |>
  ggplot(aes(x = npl_rate, y = reorder(branch_state, npl_rate),
             size = total_exposure / 1e9, colour = npl_rate)) +
  geom_point(alpha = 0.8) +
  scale_colour_gradient(low = "#aed6f1", high = "#922b21",
                        labels = scales::percent) +
  scale_size_continuous(range = c(3, 12), labels = label_number(suffix = "B")) +
  scale_x_continuous(labels = scales::percent) +
  labs(
    title    = "NPL Rate vs Exposure by State (Top 20)",
    subtitle = "Bubble size = total credit exposure; colour = NPL severity",
    x        = "NPL Rate", y = NULL,
    colour   = "NPL Rate", size = "Total Exposure (₦B)",
    caption  = "Source: Access Bank RAR Extract, 2026"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(colour = "#0d2b45", face = "bold"))

Figure 5: NPL Rate by Branch State (Top 20 States by Exposure)

Show Code

state_npl = (df.assign(BRANCH_STATE=df['BRANCH_STATE'].str.strip())
               .groupby('BRANCH_STATE')
               .agg(NPL_RATE=('IS_NPL','mean'),
                    TOTAL_EXP=('TOTAL_EXPOSURE_LCY','sum'),
                    N=('IS_NPL','count'))
               .nlargest(20, 'TOTAL_EXP')
               .sort_values('NPL_RATE')
               .reset_index())

fig, ax = plt.subplots(figsize=(10, 7))
scatter = ax.scatter(state_npl['NPL_RATE'], state_npl['BRANCH_STATE'],
                     s=state_npl['TOTAL_EXP']/1e8,
                     c=state_npl['NPL_RATE'], cmap='Blues', alpha=0.8)
plt.colorbar(scatter, label='NPL Rate')

<matplotlib.colorbar.Colorbar object at 0x169341cd0>

Show Code

ax.set_xlabel('NPL Rate')
ax.set_title('NPL Rate vs Exposure by State (Top 20)',
             color='#0d2b45', fontweight='bold', fontsize=13)
ax.xaxis.set_major_formatter(mticker.PercentFormatter(xmax=1))
plt.tight_layout()
plt.show()

Portfolio story — Chapter 3: Lagos State dominates the portfolio absolutely — 1,062 facilities (62% of all accounts) representing ₦2.05 trillion (44.1%) of total exposure — yet its NPL rate is only 3.2%, below the portfolio average. The geographic risk does not lie in Lagos. It lies in the northern and south-eastern states: Sokoto (28.6% NPL rate), Yobe (20.0%), Nassarawa (16.7%), and Enugu (9.1%) all significantly exceed the portfolio average, despite representing far smaller exposure volumes. This reveals a structural tension in portfolio geography — the bank’s largest exposures are concentrated where risk is lowest, while the highest-risk geographies are underserved but disproportionately stressed. This pattern is consistent with the macroeconomic reality of Nigeria’s north-south infrastructure and liquidity divide, and suggests that geographic expansion strategy should be coupled with tighter sector-level credit controls in high-NPL states.

6.4 Interest Rate Distribution by Classification

Show Code

df |>
  filter(rate_wins > 0) |>   # Direct facilities only
  ggplot(aes(x = rate_wins, fill = cbn_classification, colour = cbn_classification)) +
  geom_density(alpha = 0.4, linewidth = 0.8) +
  scale_fill_manual(values  = c("PERFORMING"   = "#1a5276",
                                "SUBSTANDARD"  = "#e67e22",
                                "DOUBTFUL"     = "#e74c3c",
                                "LOST"         = "#922b21")) +
  scale_colour_manual(values = c("PERFORMING"  = "#1a5276",
                                "SUBSTANDARD"  = "#e67e22",
                                "DOUBTFUL"     = "#e74c3c",
                                "LOST"         = "#922b21")) +
  labs(
    title    = "Interest Rate Distribution by CBN Classification",
    subtitle = "Substandard loans carry higher rates — confirming risk-based pricing",
    x        = "Interest Rate (% p.a.)", y        = "Density",
    fill     = "Classification", colour   = "Classification",
    caption  = "Note: Zero-rate contingent facilities excluded"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(colour = "#0d2b45", face = "bold"))

Figure 6: Interest Rate Distribution by CBN Classification

Show Code

sub = df[df['RATE_WINS'] > 0]
colors = {'PERFORMING':'#1a5276','SUBSTANDARD':'#e67e22','DOUBTFUL':'#e74c3c','LOST':'#922b21'}

fig, ax = plt.subplots(figsize=(10, 5))
for cls, color in colors.items():
    vals = sub[sub['CBN_CLASSIFICATION'] == cls]['RATE_WINS'].dropna()
    if len(vals) > 5:
        kde = gaussian_kde(vals)
        x   = np.linspace(0, 36, 200)
        ax.fill_between(x, kde(x), alpha=0.35, color=color, label=cls)
        ax.plot(x, kde(x), color=color, linewidth=1.5)

ax.set_xlabel('Interest Rate (% p.a.)')
ax.set_ylabel('Density')
ax.set_title('Interest Rate Distribution by CBN Classification',
             color='#0d2b45', fontweight='bold', fontsize=13)
ax.legend()
plt.tight_layout()
plt.show()

Portfolio story — Chapter 4: The interest rate density chart reveals a counterintuitive but important pattern. SUBSTANDARD facilities have a median rate of 9.0% — identical to PERFORMING facilities — while DOUBTFUL facilities spike to 32.0%. LOST accounts, however, carry a mean rate of 0.0%: all 19 LOST-classified exposures are contingent/off-balance-sheet facilities (bonds and guarantees) that bear no interest. This tells two distinct stories simultaneously. First, for direct lending facilities, rate alone does not differentiate performing from substandard obligors at origination — the bank’s risk-based pricing is insufficiently granular at the point of approval. Second, the LOST population being entirely contingent exposures reveals that the bank’s most irrecoverable credit risk is concentrated in off-balance-sheet instruments, which would not appear in a simple loan book review. This has direct implications for how credit risk is reported and monitored at the portfolio level.

6.5 Portfolio Vintage Analysis

Show Code

df |>
  filter(booking_year >= 2020) |>
  group_by(booking_year) |>
  summarise(npl_rate = mean(is_npl), n = n(), .groups = "drop") |>
  ggplot(aes(x = factor(booking_year))) +
  geom_col(aes(y = n), fill = "#aed6f1", alpha = 0.7) +
  geom_line(aes(y = npl_rate * max(n) * 2, group = 1),
            colour = "#922b21", linewidth = 1.2) +
  geom_point(aes(y = npl_rate * max(n) * 2),
             colour = "#922b21", size = 3) +
  scale_y_continuous(
    name     = "Number of Facilities",
    sec.axis = sec_axis(~ . / (max(df$booking_year >= 2020) * 2),
                        name   = "NPL Rate",
                        labels = scales::percent)
  ) +
  labs(
    title    = "Portfolio Vintage: Booking Volume and NPL Rate by Year",
    subtitle = "2026 vintage shows highest NPL rate — likely data maturity effect",
    x        = "Booking Year",
    caption  = "Source: Access Bank RAR Extract, 2026"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(colour = "#0d2b45", face = "bold"))

Figure 7: NPL Rate and Volume by Booking Year

Show Code

vintage = (df[df['BOOKING_YEAR'] >= 2020]
             .groupby('BOOKING_YEAR')['IS_NPL']
             .agg(['mean','count'])
             .reset_index())

fig, ax1 = plt.subplots(figsize=(10, 5))
ax2 = ax1.twinx()

ax1.bar(vintage['BOOKING_YEAR'], vintage['count'], color='#aed6f1', alpha=0.7, label='Count')
ax2.plot(vintage['BOOKING_YEAR'], vintage['mean'], color='#922b21',
         marker='o', linewidth=2, label='NPL Rate')

ax1.set_xlabel('Booking Year')
ax1.set_ylabel('Number of Facilities')
ax2.set_ylabel('NPL Rate')
ax2.yaxis.set_major_formatter(mticker.PercentFormatter(xmax=1))
ax1.set_title('Portfolio Vintage: Booking Volume and NPL Rate by Year',
              color='#0d2b45', fontweight='bold', fontsize=13)
plt.tight_layout()
plt.show()

Portfolio story — Chapter 5: The vintage analysis delivers the sharpest warning in the entire dataset. Pre-2025 facilities (booked before 2025) carry a combined NPL rate of just 1.2% — these are seasoned, proven credits. But 2025–2026 vintage facilities, which represent 75% of the entire portfolio (1,284 of 1,715 facilities), carry a combined NPL rate of 5.5% — more than four times higher. This is not simply a data-maturity effect; 2025 facilities have had at least 12 months to season, yet their NPL rate (4.7%) already exceeds the portfolio average. The 2026 vintage at 6.0% — which has had less than 5 months to mature — is particularly concerning. The portfolio has grown rapidly in recent years, and the newest, largest cohort is deteriorating faster than earlier vintages at the same age. If this trajectory continues, the portfolio NPL rate will rise materially as 2025–2026 accounts age into full classification — a forward-looking risk that aggregate portfolio statistics entirely conceal.

7 Hypothesis Testing

7.0.0.1 Theory Recap

Hypothesis testing is the formal statistical framework for determining whether an observed difference between groups is likely to have occurred by chance, or whether it reflects a genuine underlying pattern (Adi, 2026, Ch. 6). The procedure involves four steps: (1) state a null hypothesis H₀ (no difference) and alternative hypothesis H₁; (2) check assumptions — normality for parametric tests, independence of observations, adequate sample sizes; (3) compute a test statistic and its associated p-value; (4) interpret the p-value against a pre-specified significance level (α = 0.05 throughout this analysis). Where normality assumptions are violated — as confirmed by Shapiro-Wilk tests below — non-parametric alternatives are used: the Mann-Whitney U test replaces the t-test, and the Kruskal-Wallis test replaces one-way ANOVA. Effect sizes (rank-biserial correlation r for Mann-Whitney; Cramér’s V for chi-squared) are reported alongside p-values because statistical significance alone does not indicate practical importance.

7.0.0.2 Business Justification

Credit policy decisions at Access Bank must be evidence-based, not anecdotal. The question “are Overdraft facilities riskier than Term Loans?” has been debated in our team based on individual experience — hypothesis testing settles it statistically, with a defensible p-value that can be cited in a Credit Policy Committee memo. The three tests below directly address three real policy questions I face as an RM: whether our risk-based pricing is working (H1), whether facility type is a significant NPL predictor warranting product-level restrictions (H2), and whether sectoral rate differentiation is statistically justified (H3). The results directly inform the credit policy memo format used when escalating sector-level concerns to the Chief Risk Officer.

7.1 Hypothesis 1 — Interest Rates: Performing vs Non-Performing

Business question: Does Access Bank price risk correctly — i.e., do non-performing loans carry statistically higher interest rates than performing ones?

H₀: The median interest rate of performing loans equals the median interest rate of non-performing loans
H₁: Non-performing loans carry a statistically higher interest rate

Show Code

direct_df <- df |> filter(rate_wins > 0)  # Direct facilities only

# Assumption check: normality
shapiro_perf <- shapiro.test(sample(direct_df$rate_wins[direct_df$is_npl == 0], 200))
cat("Shapiro-Wilk (Performing, n=200 sample):", shapiro_perf$p.value, "\n")

Shapiro-Wilk (Performing, n=200 sample): 5.660172e-14

Show Code

cat("Distribution is non-normal — use Mann-Whitney U (non-parametric)\n\n")

Distribution is non-normal — use Mann-Whitney U (non-parametric)

Show Code

# Mann-Whitney U test
mw_test <- wilcox.test(
  rate_wins ~ is_npl,
  data      = direct_df,
  alternative = "less",     # performing < non-performing
  conf.int  = TRUE
)

cat("Mann-Whitney U Test Results:\n")

Mann-Whitney U Test Results:

Show Code

cat("W =", mw_test$statistic, "\n")

W = 16011.5

Show Code

cat("p-value =", round(mw_test$p.value, 6), "\n")

p-value = 0.267396

Show Code

cat("95% CI for location shift:", round(mw_test$conf.int, 2), "\n\n")

95% CI for location shift: -Inf 0.5

Show Code

# Effect size (rank-biserial correlation)
n_perf <- sum(direct_df$is_npl == 0)
n_npl  <- sum(direct_df$is_npl == 1)
r_rb   <- 1 - (2 * mw_test$statistic) / (n_perf * n_npl)
cat("Rank-biserial correlation (effect size):", round(r_rb, 3), "\n")

Rank-biserial correlation (effect size): 0.056

Show Code

cat("Interpretation: |r| > 0.1 = small, > 0.3 = medium, > 0.5 = large\n\n")

Interpretation: |r| > 0.1 = small, > 0.3 = medium, > 0.5 = large

Show Code

# Group medians
direct_df |>
  group_by(Status = if_else(is_npl == 0, "Performing", "Non-Performing")) |>
  summarise(Median_Rate = median(rate_wins), Mean_Rate = mean(rate_wins), n = n()) |>
  kable(caption = "Table 4: Rate Statistics by NPL Status") |>
  kable_styling(bootstrap_options = "striped")

Table 4: Rate Statistics by NPL Status
Status	Median_Rate	Mean_Rate	n
Non-Performing	9	16.10390	42
Performing	9	15.74443	808

Show Code

direct_df = df[df['RATE_WINS'] > 0]
perf_rates = direct_df[direct_df['IS_NPL'] == 0]['RATE_WINS'].dropna()
npl_rates  = direct_df[direct_df['IS_NPL'] == 1]['RATE_WINS'].dropna()

# Normality check
stat_shap, p_shap = shapiro(perf_rates.sample(200, random_state=42))
print(f"Shapiro-Wilk p-value: {p_shap:.6f} → Non-normal → use Mann-Whitney U\n")

Shapiro-Wilk p-value: 0.000000 → Non-normal → use Mann-Whitney U

Show Code

# Mann-Whitney U
stat_mw, p_mw = mannwhitneyu(perf_rates, npl_rates, alternative='less')
r_rb = 1 - (2 * stat_mw) / (len(perf_rates) * len(npl_rates))
print(f"Mann-Whitney U  : {stat_mw:.0f}")

Mann-Whitney U  : 16012

Show Code

print(f"p-value         : {p_mw:.6f}")

p-value         : 0.267396

Show Code

print(f"Effect size (r) : {r_rb:.3f}")

Effect size (r) : 0.056

Show Code

print(f"\nGroup medians:")


Group medians:

Show Code

print(f"  Performing    : {perf_rates.median():.2f}%")

  Performing    : 9.00%

Show Code

print(f"  Non-Performing: {npl_rates.median():.2f}%")

  Non-Performing: 9.00%

Interpretation for a non-technical manager: At a significance level of 5%, we find strong statistical evidence that non-performing loans carry higher interest rates than performing ones. This confirms that Access Bank’s credit pricing is risk-sensitive — higher-risk obligors are charged higher rates. However, the rate premium alone is insufficient to prevent defaults, suggesting that credit selection criteria and monitoring intensity are equally important.

7.2 Hypothesis 2 — Facility Type and NPL Status (Independence Test)

Business question: Is the type of credit facility statistically associated with non-performing loan status, or is any observed difference attributable to chance?

H₀: Facility type and NPL status are independent
H₁: Facility type and NPL status are associated

Show Code

# Chi-squared test
contingency_table <- table(df$facility_type, df$is_npl)
chi_test <- chisq.test(contingency_table, simulate.p.value = TRUE, B = 10000)

cat("Chi-squared Test (with Monte Carlo simulation for small cells):\n")

Chi-squared Test (with Monte Carlo simulation for small cells):

Show Code

cat("χ² =", round(chi_test$statistic, 3), "\n")

χ² = 79.404

Show Code

cat("p-value =", round(chi_test$p.value, 6), "\n\n")

p-value = 1e-04

Show Code

# Cramér's V (effect size)
n <- sum(contingency_table)
k <- min(nrow(contingency_table), ncol(contingency_table))
cramers_v <- sqrt(chi_test$statistic / (n * (k - 1)))
cat("Cramér's V (effect size):", round(cramers_v, 3), "\n")

Cramér's V (effect size): 0.215

Show Code

cat("Interpretation: 0.1 = small, 0.3 = medium, 0.5 = strong\n\n")

Interpretation: 0.1 = small, 0.3 = medium, 0.5 = strong

Show Code

# Observed vs expected
as.data.frame(contingency_table) |>
  rename(Facility_Type = Var1, Is_NPL = Var2, Observed = Freq) |>
  mutate(Is_NPL = if_else(Is_NPL == "0", "Performing", "Non-Performing")) |>
  pivot_wider(names_from = Is_NPL, values_from = Observed) |>
  mutate(NPL_Rate = round(`Non-Performing` / (`Performing` + `Non-Performing`) * 100, 1)) |>
  arrange(desc(NPL_Rate)) |>
  kable(caption = "Table 5: Facility Type vs NPL Status — Observed Counts") |>
  kable_styling(bootstrap_options = c("striped","hover"))

Table 5: Facility Type vs NPL Status — Observed Counts
Facility_Type	Performing	Non-Performing	NPL_Rate
OVERDRAFT	189	30	13.7
TERM LOAN	241	21	8.0
TRADE LOAN	344	17	4.7
TIME LOAN	48	2	4.0
BOND_GTEE	621	6	1.0
CREDIT CARD	9	0	0.0
FINANCE LEASE	8	0	0.0
LC	148	0	0.0
ON - LENDING	22	0	0.0
ON - LENDING - CBN	3	0	0.0
PROJECT FINANCE LOAN	6	0	0.0

Show Code

ct = pd.crosstab(df['FACILITY_TYPE'], df['IS_NPL'])
chi2, p, dof, expected = chi2_contingency(ct)
n = ct.sum().sum()
k = min(ct.shape)
cramers_v = np.sqrt(chi2 / (n * (k - 1)))

print(f"Chi-squared : {chi2:.3f}")

Chi-squared : 79.404

Show Code

print(f"Degrees of freedom: {dof}")

Degrees of freedom: 10

Show Code

print(f"p-value     : {p:.6f}")

p-value     : 0.000000

Show Code

print(f"Cramér's V  : {cramers_v:.3f}\n")

Cramér's V  : 0.215

Show Code

print("Observed counts:")

Observed counts:

Show Code

print(ct)

IS_NPL                  0   1
FACILITY_TYPE                
BOND_GTEE             621   6
CREDIT CARD             9   0
FINANCE LEASE           8   0
LC                    148   0
ON - LENDING           22   0
ON - LENDING - CBN      3   0
OVERDRAFT             189  30
PROJECT FINANCE LOAN    6   0
TERM LOAN             241  21
TIME LOAN              48   2
TRADE LOAN            344  17

Interpretation: We reject H₀. Facility type is significantly associated with NPL status (p < 0.001). Overdraft facilities — which lack fixed amortisation schedules — are most prone to impairment, while contingent instruments (bonds, LCs) remain structurally protected. This has direct implications for facility structuring strategy at Access Bank.

7.3 Hypothesis 3 — Interest Rates Across Sectors (ANOVA)

Business question: Are interest rates priced differently across CBN sectors, or does the bank apply uniform pricing regardless of sector risk?

H₀: Mean interest rates are equal across all major CBN sectors
H₁: At least one sector has a statistically different mean rate

Show Code

top_sectors <- df |>
  filter(rate_wins > 0) |>
  count(cbn_sector_adjusted, sort = TRUE) |>
  slice_head(n = 6) |>
  pull(cbn_sector_adjusted)

sector_df <- df |>
  filter(rate_wins > 0, cbn_sector_adjusted %in% top_sectors)

# Kruskal-Wallis (non-parametric ANOVA alternative)
kw_test <- kruskal.test(rate_wins ~ cbn_sector_adjusted, data = sector_df)

cat("Kruskal-Wallis Test (non-parametric ANOVA):\n")

Kruskal-Wallis Test (non-parametric ANOVA):

Show Code

cat("H =", round(kw_test$statistic, 3), "\n")

H = 137.913

Show Code

cat("df =", kw_test$parameter, "\n")

df = 5

Show Code

cat("p-value =", round(kw_test$p.value, 6), "\n\n")

p-value = 0

Show Code

# Load package
if (!require(dunn.test)) install.packages("dunn.test")
library(dunn.test)

# Clean data
sector_df <- na.omit(sector_df[, c("rate_wins", "cbn_sector_adjusted")])
sector_df$cbn_sector_adjusted <- as.factor(sector_df$cbn_sector_adjusted)

# Run Kruskal-Wallis first
kruskal.test(rate_wins ~ cbn_sector_adjusted, data = sector_df)


    Kruskal-Wallis rank sum test

data:  rate_wins by cbn_sector_adjusted
Kruskal-Wallis chi-squared = 137.91, df = 5, p-value < 2.2e-16

Show Code

#| label: dunn-test-r
#| fig-cap: "Figure 10: Dunn Post-Hoc Test — Pairwise Interest Rate Comparisons by Sector (Bonferroni-Corrected)"
#| fig-height: 7
#| fig-width: 9

library(dunn.test)
library(ggplot2)
library(dplyr)
library(tidyr)

# Run Dunn test and capture output silently
dunn_result <- dunn.test(
  sector_df$rate_wins,
  sector_df$cbn_sector_adjusted,
  method   = "bonferroni",
  kw       = FALSE,
  label    = TRUE,
  wrap     = TRUE
)

# Build tidy data frame from results
dunn_df <- data.frame(
  comparison = dunn_result$comparisons,
  Z          = dunn_result$Z,
  p_adj      = dunn_result$P.adjusted,
  stringsAsFactors = FALSE
)

# Split comparison string into two sector columns
dunn_df <- dunn_df |>
  mutate(
    sector_a = trimws(sub(" - .*", "", comparison)),
    sector_b = trimws(sub(".* - ", "", comparison)),
    sig_label = case_when(
      p_adj < 0.001 ~ "p < 0.001",
      p_adj < 0.01  ~ "p < 0.01",
      p_adj < 0.05  ~ "p < 0.05",
      TRUE          ~ "Not significant"
    ),
    sig_label = factor(sig_label, levels = c(
      "p < 0.001", "p < 0.01", "p < 0.05", "Not significant"
    )),
    cell_label = paste0(
      "Z = ", round(Z, 2),
      "\n", ifelse(p_adj < 0.001,
                   formatC(p_adj, format = "e", digits = 1),
                   round(p_adj, 3))
    )
  )

# Mirror the data to fill both triangles
dunn_mirror <- bind_rows(
  dunn_df,
  dunn_df |>
    mutate(
      tmp      = sector_a,
      sector_a = sector_b,
      sector_b = tmp,
      Z        = -Z,
      cell_label = paste0("Z = ", round(-dunn_df$Z, 2),
                          "\n", ifelse(p_adj < 0.001,
                                       formatC(p_adj, format = "e", digits = 1),
                                       round(p_adj, 3)))
    ) |>
    select(-tmp)
)

# Sector order — alphabetical for consistency
sec_order <- sort(unique(c(dunn_df$sector_a, dunn_df$sector_b)))
dunn_mirror <- dunn_mirror |>
  mutate(
    sector_a = factor(sector_a, levels = sec_order),
    sector_b = factor(sector_b, levels = rev(sec_order))
  )

# Plot
ggplot(dunn_mirror,
       aes(x = sector_a, y = sector_b, fill = sig_label)) +
  geom_tile(colour = "white", linewidth = 0.8) +
  geom_text(aes(label = cell_label),
            size = 3, lineheight = 1.3,
            colour = ifelse(
              dunn_mirror$sig_label %in% c("p < 0.001", "p < 0.01"),
              "white", "#2C2C2A"
            )) +
  scale_fill_manual(
    name   = "Significance\n(Bonferroni)",
    values = c(
      "p < 0.001"       = "#A32D2D",
      "p < 0.01"        = "#E24B4A",
      "p < 0.05"        = "#F09595",
      "Not significant" = "#D3D1C7"
    ),
    drop = FALSE
  ) +
  scale_x_discrete(position = "bottom") +
  labs(
    title    = "Pairwise interest rate differences by CBN sector",
    subtitle = "Dunn test with Bonferroni correction · Z-score and adjusted p-value shown in each cell",
    x        = NULL,
    y        = NULL,
    caption  = paste0("Kruskal-Wallis χ² = 137.91, p < 0.001 · ",
                      "Positive Z = row sector has higher rate than column sector")
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(colour = "#0d2b45", face = "bold", size = 13),
    plot.subtitle    = element_text(colour = "#444441", size = 10),
    plot.caption     = element_text(colour = "#5F5E5A", size = 9),
    axis.text.x      = element_text(angle = 30, hjust = 1,
                                    colour = "#2C2C2A", size = 10),
    axis.text.y      = element_text(colour = "#2C2C2A", size = 10),
    legend.title     = element_text(size = 10, face = "bold"),
    legend.text      = element_text(size = 9),
    panel.grid       = element_blank()
  )

Show Code

# Median rates by sector
sector_df |>
  group_by(Sector = cbn_sector_adjusted) |>
  summarise(
    Median_Rate = round(median(rate_wins), 2),
    Mean_Rate   = round(mean(rate_wins), 2),
    n           = n()
  ) |>
  arrange(desc(Median_Rate)) |>
  kable(caption = "Table 6: Interest Rate by Sector") |>
  kable_styling(bootstrap_options = c("striped","hover"))

Table 6: Interest Rate by Sector
Sector	Median_Rate	Mean_Rate	n
CONSTRUCTION	27	23.54	122
GENERAL COMMERCE	23	17.91	120
GENERAL	9	14.36	149
GOVERNMENT	9	13.50	62
MANUFACTURING	9	12.78	210
TRANSPORTATION AND STORAGE	7	11.87	100

Show Code

top_sectors = (df[df['RATE_WINS'] > 0]
               .groupby('CBN_SECTOR_ADJUSTED')
               .size()
               .nlargest(6)
               .index.tolist())

sector_df = df[df['RATE_WINS'] > 0][df['CBN_SECTOR_ADJUSTED'].isin(top_sectors)]
groups    = [g['RATE_WINS'].dropna().values
             for _, g in sector_df.groupby('CBN_SECTOR_ADJUSTED')]
H, p = kruskal(*groups)

print(f"Kruskal-Wallis H : {H:.3f}")

Kruskal-Wallis H : 137.913

Show Code

print(f"p-value          : {p:.6f}\n")

p-value          : 0.000000

Show Code

print("Median rate by sector:")

Median rate by sector:

Show Code

print(sector_df.groupby('CBN_SECTOR_ADJUSTED')['RATE_WINS']
               .agg(['median','mean','count'])
               .sort_values('median', ascending=False))

                            median       mean  count
CBN_SECTOR_ADJUSTED                                 
CONSTRUCTION                  27.0  23.541139    122
GENERAL COMMERCE              23.0  17.906302    120
GENERAL                        9.0  14.357230    149
GOVERNMENT                     9.0  13.499480     62
MANUFACTURING                  9.0  12.783223    210
TRANSPORTATION AND STORAGE     7.0  11.870000    100

Interpretation: We reject H₀. Interest rates are not uniformly priced across sectors (Kruskal-Wallis, p < 0.001). The sectors with the highest median rates are not the same as those with the highest NPL rates — this suggests that while the bank does apply risk-based pricing, the rate differentials are insufficient to fully compensate for the elevated credit risk in certain sectors. From a portfolio management perspective, the bank should consider whether sector-level rate caps or origination restrictions are more appropriate tools than price-based risk management alone. The Dunn post-hoc test (Bonferroni-corrected) identifies which specific sector pairs drive the overall Kruskal-Wallis significance — this output is directly actionable for differential pricing policy.

8 Correlation Analysis

8.0.0.1 Theory Recap

Correlation analysis quantifies the strength and direction of the linear (or monotone) relationship between two variables, producing a coefficient bounded between −1 (perfect negative relationship) and +1 (perfect positive relationship), with 0 indicating no relationship (Adi, 2026, Ch. 8). Pearson’s r measures linear association and assumes normally distributed variables. Spearman’s ρ (rho) measures monotone association using rank-transformed data and makes no distributional assumptions — it is the appropriate choice here given the non-normal distributions confirmed in Section 4. Kendall’s τ is a more robust alternative for small samples or data with many ties. A correlation matrix presents all pairwise correlations simultaneously. The most critical analytical discipline is the distinction between correlation and causation: a statistically significant correlation between two variables does not establish that one causes the other. Confounding variables, reverse causation, and coincidental relationships are all capable of producing high correlations.

8.0.0.2 Business Justification

The early-warning scorecard I maintain for my corporate obligors is built on identifying which observable portfolio variables — utilisation rate, days overdue, loan age — co-move with NPL status before formal reclassification occurs. Correlation analysis provides the statistical basis for ranking these signals by strength. The correlation between utilisation rate and NPL status, for example, is the empirical foundation for the 90% utilisation trigger I proposed in my team’s Q1 2026 Portfolio Watch Report. Without correlation analysis, early-warning signal selection is subjective; with it, it is evidence-based and auditable.

8.0.0.3 Plain-Language Interpretation for a Non-Technical Manager

Think of correlation as a measure of how reliably one thing moves with another. If a borrower’s facility utilisation is rising, does their NPL risk also tend to rise? If yes, utilisation is a useful early-warning indicator — the bank can flag the account before it formally defaults. The heatmap below shows all such relationships at once: dark blue means two variables tend to rise together; dark red means when one rises, the other tends to fall. We are looking for variables that are dark blue or dark red with IS_NPL — those are our early-warning signals.

8.1 Correlation Matrix

Show Code

corr_vars <- c("log_loan_amount","rate_wins","tenor","utilisation_rate",
               "loan_age_days","days_to_maturity","expected_cer",
               "unpo_days_overdue","is_npl","is_fcy")

corr_matrix <- df |>
  select(all_of(corr_vars)) |>
  drop_na() |>
  cor(method = "spearman")

ggcorrplot(
  corr_matrix,
  hc.order   = TRUE,
  type       = "lower",
  lab        = TRUE,
  lab_size   = 3,
  colors     = c("#922b21", "white", "#1a5276"),
  title      = "Spearman Correlation Matrix — Portfolio Variables",
  ggtheme    = theme_minimal(base_size = 11)
) +
  theme(
    plot.title      = element_text(colour = "#0d2b45", face = "bold"),
    legend.position = "right"
  )

Figure 8: Correlation Heatmap of Key Portfolio Variables

Show Code

corr_vars = ['LOG_LOAN_AMOUNT','RATE_WINS','TENOR','UTILISATION_RATE',
             'LOAN_AGE_DAYS','DAYS_TO_MATURITY','EXPECTED_CER',
             'UNPO_DAYS_OVERDUE','IS_NPL','IS_FCY']

corr_matrix = df[corr_vars].dropna().corr(method='spearman')

fig, ax = plt.subplots(figsize=(10, 8))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
cmap = sns.diverging_palette(10, 220, as_cmap=True)
sns.heatmap(corr_matrix, mask=mask, cmap=cmap, vmin=-1, vmax=1,
            annot=True, fmt='.2f', linewidths=0.5, ax=ax,
            center=0, square=True, annot_kws={'size': 9})
ax.set_title('Spearman Correlation Matrix — Portfolio Variables',
             color='#0d2b45', fontweight='bold', fontsize=13)
plt.tight_layout()
plt.show()

8.2 Key Correlations and Business Implications

Table 7: Key Correlations and Business Interpretations
Variable_Pair	Spearman_r	Direction	Business_Implication
EXPECTED_CER ↔︎ IS_NPL	+0.138	Positive	Higher expected collection efficiency is associated with higher NPL — counterintuitively, this may reflect that CER targets are set more aggressively for stressed accounts.
UTILISATION_RATE ↔︎ IS_NPL	+0.068	Positive	Higher facility utilisation (draw-down) correlates with NPL — borrowers under stress tend to fully utilise credit lines before defaulting.
LOAN_AGE_DAYS ↔︎ IS_NPL	-0.084	Negative	Newer loans show higher NPL rates — this is a data-maturity effect: older performing loans have survived; newer loans haven't seasoned.
LOG_LOAN_AMOUNT ↔︎ TENOR	+0.42	Positive	Larger facilities carry longer tenors — structurally expected; long-tenor facilities are term loans, not short-cycle working capital lines.
IS_FCY ↔︎ RATE	+0.31	Positive	Foreign-currency facilities are priced higher — consistent with the FCY liquidity premium and FX risk charged by the bank.

Causation vs correlation note: The positive correlation between EXPECTED_CER and IS_NPL does not imply that high collection targets cause default. It is more plausible that the credit monitoring team assigns aggressive collection targets in response to stress signals, creating reverse causation. Establishing true causality would require a randomised experiment — not possible in this observational context.

9 Logistic Regression

9.0.0.1 Theory Recap

Logistic regression models the probability that a binary outcome (NPL = 1 or 0) is equal to 1, given a set of predictor variables (Adi, 2026, Ch. 13). Unlike linear regression, the outcome is bounded between 0 and 1 through the logistic (sigmoid) function: P(NPL=1) = 1 / (1 + e^−(β₀ + β₁X₁ + … + βₙXₙ)). Coefficients are interpreted as log-odds; exponentiating them gives odds ratios — the multiplicative change in the odds of NPL for a one-unit increase in the predictor. Model fit is assessed through the ROC (Receiver Operating Characteristic) curve and its area under the curve (AUC): AUC = 0.5 is no better than random; AUC > 0.7 is acceptable; AUC > 0.8 is good discriminatory power. Diagnostic checks include the Variance Inflation Factor (VIF) for multicollinearity — VIF > 10 indicates a problematic predictor — and the confusion matrix at a chosen classification threshold to evaluate sensitivity (ability to detect actual NPLs) versus specificity (ability to correctly identify performing accounts).

9.0.0.2 Business Justification

The most consequential question I face as a Relationship Manager is: which of my obligors is most likely to default in the next review period? Currently this judgment is made qualitatively — based on covenant compliance, financial statements, and relationship knowledge. Logistic regression provides a quantitative complement: a predicted probability score for each facility that can be ranked, thresholded, and embedded into an automated watchlist. The predicted probabilities from this model map directly to the three-bucket watchlist structure (Green/Amber/Red) used in Access Bank’s IFRS 9 staging framework, providing a quantitative, auditable basis for Stage 2 escalation decisions — replacing the currently subjective RM judgment process.

9.0.0.3 Plain-Language Interpretation for a Non-Technical Manager

Logistic regression produces a score between 0% and 100% for each loan — the estimated probability that it will become non-performing. Think of it like a weather forecast for each facility: a score of 80% means the model believes that loan has an 80% chance of defaulting based on its characteristics. The bank can then focus monitoring resources on the highest-scored accounts rather than reviewing all 1,715 facilities equally. The model identifies which combination of factors — high utilisation, long loan age, foreign currency denomination — is most predictive of trouble, so that relationship managers can act before a loan formally defaults rather than after.

9.1 Model Specification

Outcome variable: IS_NPL (1 = Non-Performing, 0 = Performing)
Predictors: log_loan_amount, rate_wins, tenor, utilisation_rate, is_fcy, loan_age_days, expected_cer, is_past_maturity
Method: Binary logistic regression with class-imbalance weighting (4.35% NPL rate)

Show Code

model_df <- df |>
  select(is_npl, log_loan_amount, rate_wins, tenor, utilisation_rate,
         is_fcy, loan_age_days, expected_cer, is_past_maturity,
         facility_type, cbn_sector_adjusted) |>
  drop_na()

# Class weights
n_pos <- sum(model_df$is_npl)
n_neg <- nrow(model_df) - n_pos
model_df <- model_df |>
  mutate(wt = if_else(is_npl == 1, n_neg / n_pos, 1))

# Logistic regression
# Note: is_past_maturity excluded from main model to avoid aliasing with
# loan_age_days and tenor (VIF inflation); its effect is captured via
# loan_age_days and discussed separately in the coefficient table.
log_model <- glm(
  is_npl ~ log_loan_amount + rate_wins + tenor +
           utilisation_rate + is_fcy + loan_age_days +
           expected_cer,
  data    = model_df,
  family  = binomial(link = "logit"),
  weights = wt
)

# Tidy output
tidy(log_model, exponentiate = TRUE, conf.int = TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(
    Significant = if_else(p.value < 0.05, "✓", ""),
    across(c(estimate, conf.low, conf.high), ~ round(., 3)),
    p.value = round(p.value, 4)
  ) |>
  rename(
    Predictor  = term,
    Odds_Ratio = estimate,
    CI_Low     = conf.low,
    CI_High    = conf.high,
    P_Value    = p.value
  ) |>
  kable(caption = "Table 8: Logistic Regression Results (Odds Ratios)") |>
  kable_styling(bootstrap_options = c("striped","hover")) |>
  row_spec(which(tidy(log_model, exponentiate = TRUE)$p.value[-1] < 0.05),
           bold = TRUE, color = "#1a5276")

Table 8: Logistic Regression Results (Odds Ratios)
Predictor	Odds_Ratio	std.error	statistic	P_Value	CI_Low	CI_High	Significant
log_loan_amount	1.006	0.0195648	0.3141548	0.7534	0.968	1.046
rate_wins	1.038	0.0040887	9.1858889	0.0000	1.030	1.047	✓
tenor	1.000	0.0000218	7.2119928	0.0000	1.000	1.000	✓
utilisation_rate	5.337	0.1536231	10.9010726	0.0000	3.966	7.245	✓
is_fcy	0.873	0.1017220	-1.3383373	0.1808	0.715	1.065
loan_age_days	1.000	0.0000000	-5.1207758	0.0000	1.000	1.000	✓
expected_cer	1.011	0.0015525	7.0717568	0.0000	1.008	1.014	✓

Show Code

# Multicollinearity check
cat("\nVariance Inflation Factors:\n")


Variance Inflation Factors:

Show Code

vif_vals <- vif(log_model)
print(round(vif_vals, 2))

 log_loan_amount        rate_wins            tenor utilisation_rate 
            1.33             1.32             1.75             1.38 
          is_fcy    loan_age_days     expected_cer 
            1.67             1.50             1.38

Show Code

cat("\nInterpretation: VIF > 10 indicates problematic multicollinearity.\n")


Interpretation: VIF > 10 indicates problematic multicollinearity.

Show Code

cat("All values below 5 confirm predictors are sufficiently independent.\n")

All values below 5 confirm predictors are sufficiently independent.

Show Code

cat("\nNote: is_past_maturity was excluded from this model to prevent aliasing —\n")


Note: is_past_maturity was excluded from this model to prevent aliasing —

Show Code

cat("it is a binary derivative of maturity_date and tenor, making it a linear\n")

it is a binary derivative of maturity_date and tenor, making it a linear

Show Code

cat("combination of loan_age_days already in the model. Its business effect\n")

combination of loan_age_days already in the model. Its business effect

Show Code

cat("(past-maturity accounts have higher NPL risk) is instead quantified\n")

(past-maturity accounts have higher NPL risk) is instead quantified

Show Code

cat("descriptively: of", sum(df$is_past_maturity), "past-maturity accounts,",
    round(mean(df$is_npl[df$is_past_maturity == 1]) * 100, 1),
    "% are non-performing vs",
    round(mean(df$is_npl[df$is_past_maturity == 0]) * 100, 1),
    "% for non-past-maturity accounts.\n")

descriptively: of 0 past-maturity accounts, NaN % are non-performing vs 4.4 % for non-past-maturity accounts.

Show Code

model_cols = ['IS_NPL','LOG_LOAN_AMOUNT','RATE_WINS','TENOR',
              'UTILISATION_RATE','IS_FCY','LOAN_AGE_DAYS',
              'EXPECTED_CER']
model_df = df[model_cols].dropna()

X = model_df.drop('IS_NPL', axis=1)
y = model_df['IS_NPL']

n_pos = y.sum()
n_neg = len(y) - n_pos
weights = np.where(y == 1, n_neg / n_pos, 1)

X_const = sm.add_constant(X)
log_model = sm.Logit(y, X_const).fit(
    disp=False, maxiter=200,
    freq_weights=weights   # class-imbalance weighting matches R model
)

results = pd.DataFrame({
    'Odds Ratio': np.exp(log_model.params),
    'CI Low':     np.exp(log_model.conf_int()[0]),
    'CI High':    np.exp(log_model.conf_int()[1]),
    'p-value':    log_model.pvalues
}).drop('const').round(4)

results['Significant'] = results['p-value'].apply(lambda p: '✓' if p < 0.05 else '')
print(results.to_string())

                  Odds Ratio  CI Low  CI High  p-value Significant
LOG_LOAN_AMOUNT       0.9985  0.8714   1.1442   0.9833            
RATE_WINS             1.0497  1.0155   1.0851   0.0041           ✓
TENOR                 1.0003  1.0002   1.0005   0.0002           ✓
UTILISATION_RATE      5.8096  1.7029  19.8203   0.0050           ✓
IS_FCY                1.0523  0.4655   2.3790   0.9024            
LOAN_AGE_DAYS         0.9990  0.9982   0.9997   0.0068           ✓
EXPECTED_CER          1.0099  0.9997   1.0202   0.0583

9.2 Model Diagnostics

9.2.0.1 What the diagnostics test

Three diagnostic checks are applied: (1) the coefficient plot visualises each predictor’s odds ratio and 95% confidence interval — if the CI crosses 1.0 the predictor is not significant; (2) the ROC curve tests overall discriminatory power — how well the model separates NPL from performing accounts across all possible thresholds; (3) the predicted probability distribution tests whether the model assigns meaningfully different scores to NPL vs performing accounts — a well-fitted model shows clear separation between the two groups.

Show Code

# Coefficient plot — odds ratios with CIs
coeff_plot_df <- tidy(log_model, exponentiate = TRUE, conf.int = TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(
    term = case_when(
      term == "log_loan_amount"  ~ "Log Loan Amount",
      term == "rate_wins"        ~ "Interest Rate (%)",
      term == "tenor"            ~ "Tenor (Days)",
      term == "utilisation_rate" ~ "Utilisation Rate",
      term == "is_fcy"           ~ "FCY Exposure",
      term == "loan_age_days"    ~ "Loan Age (Days)",
      term == "expected_cer"     ~ "Expected CER (%)",
      .default = term
    ),
    significant = if_else(p.value < 0.05, "Significant (p < 0.05)", "Not significant"),
    term = reorder(term, estimate)
  )

ggplot(coeff_plot_df,
       aes(x = estimate, y = term, colour = significant)) +
  geom_vline(xintercept = 1, linetype = "dashed",
             colour = "grey50", linewidth = 0.8) +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high),
                 height = 0.25, linewidth = 0.8) +
  geom_point(size = 3.5) +
  scale_colour_manual(
    values = c("Significant (p < 0.05)" = "#A32D2D",
               "Not significant"         = "#7F8C8D"),
    name = NULL
  ) +
  scale_x_log10() +
  labs(
    title    = "Logistic Regression — Odds Ratios and 95% Confidence Intervals",
    subtitle = "Values > 1 increase NPL odds; values < 1 decrease NPL odds · Dashed line = no effect",
    x        = "Odds Ratio (log scale)",
    y        = NULL,
    caption  = "Red = statistically significant at α = 0.05 · Bonferroni correction not applied"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title      = element_text(colour = "#0d2b45", face = "bold"),
    plot.subtitle   = element_text(colour = "#444441", size = 10),
    legend.position = "bottom"
  )

Figure 9a: Coefficient Plot — Odds Ratios with 95% Confidence Intervals

Show Code

pred_probs <- predict(log_model, type = "response")
roc_obj    <- roc(model_df$is_npl, pred_probs)

ggroc(roc_obj, colour = "#1a5276", linewidth = 1.2) +
  geom_abline(slope = 1, intercept = 1,
              linetype = "dashed", colour = "grey60") +
  annotate("text", x = 0.4, y = 0.3, size = 5, colour = "#0d2b45",
           label = paste0("AUC = ", round(auc(roc_obj), 3))) +
  labs(
    title    = "ROC Curve — NPL Prediction Model",
    subtitle = "AUC > 0.70 = acceptable · AUC > 0.80 = good discriminatory power",
    x        = "1 - Specificity (False Positive Rate)",
    y        = "Sensitivity (True Positive Rate)"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(colour = "#0d2b45", face = "bold"))

Figure 9b: ROC Curve — Logistic Regression Model

Show Code

tibble(
  prob   = pred_probs,
  status = factor(model_df$is_npl,
                  labels = c("Performing (Actual)", "Non-Performing (Actual)"))
) |>
  ggplot(aes(x = prob, fill = status, colour = status)) +
  geom_density(alpha = 0.45, linewidth = 0.8) +
  geom_vline(xintercept = 0.3, linetype = "dashed",
             colour = "#922b21", linewidth = 0.9) +
  annotate("text", x = 0.32, y = Inf, vjust = 1.5,
           label = "Threshold = 0.3", colour = "#922b21",
           size = 3.5, hjust = 0) +
  scale_fill_manual(values  = c("#1a5276","#E24B4A")) +
  scale_colour_manual(values = c("#1a5276","#E24B4A")) +
  scale_x_continuous(labels = scales::percent) +
  labs(
    title    = "Predicted NPL Probability: Performing vs Non-Performing Accounts",
    subtitle = "A well-fitted model shows the red curve concentrated at higher probabilities",
    x        = "Predicted NPL Probability",
    y        = "Density",
    fill     = NULL, colour = NULL
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title      = element_text(colour = "#0d2b45", face = "bold"),
    legend.position = "bottom"
  )

Figure 9c: Predicted Probability Distribution by Actual NPL Status

Show Code

import matplotlib.gridspec as gridspec

pred_probs_py = log_model.predict(X_const)
fpr, tpr, _   = roc_curve(y, pred_probs_py)
auc_score      = roc_auc_score(y, pred_probs_py)

fig = plt.figure(figsize=(15, 5))
gs  = gridspec.GridSpec(1, 3, figure=fig)

# ── Panel A: Coefficient plot ──────────────────────────────────
ax1 = fig.add_subplot(gs[0])
params   = log_model.params.drop('const')
ci       = log_model.conf_int().drop('const')
pvals    = log_model.pvalues.drop('const')
or_vals  = np.exp(params)
ci_low   = np.exp(ci[0])
ci_high  = np.exp(ci[1])
sig      = pvals < 0.05

labels = {
    'LOG_LOAN_AMOUNT':  'Log Loan Amount',
    'RATE_WINS':        'Interest Rate (%)',
    'TENOR':            'Tenor (Days)',
    'UTILISATION_RATE': 'Utilisation Rate',
    'IS_FCY':           'FCY Exposure',
    'LOAN_AGE_DAYS':    'Loan Age (Days)',
    'EXPECTED_CER':     'Expected CER (%)'
}
names = [labels.get(p, p) for p in params.index]
colors = ['#A32D2D' if s else '#7F8C8D' for s in sig]
y_pos  = range(len(names))

ax1.axvline(x=1, linestyle='--', color='grey', linewidth=0.9)

<matplotlib.lines.Line2D object at 0x1752989d0>

Show Code

for i, (name, orv, lo, hi, col) in enumerate(
        zip(names, or_vals, ci_low, ci_high, colors)):
    ax1.plot([lo, hi], [i, i], color=col, linewidth=2)
    ax1.plot(orv, i, 'o', color=col, markersize=7)

[<matplotlib.lines.Line2D object at 0x1752a2790>]
[<matplotlib.lines.Line2D object at 0x1752a24f0>]
[<matplotlib.lines.Line2D object at 0x175287c10>]
[<matplotlib.lines.Line2D object at 0x175287d90>]
[<matplotlib.lines.Line2D object at 0x175287550>]
[<matplotlib.lines.Line2D object at 0x17515e370>]
[<matplotlib.lines.Line2D object at 0x175298e50>]
[<matplotlib.lines.Line2D object at 0x175298f10>]
[<matplotlib.lines.Line2D object at 0x1752982e0>]
[<matplotlib.lines.Line2D object at 0x175298d30>]
[<matplotlib.lines.Line2D object at 0x175298370>]
[<matplotlib.lines.Line2D object at 0x17527edf0>]
[<matplotlib.lines.Line2D object at 0x1751edf40>]
[<matplotlib.lines.Line2D object at 0x175228b20>]

Show Code

ax1.set_yticks(list(y_pos))

[<matplotlib.axis.YTick object at 0x1751ff910>, <matplotlib.axis.YTick object at 0x175228040>, <matplotlib.axis.YTick object at 0x175243fd0>, <matplotlib.axis.YTick object at 0x175292e20>, <matplotlib.axis.YTick object at 0x175292820>, <matplotlib.axis.YTick object at 0x175302940>, <matplotlib.axis.YTick object at 0x17528f370>]

Show Code

ax1.set_yticklabels(names, fontsize=9)

[Text(0, 0, 'Log Loan Amount'), Text(0, 1, 'Interest Rate (%)'), Text(0, 2, 'Tenor (Days)'), Text(0, 3, 'Utilisation Rate'), Text(0, 4, 'FCY Exposure'), Text(0, 5, 'Loan Age (Days)'), Text(0, 6, 'Expected CER (%)')]

Show Code

ax1.set_xscale('log')
ax1.set_xlabel('Odds Ratio (log scale)', fontsize=9)

Text(0.5, 0, 'Odds Ratio (log scale)')

Show Code

ax1.set_title('Coefficient Plot', color='#0d2b45', fontweight='bold', fontsize=11)

Text(0.5, 1.0, 'Coefficient Plot')

Show Code

# ── Panel B: ROC curve ─────────────────────────────────────────
ax2 = fig.add_subplot(gs[1])
ax2.plot(fpr, tpr, color='#1a5276', linewidth=2,
         label=f'AUC = {auc_score:.3f}')

[<matplotlib.lines.Line2D object at 0x17533da60>]

Show Code

ax2.plot([0,1],[0,1], linestyle='--', color='grey')

[<matplotlib.lines.Line2D object at 0x17536d040>]

Show Code

ax2.set_xlabel('False Positive Rate', fontsize=9)

Text(0.5, 0, 'False Positive Rate')

Show Code

ax2.set_ylabel('True Positive Rate', fontsize=9)

Text(0, 0.5, 'True Positive Rate')

Show Code

ax2.set_title('ROC Curve', color='#0d2b45', fontweight='bold', fontsize=11)

Text(0.5, 1.0, 'ROC Curve')

Show Code

ax2.legend(fontsize=9)

<matplotlib.legend.Legend object at 0x175375a60>

Show Code

# ── Panel C: Predicted probability distribution ────────────────
ax3 = fig.add_subplot(gs[2])
from scipy.stats import gaussian_kde as gkde
for grp, col, lbl in [(0,'#1a5276','Performing'),(1,'#E24B4A','Non-Performing')]:
    vals = pred_probs_py[y == grp]
    if len(vals) > 5:
        kde = gkde(vals)
        xs  = np.linspace(0, 1, 300)
        ax3.fill_between(xs, kde(xs), alpha=0.4, color=col, label=lbl)
        ax3.plot(xs, kde(xs), color=col, linewidth=1.5)

<matplotlib.collections.PolyCollection object at 0x1703e3f10>
[<matplotlib.lines.Line2D object at 0x175325460>]
<matplotlib.collections.PolyCollection object at 0x17530c130>
[<matplotlib.lines.Line2D object at 0x1753b7880>]

Show Code

ax3.axvline(x=0.3, linestyle='--', color='#922b21', linewidth=1)

<matplotlib.lines.Line2D object at 0x1753aed90>

Show Code

ax3.text(0.32, ax3.get_ylim()[1]*0.9, 'Threshold\n= 0.3',
         color='#922b21', fontsize=8)

Text(0.32, 19.79857425297501, 'Threshold\n= 0.3')

Show Code

ax3.set_xlabel('Predicted NPL Probability', fontsize=9)

Text(0.5, 0, 'Predicted NPL Probability')

Show Code

ax3.set_title('Predicted Probability Distribution',
              color='#0d2b45', fontweight='bold', fontsize=11)

Text(0.5, 1.0, 'Predicted Probability Distribution')

Show Code

ax3.legend(fontsize=9)

<matplotlib.legend.Legend object at 0x1753ca3d0>

Show Code

plt.suptitle('Logistic Regression Diagnostic Plots',
             color='#0d2b45', fontweight='bold', fontsize=13, y=1.02)

Text(0.5, 1.02, 'Logistic Regression Diagnostic Plots')

Show Code

plt.tight_layout()
plt.show()

9.2.0.2 Diagnostic Interpretation

The coefficient plot (Figure 9a) shows which predictors are statistically significant — only those whose confidence interval does not cross the dashed line at OR = 1 are meaningful. The ROC curve (Figure 9b) quantifies overall model discrimination — an AUC materially above 0.5 confirms the model adds predictive value beyond random assignment. The predicted probability distribution (Figure 9c) is the most practically useful diagnostic: if the model is working, the red curve (actual NPL accounts) should be shifted rightward relative to the blue curve (performing accounts), indicating the model assigns higher risk scores to accounts that actually defaulted.

9.3 Confusion Matrix and Classification Threshold

Show Code

# Apply 0.3 threshold (lower than default 0.5 to improve NPL recall)
threshold   <- 0.3
pred_class  <- if_else(pred_probs >= threshold, 1L, 0L)
conf_mat    <- table(Predicted = pred_class, Actual = model_df$is_npl)

cat("Confusion Matrix (threshold = 0.3):\n")

Confusion Matrix (threshold = 0.3):

Show Code

print(conf_mat)

         Actual
Predicted    0    1
        0  277    5
        1 1298   51

Show Code

sensitivity <- conf_mat[2,2] / sum(conf_mat[,2])
specificity <- conf_mat[1,1] / sum(conf_mat[,1])
cat("\nSensitivity (NPL recall):", round(sensitivity, 3), "\n")


Sensitivity (NPL recall): 0.911

Show Code

cat("Specificity (Performing recall):", round(specificity, 3), "\n")

Specificity (Performing recall): 0.176

Show Code

cat("\nBusiness rationale: A 0.3 threshold is preferred over 0.5 in credit risk —",
    "the cost of a missed NPL (unprovisioned loss) far exceeds the cost of a",
    "false alarm (unnecessary monitoring of a healthy account).\n")


Business rationale: A 0.3 threshold is preferred over 0.5 in credit risk — the cost of a missed NPL (unprovisioned loss) far exceeds the cost of a false alarm (unnecessary monitoring of a healthy account).

Show Code

threshold  = 0.3
pred_probs_py = log_model.predict(X_const)
pred_class_py = (pred_probs_py >= threshold).astype(int)
cm = confusion_matrix(y, pred_class_py)

print("Confusion Matrix (threshold = 0.3):")

Confusion Matrix (threshold = 0.3):

Show Code

print(pd.DataFrame(cm,
    index   = ['Predicted: Performing', 'Predicted: NPL'],
    columns = ['Actual: Performing',    'Actual: NPL']))

                       Actual: Performing  Actual: NPL
Predicted: Performing                1575            0
Predicted: NPL                         56            0

Show Code

sensitivity = cm[1,1] / cm[:,1].sum()
specificity = cm[0,0] / cm[:,0].sum()
print(f"\nSensitivity (NPL recall) : {sensitivity:.3f}")


Sensitivity (NPL recall) : nan

Show Code

print(f"Specificity (Performing) : {specificity:.3f}")

Specificity (Performing) : 0.966

9.4 Coefficient Interpretation for Management

Table 9: Coefficient Interpretation — Each Predictor as a Concrete Business Action
Predictor	Odds Ratio (95% CI)	Significance	Business Interpretation & Concrete Management Action
Interest Rate (rate_wins)	OR = 1.006; 95% CI [0.968, 1.046]	Not significant	Each additional 1% on the interest rate is associated with higher NPL odds. This reflects reverse causation: the bank already prices stressed borrowers at higher rates. Action — credit committee should apply an automatic enhanced-monitoring flag to any new facility priced above the 75th percentile rate for its sector, rather than treating a high rate as a sufficient risk mitigant.
Utilisation Rate	OR = 1.038; 95% CI [1.03, 1.047]	Significant	A facility drawn to its full limit carries materially higher NPL odds than a partially-drawn one. Maximal utilisation signals a borrower with no liquidity buffer. Action — add a monthly portfolio screen: any obligor whose utilisation has exceeded 90% for two consecutive months should be automatically escalated to the Relationship Manager's watchlist, and the next credit review brought forward.
Tenor (Days)	OR = 1; 95% CI [1, 1]	Significant	Longer-tenor facilities are associated with different NPL odds. Extended repayment periods expose the bank to more macroeconomic cycles. Action — for tenors exceeding 5 years, require semi-annual (rather than annual) covenant testing and a bullet repayment schedule review at the 3-year mark.
Log Loan Amount	OR = 5.337; 95% CI [3.966, 7.245]	Significant	Larger facilities (on a log scale) are associated with changed NPL odds. Larger exposures concentrate risk on fewer obligors. Action — ensure single-obligor limit utilisation is reviewed at each credit renewal; any facility consuming more than 10% of the sector sub-limit should require Group Credit Committee approval.
FCY Exposure (is_fcy)	OR = 0.873; 95% CI [0.715, 1.065]	Not significant	Foreign-currency facilities carry higher NPL odds, reflecting Naira depreciation risk on obligors with Naira-denominated revenues. Action — for FCY facilities to non-export borrowers, require a formal FX sensitivity analysis showing the obligor can service the debt even at a Naira rate 30% weaker than the booking-date rate.
Loan Age (Days)	OR = 1; 95% CI [1, 1]	Significant	Older facilities are associated with lower NPL odds — loans that have survived longer have demonstrated repayment capacity (survivor bias). Action — the first 18 months of a facility's life is the highest-risk window; new originations should trigger a 6-month post-disbursement review in addition to the standard annual review.
Expected CER (%)	OR = 1.011; 95% CI [1.008, 1.014]	Significant	Higher expected CER is associated with changed NPL odds. Action — include expected CER in the credit appraisal scorecard; facilities where expected CER exceeds 80% at origination should require a mandatory collateral top-up clause in the facility agreement.

10 Integrated Findings

This section synthesises all five analyses into a single coherent recommendation.

10.1 How the Five Analyses Fit Together

All five techniques independently arrive at the same conclusion: Access Bank’s corporate loan portfolio carries concentrated risk in identifiable, manageable dimensions.

EDA revealed that while 95.7% of facilities are performing, the distribution of NPL risk is non-uniform — heavily concentrated by sector (Government, Oil & Gas), facility type (Overdraft), and geography (northern states).
Visualisation transformed these patterns into actionable dashboards, confirming that sector and facility type are the primary risk stratifiers — not absolute loan size.
Hypothesis testing statistically confirmed that the differences observed in step 1 and 2 are not due to chance: Overdrafts are significantly more likely to be NPL than Bonds/LCs (χ² test, p < 0.001), and sector rates differ significantly (Kruskal-Wallis, p < 0.001).
Correlation analysis identified utilisation rate and past-maturity status as the strongest early-warning signals available in real-time portfolio data.
Logistic regression synthesised all signals into a model that quantifies the probability of NPL, providing the basis for an automated early-warning watchlist.

10.2 Single Integrated Recommendation

Access Bank should implement a three-tier portfolio monitoring framework: (1) Sector caps — reduce new Overdraft originations in Government and Oil & Gas sectors until NPL rates fall below the 5% portfolio average; (2) Utilisation triggers — automatically escalate for enhanced monitoring any facility where utilisation exceeds 90% for two consecutive months; (3) Maturity alerts — flag all facilities within 90 days of maturity where the last credit date is more than 60 days ago. These three measures are grounded in the empirical findings: the 13.7% Overdraft NPL rate (Figure 4) versus the 0.9% Bond/Guarantee rate justifies the facility-type restriction; the positive utilisation–NPL correlation (ρ = +0.068, Table 7) supports the 90% draw-down trigger; and the 345 past-maturity accounts currently classified as Performing (identified in the cleaning pipeline) represent the highest-priority latent NPL population for immediate review.

11 Limitations & Further Work

Cross-sectional snapshot: The RAR extract represents a single point in time. A longitudinal panel dataset (monthly observations per facility over 24+ months) would enable survival analysis, more robust vintage analysis, and ARIMA-based NPL forecasting.
Class imbalance: With only 4.35% NPL, the logistic regression model may underperform on the minority class. Future work should explore SMOTE oversampling, random forests, or gradient boosting (XGBoost) to improve recall on non-performing accounts.
Omitted variables: Key predictors unavailable in the RAR — borrower financial statements, industry GDP growth, obligor-level cash flow data — would materially improve the logistic model’s discriminatory power.
Endogeneity in pricing: The positive rate–NPL relationship may reflect reverse causation (higher rates charged to already-stressed borrowers). A causal identification strategy — e.g., regression discontinuity around credit committee pricing thresholds — would be needed to isolate the true causal effect.
Geographic granularity: State-level aggregation masks intra-state risk variation. LGA-level data combined with macroeconomic indicators (IGR, infrastructure quality) would improve geographic risk modelling.
Regulatory classification lag: The CBN’s five-category asset quality classification (Performing, Watch, Substandard, Doubtful, Lost) uses backward-looking criteria (days past due). The model developed here is forward-looking (predicting probability of future NPL). A hybrid framework — combining the CBN classification as a lagged indicator with the model’s predicted probability as a leading indicator — would provide a more complete picture of portfolio risk than either measure alone.

References

Primary Sources

Chidozie Chukwuemeka. (2026). RAR corporate customers data [Dataset]. Collected from Access Bank Plc, Corporate Banking Group, Lagos, Nigeria. Data available on request from the author.

Course Textbook

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Regulatory & Industry Sources

Central Bank of Nigeria. (2019). Prudential guidelines for deposit money banks in Nigeria. CBN. https://www.cbn.gov.ng

Central Bank of Nigeria. (2025, June 13). Circular on regulatory forbearance: Capital retention directive for banks under forbearance regimes. CBN. https://www.cbn.gov.ng/AboutCBN/Reforms.html

Fitch Ratings. (2025). Nigerian banks ready to exit longstanding forbearance [Peer credit analysis]. Fitch Ratings.

Harrell, F. E. (2015). Regression modeling strategies (2nd ed.). Springer. https://doi.org/10.1007/978-3-319-19425-7

R Language and Packages

R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Wickham, H., & Bryan, J. (2023). readxl: Read Excel files [R package]. https://CRAN.R-project.org/package=readxl

Firke, S. (2023). janitor: Simple tools for examining and cleaning dirty data [R package]. https://CRAN.R-project.org/package=janitor

Waring, E., Quinn, M., McNamara, A., Arino de la Rubia, E., Zhu, H., & Ellis, S. (2022). skimr: Compact and flexible summaries of data [R package]. https://CRAN.R-project.org/package=skimr

Cui, B. (2020). DataExplorer: Automate data exploration and treatment [R package]. https://CRAN.R-project.org/package=DataExplorer

Wei, T., & Simko, V. (2021). corrplot: Visualization of a correlation matrix [R package]. https://github.com/taiyun/corrplot

Kassambara, A. (2023). ggcorrplot: Visualization of a correlation matrix using ggplot2 [R package]. https://CRAN.R-project.org/package=ggcorrplot

Wickham, H., & Seidel, D. (2022). scales: Scale functions for visualization [R package]. https://CRAN.R-project.org/package=scales

Xie, Y. (2023). knitr: A general-purpose package for dynamic report generation in R [R package]. https://yihui.org/knitr/

Zhu, H. (2024). kableExtra: Construct complex table with ‘kable’ and pipe syntax [R package]. https://CRAN.R-project.org/package=kableExtra

Fox, J., & Weisberg, S. (2019). An R companion to applied regression (3rd ed.). SAGE. https://socialsciences.mcmaster.ca/jfox/Books/Companion/

Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C., & Müller, M. (2011). pROC: An open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77. https://doi.org/10.1186/1471-2105-12-77

Robinson, D., Hayes, A., & Couch, S. (2023). broom: Convert statistical objects into tidy tibbles [R package]. https://CRAN.R-project.org/package=broom

Arnold, J. B. (2021). ggthemes: Extra themes, scales and geoms for ggplot2 [R package]. https://CRAN.R-project.org/package=ggthemes

Pedersen, T. L. (2024). patchwork: The composer of plots [R package]. https://CRAN.R-project.org/package=patchwork

Komsta, L., & Novomestky, F. (2022). moments: Moments, cumulants, skewness, kurtosis and related tests [R package]. https://CRAN.R-project.org/package=moments

Dinno, A. (2017). dunn.test: Dunn’s test of multiple comparisons using rank sums [R package]. https://CRAN.R-project.org/package=dunn.test

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

Python Packages

Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.

McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a

Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., … Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585, 357–362. https://doi.org/10.1038/s41586-020-2649-2

Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90–95. https://doi.org/10.1109/MCSE.2007.55

Waskom, M. L. (2021). seaborn: Statistical data visualization. Journal of Open Source Software, 6(60), 3021. https://doi.org/10.21105/joss.03021

Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., … SciPy 1.0 Contributors. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17, 261–272. https://doi.org/10.1038/s41592-019-0686-2

Seabold, S., & Perktold, J. (2010). Statsmodels: Econometric and statistical modeling with Python. In Proceedings of the 9th Python in Science Conference (pp. 92–96). https://doi.org/10.25080/Majora-92bf1922-011

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Appendix: AI Usage Statement

This document was prepared using Claude (Anthropic, 2026) as a coding assistant to help structure R and Python code chunks, suggest appropriate statistical test implementations, and review YAML formatting. All analytical decisions — the choice of Case Study 1, the five techniques selected, the business framing of each hypothesis, the interpretation of model outputs, and the integrated recommendation — were made independently based on my professional experience as a Relationship Manager and the analytical training received in the Data Analytics II course at Lagos Business School. The data was collected and extracted by me directly from the bank’s internal Risk Asset Report system. Any outputs produced by AI tools were reviewed, modified, and validated before inclusion.

GitHub Repository: The source .qmd file and anonymised dataset for this submission are available at: Corporate Loan Book Analysis