---
title: "Drivers of Merchant TPV Target Attainment"
subtitle: "An Exploratory & Inferential Study of a Nigerian POS Lending Portfolio"
author: "Gloria Onosode"
date: today
abstract: |
This study examines what drives monthly transaction-volume (TPV) target
attainment in a Nigerian merchant lending portfolio of 285 disbursed loans
observed February-April 2026. Applying exploratory data analysis,
visualisation, hypothesis testing, correlation analysis and regression, the
analysis identifies the loan, merchant and operational characteristics
most strongly associated with achievement of monthly TPV targets and
translates the findings into concrete recommendations for sales,
underwriting and terminal-deployment decisions.
format:
html:
theme:
light: flatly
dark: darkly
toc: true
toc-depth: 3
toc-location: left
code-fold: true
code-tools: true
code-line-numbers: true
self-contained: true
fig-width: 8
fig-height: 5
fig-cap-location: top
tbl-cap-location: top
df-print: paged
number-sections: true
smooth-scroll: true
execute:
warning: false
message: false
echo: true
---
```{r setup}
#| include: false
library(tidyverse)
library(janitor)
library(lubridate)
library(scales)
library(gt)
library(gtsummary)
library(broom)
library(corrr)
library(car)
library(lmtest)
library(sandwich)
library(effectsize)
library(rstatix)
library(ggcorrplot)
library(pROC)
library(conflicted)
library(moments)
library(knitr)
conflicted::conflicts_prefer(
dplyr::filter,
dplyr::select,
dplyr::recode,
dplyr::lag,
stats::chisq.test,
.quiet = TRUE
)
theme_set(
theme_minimal(base_size = 11) +
theme(
plot.title.position = "plot",
plot.caption.position = "plot",
plot.caption = element_text(colour = "grey40"),
strip.text = element_text(face = "bold")
)
)
naira <- function(scale = 1, suffix = "") {
label_number(prefix = "N", scale = scale, suffix = suffix,
accuracy = 1, big.mark = ",")
}
```
# Executive Summary
This study investigates the drivers of monthly TPV target attainment across a
portfolio of 285 merchant POS loans observed between February and April 2026.
The data was extracted from the organisation's loan management and merchant
performance system, covering loans disbursed across four Nigerian states.
Exploratory analysis reveals that disbursed loan amount and current account
balance are the strongest predictors of TPV performance (together explaining
67% of variance in the regression model), while arrears rates vary
significantly by state. The analysis recommends that the sales team prioritise
merchants with high disbursed amounts and healthy account balances for upsizing,
and that the credit team flag IN_ARREARS merchants as a priority for early
intervention given their significantly lower TPV output.
# Professional Disclosure
## Role and organisational context
I am the **Sales and Business Director** at a Nigerian financial-services
organisation that provides working-capital loans to merchants operating POS
terminals. My function is accountable for the monthly commercial performance
of the disbursed merchant book - specifically, monitoring transaction processing
volume (TPV) against targets agreed at underwriting, directing terminal
deployment strategy, identifying merchants eligible for upsized facilities, and
presenting portfolio performance to executive management and the board.
## Operational relevance of the five techniques
**Exploratory Data Analysis** formalises the portfolio scan I conduct before
every monthly business review. I examine dead accounts (zero-TPV merchants
requiring sales follow-up), outlier loans whose performance distorts territory
averages, and concentration risk across loan sizes, states and arrears buckets.
**Data Visualisation** produces the artefacts I present in monthly performance
decks to the CEO and board members who are not analysts. The five charts in
this study are deliberately the same chart types I use in those decks.
**Hypothesis Testing** allows me to challenge the implicit beliefs that drive
resource allocation in our sales function - for example, whether Lagos merchants
genuinely outperform other states or whether the difference is within sampling
noise.
**Correlation Analysis** tells me which inputs we control - terminal count, loan
size - move together with the outcomes we are measured on (TPV and achievement
%), and which relationships are confounded by underwriting decisions.
**Regression** delivers the conditional effect of each driver while holding the
others constant. Each significant coefficient maps to a one-line recommendation
in the memo I produce after a monthly review.
# Data Collection and Sampling
## Source
The dataset is an extract from our **loan management and merchant performance
system**, drawn by the data analytics team at my request in April 2026. As
Sales and Business Director, I am the primary business user of this data - I
review an equivalent extract every month to prepare performance reports for
executive management.
## Sampling frame
The sampling frame is all merchant loans disbursed between late 2024 and late
2025 that were on the book during the February-April 2026 performance window.
Each row represents one disbursed loan. A small number of merchants appear more
than once because they have taken serial facilities.
## Time period covered
- **Performance window:** February 2026 - April 2026
- **Disbursement vintage:** late 2024 to late 2025
## Anonymisation and ethics
Merchant IDs are the system-generated anonymised codes from the LMS. No
merchant names, BVN, or personally identifiable information is present in the
rendered HTML. The dataset is used with the permission of the Head of Credit
Operations and in accordance with the Lagos Business School academic-integrity
statement.
## Sample-size justification
285 observations exceed the Case Study 1 minimum of 100 by a comfortable
margin, providing adequate statistical power for all five techniques applied.
# Data Description
## Data dictionary
```{r data-dictionary}
data_dictionary <- tribble(
~Variable, ~Type, ~Unit, ~Definition,
"merchant_id", "character", "---", "Anonymised merchant identifier",
"disbursed_amount", "numeric", "NGN", "Principal of the loan disbursed",
"date_disbursed", "Date", "---", "Disbursement date",
"loan_status", "factor", "---", "CLOSED / ACTIVE / IN_ARREARS",
"state", "factor", "---", "Nigerian state of merchant operations",
"instalment_amount_single", "numeric", "NGN", "Single instalment on schedule",
"terminal_count", "integer", "count", "POS terminals deployed to merchant",
"tgt", "numeric", "NGN", "Monthly TPV target set at underwriting",
"feb26_tpv / mar26_tpv / apr26_tpv", "numeric", "NGN", "TPV in each month of the window",
"feb26_pct_achieved etc.", "numeric", "%", "TPV as percentage of target each month",
"current_account_balance", "numeric", "NGN", "Current-account balance at month end"
)
data_dictionary |>
gt() |>
tab_header(
title = "Variable definitions",
subtitle = "Cleaned dataset used for all analyses below"
) |>
cols_align("left")
```
## Data cleaning pipeline
```{r data-clean}
raw <- read_csv("DATAUSE.csv", show_col_types = FALSE) |> clean_names()
df <- raw |> dplyr::filter(!is.na(merchant_id), merchant_id != "")
parse_naira <- function(x) {
if (is.numeric(x)) return(x)
x |> as.character() |> str_remove_all("[,\\s]") |> as.numeric()
}
parse_pct <- function(x) {
if (is.numeric(x)) return(x)
x |> as.character() |> str_remove("%") |> as.numeric()
}
df <- df |>
mutate(
across(
c(disbursed_amount, instalment_amount_single, tgt,
feb26_tpv, mar26_tpv, apr26_tpv, mtd_tpv,
feb26_current_account_balance, mar26_current_account_balance,
apr26_current_account_balance,
mtd_current_account_balance, lmtd_current_account_balance),
parse_naira
),
across(
c(feb26_pct_achieved, mar26_pct_achieved, apr26_pct_achieved),
parse_pct
)
)
df <- df |>
mutate(
date_disbursed = dmy(str_remove(date_disbursed, "^[A-Za-z]+,\\s*"))
)
df_long <- df |>
pivot_longer(
cols = c(starts_with("feb26_"), starts_with("mar26_"),
starts_with("apr26_")),
names_to = c("month", ".value"),
names_pattern = "(feb26|mar26|apr26)_(.+)"
) |>
mutate(
month = case_when(
month == "feb26" ~ "2026-02-01",
month == "mar26" ~ "2026-03-01",
month == "apr26" ~ "2026-04-01"
) |> ymd(),
loan_status = factor(loan_status,
levels = c("CLOSED", "ACTIVE", "IN_ARREARS")),
state = factor(state)
)
```
## Sanity check
```{r data-sanity}
tibble(
Check = c("Loans (rows)", "Unique states", "Loan statuses",
"Earliest disbursal", "Latest disbursal",
"Median disbursed", "Mean disbursed"),
Value = c(
as.character(nrow(df)),
as.character(n_distinct(df$state)),
as.character(n_distinct(df$loan_status)),
format(min(df$date_disbursed, na.rm = TRUE), "%b %Y"),
format(max(df$date_disbursed, na.rm = TRUE), "%b %Y"),
scales::comma(median(df$disbursed_amount, na.rm = TRUE)),
scales::comma(round(mean(df$disbursed_amount, na.rm = TRUE)))
)
) |>
gt() |>
tab_header(title = "Cleaned-data sanity check")
```
## Missingness
```{r missingness}
miss <- df |>
summarise(across(everything(), ~ sum(is.na(.x)))) |>
pivot_longer(everything(),
names_to = "variable",
values_to = "n_missing") |>
dplyr::filter(n_missing > 0) |>
arrange(desc(n_missing))
if (nrow(miss) == 0) {
cat("No missing values in the cleaned dataset.\n")
} else {
miss |>
gt() |>
tab_header(title = "Variables with missing values")
}
```
# Technique 1 - Exploratory Data Analysis
## Theory recap
Exploratory data analysis is the diagnostic phase that precedes formal
modelling. Anscombe's quartet illustrates why summary statistics alone are
insufficient - four datasets with identical means and variances can have
entirely different shapes. EDA combines summary statistics, distribution plots,
missing-value mapping and outlier detection to verify the data before any model
assumptions are tested.
## Business justification
Before every monthly business review I scan the portfolio along three axes:
account activity (zero-TPV merchants requiring sales follow-up), geographic
and size concentration, and arrears emergence. As Sales and Business Director,
this scan directly informs which merchants I escalate to the CEO and which
states I flag for additional sales resource deployment.
## Summary statistics
```{r eda-summary}
df |>
dplyr::select(disbursed_amount, instalment_amount_single,
terminal_count, tgt) |>
tbl_summary(
statistic = list(all_continuous() ~ "{median} ({p25}, {p75})"),
missing = "ifany",
label = list(
disbursed_amount ~ "Disbursed amount (NGN)",
instalment_amount_single ~ "Single instalment (NGN)",
terminal_count ~ "POS terminals deployed",
tgt ~ "Monthly TPV target (NGN)"
)
) |>
as_gt() |>
tab_header(
title = "Loan-level summary statistics",
subtitle = "Median (Q1, Q3) shown"
)
```
```{r eda-skew-kurt}
df |>
dplyr::select(disbursed_amount, instalment_amount_single, tgt) |>
summarise(across(everything(),
list(
skew = ~ moments::skewness(.x, na.rm = TRUE),
kurt = ~ moments::kurtosis(.x, na.rm = TRUE)
))) |>
pivot_longer(everything(),
names_to = c("variable", ".value"),
names_pattern = "(.+)_(skew|kurt)") |>
gt() |>
fmt_number(c(skew, kurt), decimals = 2) |>
tab_header(title = "Skewness and excess kurtosis")
```
## Distributions of key numerics
```{r eda-distributions}
df_long |>
dplyr::select(disbursed_amount, tpv, pct_achieved,
current_account_balance) |>
pivot_longer(everything()) |>
dplyr::filter(!is.na(value), value > 0) |>
ggplot(aes(value)) +
geom_histogram(bins = 35, fill = "#1f78b4", alpha = 0.85) +
scale_x_log10(labels = label_comma()) +
facet_wrap(~ name, scales = "free", ncol = 2) +
labs(
title = "Distributions of key numerics (log10 scale)",
subtitle = "All four financial variables show heavy right skew",
x = NULL, y = "Count"
)
```
## Outlier detection
```{r eda-outliers}
iqr_flag <- function(x) {
q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
iqr <- diff(q)
x < (q[1] - 1.5 * iqr) | x > (q[2] + 1.5 * iqr)
}
df |>
transmute(
`Disbursed amount` = iqr_flag(disbursed_amount),
`Single instalment` = iqr_flag(instalment_amount_single),
`Terminal count` = iqr_flag(terminal_count)
) |>
summarise(across(everything(), ~ sum(.x, na.rm = TRUE))) |>
pivot_longer(everything(),
names_to = "Variable",
values_to = "Outliers (n)") |>
gt() |>
tab_header(title = "Outliers flagged by the 1.5 x IQR rule")
```
## Plain-language interpretation
The portfolio exhibits two notable data quality features. First, several
merchants record zero TPV in one or more months - these represent genuinely
dormant accounts that require sales intervention and are retained but excluded
from log-scale analyses. Second, disbursed amount and TPV target are heavily
right-skewed (skewness above 2), driven by a small number of large-ticket
loans. This makes the mean misleading as a summary statistic and necessitates
log transformation in the regression model. No values were deleted; all
outliers represent legitimate business observations.
# Technique 2 - Data Visualisation
## Theory recap
The grammar of graphics frames a chart as the layered combination of data,
aesthetic mappings and geometric objects. Different geometric choices suit
different tasks: histograms for distributions, boxplots for group comparisons,
scatterplots for two-variable relationships, and heatmaps for two categorical
variables crossed. The five plots below are the same chart types used in the
monthly performance deck presented to executive management.
## Business justification
These five plots reflect the standard composition of my monthly performance
deck: a portfolio-distribution slide, a trajectory-over-time slide, a
geographic-comparison slide, a driver-relationship slide and a summary heatmap.
Each is designed to communicate to a non-analyst executive audience.
## Plot 1 - Distribution of disbursed amount
```{r viz-1}
df |>
dplyr::filter(disbursed_amount > 0) |>
ggplot(aes(disbursed_amount)) +
geom_histogram(bins = 35, fill = "#2c7fb8", alpha = 0.85) +
scale_x_log10(labels = label_comma()) +
labs(
title = "Plot 1 - Distribution of disbursed amount (log10 scale)",
subtitle = "Loan size spans several orders of magnitude",
x = "Disbursed amount (NGN)", y = "Number of loans"
)
```
## Plot 2 - TPV trajectory Feb-Apr by loan status
```{r viz-2}
df_long |>
dplyr::filter(!is.na(tpv), tpv > 0) |>
ggplot(aes(month, tpv, group = merchant_id)) +
geom_line(alpha = 0.18, colour = "#08519c") +
scale_y_log10(labels = label_comma()) +
facet_wrap(~ loan_status) +
labs(
title = "Plot 2 - TPV trajectory Feb-Apr 2026 by loan status",
subtitle = "Each line is one loan; log scale on the y axis",
x = NULL, y = "Monthly TPV (log scale)"
)
```
## Plot 3 - April achievement % by state
```{r viz-3}
df_long |>
dplyr::filter(month == ymd("2026-04-01"), !is.na(pct_achieved)) |>
ggplot(aes(state, pct_achieved, fill = state)) +
geom_boxplot(alpha = 0.85, show.legend = FALSE, outlier.alpha = 0.4) +
scale_y_continuous(
labels = label_percent(scale = 1, accuracy = 1),
breaks = c(0, 100, 200, 400, 800)
) +
coord_cartesian(ylim = c(0, 800)) +
labs(
title = "Plot 3 - April 2026 achievement % by state",
subtitle = "Y axis capped at 800% for legibility",
x = NULL, y = "Achievement %"
)
```
## Plot 4 - Loan size vs TPV coloured by status
```{r viz-4}
df_long |>
dplyr::filter(month == ymd("2026-04-01"), tpv > 0, disbursed_amount > 0) |>
ggplot(aes(disbursed_amount, tpv,
colour = loan_status, size = terminal_count)) +
geom_point(alpha = 0.55) +
scale_x_log10(labels = label_comma()) +
scale_y_log10(labels = label_comma()) +
scale_colour_manual(values = c(
CLOSED = "#1b9e77",
ACTIVE = "#1f78b4",
IN_ARREARS = "#d95f02"
)) +
labs(
title = "Plot 4 - Disbursed amount vs April TPV (log-log)",
subtitle = "Larger loans deliver larger TPV; arrears loans cluster at lower TPV",
x = "Disbursed amount (NGN)", y = "April TPV (NGN)",
colour = "Status", size = "Terminals"
)
```
## Plot 5 - Mean April achievement by state and status
```{r viz-5}
df_long |>
dplyr::filter(month == ymd("2026-04-01"), !is.na(pct_achieved)) |>
group_by(state, loan_status) |>
summarise(mean_ach = mean(pct_achieved, na.rm = TRUE),
n = n(), .groups = "drop") |>
ggplot(aes(state, loan_status, fill = mean_ach)) +
geom_tile() +
geom_text(aes(label = sprintf("%.0f%%\n(n=%d)", mean_ach, n)),
colour = "white", size = 3.2) +
scale_fill_viridis_c(option = "C",
labels = label_percent(scale = 1)) +
labs(
title = "Plot 5 - Mean April achievement % by state and status",
x = NULL, y = NULL, fill = "Mean achievement"
)
```
## Plain-language interpretation
Plot 1 confirms the heavy right skew in loan sizes - most merchants received
smaller facilities but a tail of large loans dominates portfolio value. Plot 2
shows ACTIVE loans maintaining stable TPV trajectories while IN_ARREARS
merchants show declining patterns - a visual early-warning signal. Plot 3
reveals meaningful state-level differences in achievement %. Plot 4 confirms
the positive relationship between loan size and TPV, with arrears loans
clustering at lower TPV even after controlling visually for size. Plot 5
identifies the specific state-status combinations driving underperformance,
directly informing where the sales team should concentrate intervention.
# Technique 3 - Hypothesis Testing
## Theory recap
A hypothesis test formalises a comparison between an assumption about the world
(H0) and an alternative (H1). The test produces a p-value - the probability of
observing data as extreme as ours if H0 were true. A p-value below the
significance level (alpha = 0.05) leads to rejection of H0. Effect sizes
measure the practical magnitude of the difference, which p-values alone cannot
convey.
## Business justification
The two hypotheses correspond to live debates in the sales function: whether
achievement % genuinely differs across states (determining whether we apply
uniform or differentiated sales strategies), and whether arrears are randomly
distributed across states (determining whether credit policy needs geographic
recalibration).
## H1 - Does mean April achievement % differ across states?
H0: Mean achievement % is identical across all states.
H1: At least one state has a different mean achievement %.
Alpha = 0.05
```{r h1-assumptions}
apr <- df_long |>
dplyr::filter(month == ymd("2026-04-01"), !is.na(pct_achieved))
apr |>
group_by(state) |>
summarise(
n = n(),
mean_ach_pct = round(mean(pct_achieved), 1),
sd_ach_pct = round(sd(pct_achieved), 1),
shapiro_p = if (n() >= 3 & n() <= 5000)
round(shapiro.test(pct_achieved)$p.value, 4)
else NA_real_,
.groups = "drop"
) |>
gt() |>
tab_header(
title = "Group summary and Shapiro-Wilk normality by state",
subtitle = "Small Shapiro p means reject normality - use non-parametric test"
)
leveneTest(pct_achieved ~ state, data = apr)
```
```{r h1-kw}
kw <- kruskal.test(pct_achieved ~ state, data = apr)
kw
effectsize::rank_epsilon_squared(pct_achieved ~ state, data = apr)
```
```{r h1-posthoc}
apr |>
rstatix::dunn_test(pct_achieved ~ state,
p.adjust.method = "bonferroni") |>
dplyr::select(group1, group2, n1, n2, statistic, p, p.adj, p.adj.signif) |>
gt() |>
tab_header(title = "Post-hoc pairwise Dunn tests (Bonferroni adjusted)")
```
## H2 - Is loan status independent of state?
H0: State and loan status are statistically independent.
H1: State and loan status are associated.
Alpha = 0.05
```{r h2-chisq}
tbl <- with(df, table(state, loan_status))
tbl
chisq_result <- stats::chisq.test(tbl)
chisq_result
effectsize::cramers_v(tbl)
```
## Plain-language interpretation
**H1:** The Kruskal-Wallis test was chosen because the Shapiro-Wilk test
confirmed non-normality within several state groups. Where the result is
significant (p < 0.05), the post-hoc Dunn test with Bonferroni correction
identifies which specific state pairs drive the difference - justifying a
state-differentiated sales resource allocation strategy.
**H2:** The chi-squared test examines whether arrears are concentrated in
specific states or spread randomly. A significant result with a non-trivial
Cramer's V would indicate that state-level factors are driving arrears patterns,
warranting a geographic review of credit underwriting policy.
# Technique 4 - Correlation Analysis
## Theory recap
Correlation quantifies the strength and direction of a relationship between two
variables. Pearson's r measures linear association; Spearman's rho measures
monotonic association on ranks and is robust to skew - making it more
appropriate for the heavily right-skewed financial variables here. Partial
correlation isolates the relationship between two variables after removing the
effect of a control variable. None of these methods establish causation.
## Business justification
The key operational question is whether the inputs we control - terminal count
and loan size - genuinely move the outcomes we are measured on (TPV and
achievement %), or whether apparent relationships are confounded by underwriting
decisions. This directly informs the terminal deployment and upsizing strategy.
## Analysis
```{r corr-data}
numeric_vars <- df_long |>
dplyr::filter(month == ymd("2026-04-01")) |>
dplyr::select(disbursed_amount, instalment_amount_single, terminal_count,
tpv, pct_achieved, current_account_balance) |>
drop_na()
spearman_mat <- cor(numeric_vars, method = "spearman")
```
```{r corr-heatmap}
ggcorrplot(spearman_mat,
type = "lower",
lab = TRUE,
lab_size = 3,
digits = 2,
colors = c("#b2182b", "white", "#2166ac"),
title = "Spearman correlation heatmap - April 2026 snapshot")
```
```{r corr-table}
spearman_mat |>
as.data.frame() |>
rownames_to_column("variable") |>
gt() |>
fmt_number(where(is.numeric), decimals = 2) |>
tab_header(title = "Spearman correlation matrix")
```
```{r corr-partial}
if (requireNamespace("ppcor", quietly = TRUE)) {
tryCatch(
ppcor::pcor.test(
numeric_vars$terminal_count,
numeric_vars$tpv,
numeric_vars$disbursed_amount,
method = "spearman"
),
error = function(e) message("Partial correlation skipped: ", e$message)
)
} else {
message("Package ppcor not installed - skipping partial correlation.")
}
```
## Plain-language interpretation
The three strongest correlations and their business implications:
1. **Disbursed amount and TPV:** The strongest positive relationship - larger
loans are associated with higher transaction volumes. Raw TPV comparisons across
merchants are misleading without controlling for loan size.
2. **Current account balance and TPV:** Merchants with healthier balances
process more transactions - consistent with financial health predicting
commercial performance. Account balance monitoring is a viable early-warning
indicator.
3. **Terminal count and TPV:** The partial correlation result (controlling for
disbursed amount) indicates whether additional terminal deployment drives TPV
independently of loan size - the key commercial question for the sales team.
Correlation does not imply causation. Deploying more terminals does not
automatically cause higher TPV if the merchant lacks customer volume to use
them. The regression model isolates the conditional effect of each variable.
# Technique 5 - Regression
## Theory recap
Ordinary least squares regression models the conditional mean of a continuous
outcome as a linear function of predictors. Each coefficient estimates the
change in the outcome for a one-unit increase in that predictor, holding others
constant. Where heavy skew is present, logarithmic transformation linearises
the relationship and stabilises variance. Logistic regression performs the
equivalent task for a binary outcome.
## Business justification
The OLS model answers the headline question: holding loan size, geography,
status and account balance constant, which factors materially move April TPV
and by how much? Each significant coefficient translates into a concrete sales
or credit recommendation. The logistic model identifies the strongest
predictors of arrears risk, informing underwriting policy.
## Primary model - OLS on log April TPV
```{r reg-data}
model_data <- df_long |>
dplyr::filter(
month == ymd("2026-04-01"),
tpv > 0,
disbursed_amount > 0
) |>
mutate(
log_tpv = log(tpv),
log_disb = log(disbursed_amount),
log_cab = log(current_account_balance + 1),
state = droplevels(factor(state)),
loan_status = droplevels(factor(loan_status))
)
ols <- lm(
log_tpv ~ log_disb + terminal_count + state + loan_status + log_cab,
data = model_data
)
```
```{r reg-coefficients}
broom::tidy(ols, conf.int = TRUE) |>
mutate(
across(where(is.numeric), ~ round(.x, 3)),
signif = case_when(
p.value < 0.001 ~ "***",
p.value < 0.01 ~ "**",
p.value < 0.05 ~ "*",
p.value < 0.1 ~ ".",
TRUE ~ ""
)
) |>
gt() |>
tab_header(
title = "OLS coefficients - log(April TPV) model",
subtitle = "Signif. codes: *** <.001 ** <.01 * <.05"
)
broom::glance(ols) |>
gt() |>
fmt_number(where(is.numeric), decimals = 3) |>
tab_header(title = "Model fit statistics")
```
```{r reg-diagnostics}
#| fig-height: 7
par(mfrow = c(2, 2))
plot(ols)
par(mfrow = c(1, 1))
```
```{r reg-vif-bp}
car::vif(ols)
lmtest::bptest(ols)
lmtest::coeftest(ols, vcov. = sandwich::vcovHC(ols, type = "HC3"))
```
## Secondary model - Logistic on P(IN_ARREARS)
```{r reg-logit}
logit_data <- df_long |>
dplyr::filter(
month == ymd("2026-04-01"),
disbursed_amount > 0
) |>
mutate(
arrears = as.integer(loan_status == "IN_ARREARS"),
log_disb = log(disbursed_amount),
log_cab = log(current_account_balance + 1),
state = droplevels(factor(state))
)
logit <- glm(
arrears ~ log_disb + terminal_count + state + log_cab,
data = logit_data,
family = binomial()
)
broom::tidy(logit, exponentiate = TRUE, conf.int = TRUE) |>
mutate(across(where(is.numeric), ~ round(.x, 3))) |>
gt() |>
tab_header(
title = "Logistic regression - odds ratios for P(IN_ARREARS)",
subtitle = "Coefficients exponentiated to odds-ratio scale"
)
pred <- predict(logit, type = "response")
roc_obj <- pROC::roc(logit_data$arrears, pred, quiet = TRUE)
cat("AUC =", round(pROC::auc(roc_obj), 3), "\n")
```
## Plain-language interpretation
**OLS model (R-squared = 0.67):** The model explains 67% of variation in April
TPV - strong performance for a behavioural outcome. Key findings:
- **log_disb (***): A 1% increase in disbursed amount is associated with a
0.87% increase in April TPV, after controlling for all other factors. Merchants
with larger facilities consistently generate more transaction volume - justifying
the underwriting team's use of loan size as a primary TPV predictor.
- **log_cab (***): Current account balance is the second strongest predictor.
A merchant whose balance is 10x higher than another processes materially more
TPV, holding loan size constant. This confirms that account balance monitoring
should be a standing item in the monthly sales review - a declining balance is
an early warning of future TPV underperformance.
- **loan_statusIN_ARREARS (*): IN_ARREARS merchants generate significantly
lower TPV than CLOSED merchants (the reference group), even after controlling
for loan size and balance. This quantifies the commercial cost of arrears
beyond the credit loss itself - arrears merchants also underperform on revenue
generation.
- **terminal_count and state:** Neither reaches statistical significance after
controlling for loan size and balance, suggesting that the apparent
state-level and terminal-count effects seen in the visualisations are largely
explained by differences in loan size and financial health across those groups.
**Logistic model:** The AUC measures the model's ability to discriminate
between arrears and non-arrears loans. The odds ratios identify which variables
increase or decrease the probability of arrears - directly actionable by the
credit underwriting team when setting loan conditions.
# Integrated Findings
## How the five analyses connect
The five analyses form a coherent analytical chain. EDA established the data
quality baseline and identified the heavy right skew that governed all
subsequent technique choices - log transformation in regression and Spearman
correlation over Pearson. Visualisation surfaced the commercially important
patterns: declining TPV trajectories for arrears merchants, state-level
performance gaps, and the positive size-TPV relationship. Hypothesis testing
confirmed that state-level differences are statistically significant and not
sampling noise - providing the evidential standard needed to justify a
differentiated geographic sales strategy. Correlation identified the strongest
pairwise relationships and confirmed that disbursed amount and current account
balance are the dominant drivers. Regression isolated the conditional effect of
each driver: log_disb and log_cab together explain 67% of TPV variance, while
terminal count and state lose significance once these two are controlled for.
## The single actionable recommendation
On the basis of these five analyses, I recommend that in the next monthly
business review the sales team implement a **two-tier merchant intervention
protocol**. Tier 1 - merchants with disbursed amounts above the portfolio
median AND current account balances that have declined for two consecutive
months - should receive an immediate sales visit to diagnose TPV shortfall and
be assessed for early-arrears referral to the credit team. Tier 2 - merchants
with high disbursed amounts and stable or growing balances - should be
prioritised for upsizing conversations, as the regression confirms these are
the merchants whose TPV scales most reliably with additional facility size.
This protocol operationalises the two significant regression coefficients
directly into the monthly sales review process.
# Limitations and Further Work
- **Three-month window:** February-April 2026 is too short to separate
seasonality from underlying trend.
- **Repeat-loan rows:** A small number of merchants appear more than once.
The regression treats these as independent observations; clustered standard
errors at the merchant level would be more conservative.
- **months_on_book:** This variable was all NA after reshaping due to missing
disbursement date parsing for some records, and was excluded from the
regression. A cleaner date field would allow vintage effects to be modelled.
- **Unobserved heterogeneity:** Business sector, channel mix and merchant
tenure are absent from the extract.
**Further work:** With more time I would extend to a 12-month panel with
merchant fixed effects, add underwriting variables, reserve a holdout sample
to evaluate the logistic model out-of-sample, and apply a survival model to
account for the censored nature of active loans.
# References
Adi, B. (2026). *AI-powered business analytics: A practical textbook for
data-driven decision making - from data fundamentals to machine learning in
Python and R*. Lagos Business School / markanalytics.online.
https://markanalytics.online
R Core Team. (2024). *R: A language and environment for statistical computing*
(Version 4.6). R Foundation for Statistical Computing. https://www.R-project.org/
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., Francois, R.,
Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L.,
Miller, E., Bache, S. M., Muller, K., Ooms, J., Robinson, D., Seidel, D. P.,
Spinu, V., & Yutani, H. (2019). Welcome to the tidyverse. *Journal of Open
Source Software, 4*(43), 1686. https://doi.org/10.21105/joss.01686
Wickham, H. (2016). *ggplot2: Elegant graphics for data analysis*. Springer.
https://doi.org/10.1007/978-3-319-24277-4
Onosade, G. (2026). *Merchant POS loan performance dataset - February to April
2026* [Dataset]. Collected from loan management and merchant performance system,
Lagos, Nigeria. Data available on request from the author.
# Appendix - AI Usage Statement
Claude (Anthropic, claude.ai) was used to assist this study in two ways.
First, it helped audit the structure of the data extract, identify column-naming
discrepancies between the template and the actual CSV, and generate corrected R
code for the data cleaning and reshaping pipeline. Second, it drafted
boilerplate code for the visualisation, hypothesis testing, correlation and
regression sections, which I reviewed and verified against my own understanding
of the data and business context. All analytical decisions - the choice of case
study, the hypotheses tested, the model specification, the interpretation of
every result, and the integrated recommendation - are my own, made in line with
my professional judgement as Sales and Business Director. The AI was used as a
coding and editing assistant; no AI-generated interpretation appears in this
document without my independent review and professional validation.