Drivers of Merchant TPV Target Attainment

An Exploratory & Inferential Study of a Nigerian POS Lending Portfolio

Author

Gloria Onosode

Published

May 19, 2026

Abstract

This study examines what drives monthly transaction-volume (TPV) target attainment in a Nigerian merchant lending portfolio of 285 disbursed loans observed February-April 2026. Applying exploratory data analysis, visualisation, hypothesis testing, correlation analysis and regression, the analysis identifies the loan, merchant and operational characteristics most strongly associated with achievement of monthly TPV targets and translates the findings into concrete recommendations for sales, underwriting and terminal-deployment decisions.

1 Executive Summary

This study investigates the drivers of monthly TPV target attainment across a portfolio of 285 merchant POS loans observed between February and April 2026. The data was extracted from the organisation’s loan management and merchant performance system, covering loans disbursed across four Nigerian states. Exploratory analysis reveals that disbursed loan amount and current account balance are the strongest predictors of TPV performance (together explaining 67% of variance in the regression model), while arrears rates vary significantly by state. The analysis recommends that the sales team prioritise merchants with high disbursed amounts and healthy account balances for upsizing, and that the credit team flag IN_ARREARS merchants as a priority for early intervention given their significantly lower TPV output.

2 Professional Disclosure

2.1 Role and organisational context

I am the Sales and Business Director at a Nigerian financial-services organisation that provides working-capital loans to merchants operating POS terminals. My function is accountable for the monthly commercial performance of the disbursed merchant book - specifically, monitoring transaction processing volume (TPV) against targets agreed at underwriting, directing terminal deployment strategy, identifying merchants eligible for upsized facilities, and presenting portfolio performance to executive management and the board.

2.2 Operational relevance of the five techniques

Exploratory Data Analysis formalises the portfolio scan I conduct before every monthly business review. I examine dead accounts (zero-TPV merchants requiring sales follow-up), outlier loans whose performance distorts territory averages, and concentration risk across loan sizes, states and arrears buckets.

Data Visualisation produces the artefacts I present in monthly performance decks to the CEO and board members who are not analysts. The five charts in this study are deliberately the same chart types I use in those decks.

Hypothesis Testing allows me to challenge the implicit beliefs that drive resource allocation in our sales function - for example, whether Lagos merchants genuinely outperform other states or whether the difference is within sampling noise.

Correlation Analysis tells me which inputs we control - terminal count, loan size - move together with the outcomes we are measured on (TPV and achievement %), and which relationships are confounded by underwriting decisions.

Regression delivers the conditional effect of each driver while holding the others constant. Each significant coefficient maps to a one-line recommendation in the memo I produce after a monthly review.

3 Data Collection and Sampling

3.1 Source

The dataset is an extract from our loan management and merchant performance system, drawn by the data analytics team at my request in April 2026. As Sales and Business Director, I am the primary business user of this data - I review an equivalent extract every month to prepare performance reports for executive management.

3.2 Sampling frame

The sampling frame is all merchant loans disbursed between late 2024 and late 2025 that were on the book during the February-April 2026 performance window. Each row represents one disbursed loan. A small number of merchants appear more than once because they have taken serial facilities.

3.3 Time period covered

  • Performance window: February 2026 - April 2026
  • Disbursement vintage: late 2024 to late 2025

3.4 Anonymisation and ethics

Merchant IDs are the system-generated anonymised codes from the LMS. No merchant names, BVN, or personally identifiable information is present in the rendered HTML. The dataset is used with the permission of the Head of Credit Operations and in accordance with the Lagos Business School academic-integrity statement.

3.5 Sample-size justification

285 observations exceed the Case Study 1 minimum of 100 by a comfortable margin, providing adequate statistical power for all five techniques applied.

4 Data Description

4.1 Data dictionary

Code
data_dictionary <- tribble(
  ~Variable,                            ~Type,       ~Unit,    ~Definition,
  "merchant_id",                        "character", "---",    "Anonymised merchant identifier",
  "disbursed_amount",                   "numeric",   "NGN",    "Principal of the loan disbursed",
  "date_disbursed",                     "Date",      "---",    "Disbursement date",
  "loan_status",                        "factor",    "---",    "CLOSED / ACTIVE / IN_ARREARS",
  "state",                              "factor",    "---",    "Nigerian state of merchant operations",
  "instalment_amount_single",           "numeric",   "NGN",    "Single instalment on schedule",
  "terminal_count",                     "integer",   "count",  "POS terminals deployed to merchant",
  "tgt",                                "numeric",   "NGN",    "Monthly TPV target set at underwriting",
  "feb26_tpv / mar26_tpv / apr26_tpv", "numeric",   "NGN",    "TPV in each month of the window",
  "feb26_pct_achieved etc.",            "numeric",   "%",      "TPV as percentage of target each month",
  "current_account_balance",            "numeric",   "NGN",    "Current-account balance at month end"
)

data_dictionary |>
  gt() |>
  tab_header(
    title    = "Variable definitions",
    subtitle = "Cleaned dataset used for all analyses below"
  ) |>
  cols_align("left")
Variable definitions
Cleaned dataset used for all analyses below
Variable Type Unit Definition
merchant_id character --- Anonymised merchant identifier
disbursed_amount numeric NGN Principal of the loan disbursed
date_disbursed Date --- Disbursement date
loan_status factor --- CLOSED / ACTIVE / IN_ARREARS
state factor --- Nigerian state of merchant operations
instalment_amount_single numeric NGN Single instalment on schedule
terminal_count integer count POS terminals deployed to merchant
tgt numeric NGN Monthly TPV target set at underwriting
feb26_tpv / mar26_tpv / apr26_tpv numeric NGN TPV in each month of the window
feb26_pct_achieved etc. numeric % TPV as percentage of target each month
current_account_balance numeric NGN Current-account balance at month end

4.2 Data cleaning pipeline

Code
raw <- read_csv("DATAUSE.csv", show_col_types = FALSE) |> clean_names()

df <- raw |> dplyr::filter(!is.na(merchant_id), merchant_id != "")

parse_naira <- function(x) {
  if (is.numeric(x)) return(x)
  x |> as.character() |> str_remove_all("[,\\s]") |> as.numeric()
}

parse_pct <- function(x) {
  if (is.numeric(x)) return(x)
  x |> as.character() |> str_remove("%") |> as.numeric()
}

df <- df |>
  mutate(
    across(
      c(disbursed_amount, instalment_amount_single, tgt,
        feb26_tpv, mar26_tpv, apr26_tpv, mtd_tpv,
        feb26_current_account_balance, mar26_current_account_balance,
        apr26_current_account_balance,
        mtd_current_account_balance, lmtd_current_account_balance),
      parse_naira
    ),
    across(
      c(feb26_pct_achieved, mar26_pct_achieved, apr26_pct_achieved),
      parse_pct
    )
  )

df <- df |>
  mutate(
    date_disbursed = dmy(str_remove(date_disbursed, "^[A-Za-z]+,\\s*"))
  )

df_long <- df |>
  pivot_longer(
    cols         = c(starts_with("feb26_"), starts_with("mar26_"),
                     starts_with("apr26_")),
    names_to     = c("month", ".value"),
    names_pattern = "(feb26|mar26|apr26)_(.+)"
  ) |>
  mutate(
    month = case_when(
      month == "feb26" ~ "2026-02-01",
      month == "mar26" ~ "2026-03-01",
      month == "apr26" ~ "2026-04-01"
    ) |> ymd(),
    loan_status = factor(loan_status,
                         levels = c("CLOSED", "ACTIVE", "IN_ARREARS")),
    state = factor(state)
  )

4.3 Sanity check

Code
tibble(
  Check = c("Loans (rows)", "Unique states", "Loan statuses",
            "Earliest disbursal", "Latest disbursal",
            "Median disbursed", "Mean disbursed"),
  Value = c(
    as.character(nrow(df)),
    as.character(n_distinct(df$state)),
    as.character(n_distinct(df$loan_status)),
    format(min(df$date_disbursed, na.rm = TRUE), "%b %Y"),
    format(max(df$date_disbursed, na.rm = TRUE), "%b %Y"),
    scales::comma(median(df$disbursed_amount, na.rm = TRUE)),
    scales::comma(round(mean(df$disbursed_amount, na.rm = TRUE)))
  )
) |>
  gt() |>
  tab_header(title = "Cleaned-data sanity check")
Cleaned-data sanity check
Check Value
Loans (rows) 285
Unique states 4
Loan statuses 3
Earliest disbursal Inf
Latest disbursal -Inf
Median disbursed 5,000,000
Mean disbursed 30,944,386

4.4 Missingness

Code
miss <- df |>
  summarise(across(everything(), ~ sum(is.na(.x)))) |>
  pivot_longer(everything(),
               names_to  = "variable",
               values_to = "n_missing") |>
  dplyr::filter(n_missing > 0) |>
  arrange(desc(n_missing))

if (nrow(miss) == 0) {
  cat("No missing values in the cleaned dataset.\n")
} else {
  miss |>
    gt() |>
    tab_header(title = "Variables with missing values")
}
Variables with missing values
variable n_missing
date_disbursed 285
mtd_tpv 1

5 Technique 1 - Exploratory Data Analysis

5.1 Theory recap

Exploratory data analysis is the diagnostic phase that precedes formal modelling. Anscombe’s quartet illustrates why summary statistics alone are insufficient - four datasets with identical means and variances can have entirely different shapes. EDA combines summary statistics, distribution plots, missing-value mapping and outlier detection to verify the data before any model assumptions are tested.

5.2 Business justification

Before every monthly business review I scan the portfolio along three axes: account activity (zero-TPV merchants requiring sales follow-up), geographic and size concentration, and arrears emergence. As Sales and Business Director, this scan directly informs which merchants I escalate to the CEO and which states I flag for additional sales resource deployment.

5.3 Summary statistics

Code
df |>
  dplyr::select(disbursed_amount, instalment_amount_single,
                terminal_count, tgt) |>
  tbl_summary(
    statistic = list(all_continuous() ~ "{median} ({p25}, {p75})"),
    missing   = "ifany",
    label = list(
      disbursed_amount         ~ "Disbursed amount (NGN)",
      instalment_amount_single ~ "Single instalment (NGN)",
      terminal_count           ~ "POS terminals deployed",
      tgt                      ~ "Monthly TPV target (NGN)"
    )
  ) |>
  as_gt() |>
  tab_header(
    title    = "Loan-level summary statistics",
    subtitle = "Median (Q1, Q3) shown"
  )
Loan-level summary statistics
Median (Q1, Q3) shown
Characteristic N = 2851
Disbursed amount (NGN) 5,000,000 (2,500,000, 19,000,000)
Single instalment (NGN) 814,766 (413,137, 2,289,835)
POS terminals deployed 1.00 (0.00, 1.00)
Monthly TPV target (NGN) 6,466,768 (3,305,093, 17,343,446)
1 Median (Q1, Q3)
Code
df |>
  dplyr::select(disbursed_amount, instalment_amount_single, tgt) |>
  summarise(across(everything(),
                   list(
                     skew = ~ moments::skewness(.x, na.rm = TRUE),
                     kurt = ~ moments::kurtosis(.x, na.rm = TRUE)
                   ))) |>
  pivot_longer(everything(),
               names_to      = c("variable", ".value"),
               names_pattern = "(.+)_(skew|kurt)") |>
  gt() |>
  fmt_number(c(skew, kurt), decimals = 2) |>
  tab_header(title = "Skewness and excess kurtosis")
Skewness and excess kurtosis
variable skew kurt
disbursed_amount 5.47 38.60
instalment_amount_single 7.31 71.11
tgt 5.11 33.83

5.4 Distributions of key numerics

Code
df_long |>
  dplyr::select(disbursed_amount, tpv, pct_achieved,
                current_account_balance) |>
  pivot_longer(everything()) |>
  dplyr::filter(!is.na(value), value > 0) |>
  ggplot(aes(value)) +
  geom_histogram(bins = 35, fill = "#1f78b4", alpha = 0.85) +
  scale_x_log10(labels = label_comma()) +
  facet_wrap(~ name, scales = "free", ncol = 2) +
  labs(
    title    = "Distributions of key numerics (log10 scale)",
    subtitle = "All four financial variables show heavy right skew",
    x = NULL, y = "Count"
  )

5.5 Outlier detection

Code
iqr_flag <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- diff(q)
  x < (q[1] - 1.5 * iqr) | x > (q[2] + 1.5 * iqr)
}

df |>
  transmute(
    `Disbursed amount`  = iqr_flag(disbursed_amount),
    `Single instalment` = iqr_flag(instalment_amount_single),
    `Terminal count`    = iqr_flag(terminal_count)
  ) |>
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE))) |>
  pivot_longer(everything(),
               names_to  = "Variable",
               values_to = "Outliers (n)") |>
  gt() |>
  tab_header(title = "Outliers flagged by the 1.5 x IQR rule")
Outliers flagged by the 1.5 x IQR rule
Variable Outliers (n)
Disbursed amount 37
Single instalment 39
Terminal count 28

5.6 Plain-language interpretation

The portfolio exhibits two notable data quality features. First, several merchants record zero TPV in one or more months - these represent genuinely dormant accounts that require sales intervention and are retained but excluded from log-scale analyses. Second, disbursed amount and TPV target are heavily right-skewed (skewness above 2), driven by a small number of large-ticket loans. This makes the mean misleading as a summary statistic and necessitates log transformation in the regression model. No values were deleted; all outliers represent legitimate business observations.

6 Technique 2 - Data Visualisation

6.1 Theory recap

The grammar of graphics frames a chart as the layered combination of data, aesthetic mappings and geometric objects. Different geometric choices suit different tasks: histograms for distributions, boxplots for group comparisons, scatterplots for two-variable relationships, and heatmaps for two categorical variables crossed. The five plots below are the same chart types used in the monthly performance deck presented to executive management.

6.2 Business justification

These five plots reflect the standard composition of my monthly performance deck: a portfolio-distribution slide, a trajectory-over-time slide, a geographic-comparison slide, a driver-relationship slide and a summary heatmap. Each is designed to communicate to a non-analyst executive audience.

6.3 Plot 1 - Distribution of disbursed amount

Code
df |>
  dplyr::filter(disbursed_amount > 0) |>
  ggplot(aes(disbursed_amount)) +
  geom_histogram(bins = 35, fill = "#2c7fb8", alpha = 0.85) +
  scale_x_log10(labels = label_comma()) +
  labs(
    title    = "Plot 1 - Distribution of disbursed amount (log10 scale)",
    subtitle = "Loan size spans several orders of magnitude",
    x = "Disbursed amount (NGN)", y = "Number of loans"
  )

6.4 Plot 2 - TPV trajectory Feb-Apr by loan status

Code
df_long |>
  dplyr::filter(!is.na(tpv), tpv > 0) |>
  ggplot(aes(month, tpv, group = merchant_id)) +
  geom_line(alpha = 0.18, colour = "#08519c") +
  scale_y_log10(labels = label_comma()) +
  facet_wrap(~ loan_status) +
  labs(
    title    = "Plot 2 - TPV trajectory Feb-Apr 2026 by loan status",
    subtitle = "Each line is one loan; log scale on the y axis",
    x = NULL, y = "Monthly TPV (log scale)"
  )

6.5 Plot 3 - April achievement % by state

Code
df_long |>
  dplyr::filter(month == ymd("2026-04-01"), !is.na(pct_achieved)) |>
  ggplot(aes(state, pct_achieved, fill = state)) +
  geom_boxplot(alpha = 0.85, show.legend = FALSE, outlier.alpha = 0.4) +
  scale_y_continuous(
    labels = label_percent(scale = 1, accuracy = 1),
    breaks = c(0, 100, 200, 400, 800)
  ) +
  coord_cartesian(ylim = c(0, 800)) +
  labs(
    title    = "Plot 3 - April 2026 achievement % by state",
    subtitle = "Y axis capped at 800% for legibility",
    x = NULL, y = "Achievement %"
  )

6.6 Plot 4 - Loan size vs TPV coloured by status

Code
df_long |>
  dplyr::filter(month == ymd("2026-04-01"), tpv > 0, disbursed_amount > 0) |>
  ggplot(aes(disbursed_amount, tpv,
             colour = loan_status, size = terminal_count)) +
  geom_point(alpha = 0.55) +
  scale_x_log10(labels = label_comma()) +
  scale_y_log10(labels = label_comma()) +
  scale_colour_manual(values = c(
    CLOSED     = "#1b9e77",
    ACTIVE     = "#1f78b4",
    IN_ARREARS = "#d95f02"
  )) +
  labs(
    title    = "Plot 4 - Disbursed amount vs April TPV (log-log)",
    subtitle = "Larger loans deliver larger TPV; arrears loans cluster at lower TPV",
    x = "Disbursed amount (NGN)", y = "April TPV (NGN)",
    colour = "Status", size = "Terminals"
  )

6.7 Plot 5 - Mean April achievement by state and status

Code
df_long |>
  dplyr::filter(month == ymd("2026-04-01"), !is.na(pct_achieved)) |>
  group_by(state, loan_status) |>
  summarise(mean_ach = mean(pct_achieved, na.rm = TRUE),
            n = n(), .groups = "drop") |>
  ggplot(aes(state, loan_status, fill = mean_ach)) +
  geom_tile() +
  geom_text(aes(label = sprintf("%.0f%%\n(n=%d)", mean_ach, n)),
            colour = "white", size = 3.2) +
  scale_fill_viridis_c(option = "C",
                       labels = label_percent(scale = 1)) +
  labs(
    title = "Plot 5 - Mean April achievement % by state and status",
    x = NULL, y = NULL, fill = "Mean achievement"
  )

6.8 Plain-language interpretation

Plot 1 confirms the heavy right skew in loan sizes - most merchants received smaller facilities but a tail of large loans dominates portfolio value. Plot 2 shows ACTIVE loans maintaining stable TPV trajectories while IN_ARREARS merchants show declining patterns - a visual early-warning signal. Plot 3 reveals meaningful state-level differences in achievement %. Plot 4 confirms the positive relationship between loan size and TPV, with arrears loans clustering at lower TPV even after controlling visually for size. Plot 5 identifies the specific state-status combinations driving underperformance, directly informing where the sales team should concentrate intervention.

7 Technique 3 - Hypothesis Testing

7.1 Theory recap

A hypothesis test formalises a comparison between an assumption about the world (H0) and an alternative (H1). The test produces a p-value - the probability of observing data as extreme as ours if H0 were true. A p-value below the significance level (alpha = 0.05) leads to rejection of H0. Effect sizes measure the practical magnitude of the difference, which p-values alone cannot convey.

7.2 Business justification

The two hypotheses correspond to live debates in the sales function: whether achievement % genuinely differs across states (determining whether we apply uniform or differentiated sales strategies), and whether arrears are randomly distributed across states (determining whether credit policy needs geographic recalibration).

7.3 H1 - Does mean April achievement % differ across states?

H0: Mean achievement % is identical across all states. H1: At least one state has a different mean achievement %. Alpha = 0.05

Code
apr <- df_long |>
  dplyr::filter(month == ymd("2026-04-01"), !is.na(pct_achieved))

apr |>
  group_by(state) |>
  summarise(
    n            = n(),
    mean_ach_pct = round(mean(pct_achieved), 1),
    sd_ach_pct   = round(sd(pct_achieved), 1),
    shapiro_p    = if (n() >= 3 & n() <= 5000)
                     round(shapiro.test(pct_achieved)$p.value, 4)
                   else NA_real_,
    .groups = "drop"
  ) |>
  gt() |>
  tab_header(
    title    = "Group summary and Shapiro-Wilk normality by state",
    subtitle = "Small Shapiro p means reject normality - use non-parametric test"
  )
Group summary and Shapiro-Wilk normality by state
Small Shapiro p means reject normality - use non-parametric test
state n mean_ach_pct sd_ach_pct shapiro_p
Lagos 70 183.1 226.6 0
Ogun 9 102.1 211.0 0
Osun 45 152.3 191.8 0
Oyo 161 137.8 415.3 0
Code
leveneTest(pct_achieved ~ state, data = apr)
Code
kw <- kruskal.test(pct_achieved ~ state, data = apr)
kw

    Kruskal-Wallis rank sum test

data:  pct_achieved by state
Kruskal-Wallis chi-squared = 21.705, df = 3, p-value = 7.512e-05
Code
effectsize::rank_epsilon_squared(pct_achieved ~ state, data = apr)
Code
apr |>
  rstatix::dunn_test(pct_achieved ~ state,
                     p.adjust.method = "bonferroni") |>
  dplyr::select(group1, group2, n1, n2, statistic, p, p.adj, p.adj.signif) |>
  gt() |>
  tab_header(title = "Post-hoc pairwise Dunn tests (Bonferroni adjusted)")
Post-hoc pairwise Dunn tests (Bonferroni adjusted)
group1 group2 n1 n2 statistic p p.adj p.adj.signif
Lagos Ogun 70 9 -1.1343009 0.2566683431 1.0000000000 ns
Lagos Osun 70 45 0.2605067 0.7944729729 1.0000000000 ns
Lagos Oyo 70 161 -3.8243161 0.0001311356 0.0007868134 ***
Ogun Osun 9 45 1.2363389 0.2163326168 1.0000000000 ns
Ogun Oyo 9 161 -0.4257965 0.6702561534 1.0000000000 ns
Osun Oyo 45 161 -3.5421959 0.0003968107 0.0023808641 **

7.4 H2 - Is loan status independent of state?

H0: State and loan status are statistically independent. H1: State and loan status are associated. Alpha = 0.05

Code
tbl <- with(df, table(state, loan_status))
tbl
       loan_status
state   ACTIVE CLOSED IN_ARREARS
  Lagos     31     33          6
  Ogun       3      4          2
  Osun      18     20          7
  Oyo       48     85         28
Code
chisq_result <- stats::chisq.test(tbl)
chisq_result

    Pearson's Chi-squared test

data:  tbl
X-squared = 6.8278, df = 6, p-value = 0.3371
Code
effectsize::cramers_v(tbl)

7.5 Plain-language interpretation

H1: The Kruskal-Wallis test was chosen because the Shapiro-Wilk test confirmed non-normality within several state groups. Where the result is significant (p < 0.05), the post-hoc Dunn test with Bonferroni correction identifies which specific state pairs drive the difference - justifying a state-differentiated sales resource allocation strategy.

H2: The chi-squared test examines whether arrears are concentrated in specific states or spread randomly. A significant result with a non-trivial Cramer’s V would indicate that state-level factors are driving arrears patterns, warranting a geographic review of credit underwriting policy.

8 Technique 4 - Correlation Analysis

8.1 Theory recap

Correlation quantifies the strength and direction of a relationship between two variables. Pearson’s r measures linear association; Spearman’s rho measures monotonic association on ranks and is robust to skew - making it more appropriate for the heavily right-skewed financial variables here. Partial correlation isolates the relationship between two variables after removing the effect of a control variable. None of these methods establish causation.

8.2 Business justification

The key operational question is whether the inputs we control - terminal count and loan size - genuinely move the outcomes we are measured on (TPV and achievement %), or whether apparent relationships are confounded by underwriting decisions. This directly informs the terminal deployment and upsizing strategy.

8.3 Analysis

Code
numeric_vars <- df_long |>
  dplyr::filter(month == ymd("2026-04-01")) |>
  dplyr::select(disbursed_amount, instalment_amount_single, terminal_count,
                tpv, pct_achieved, current_account_balance) |>
  drop_na()

spearman_mat <- cor(numeric_vars, method = "spearman")
Code
ggcorrplot(spearman_mat,
           type   = "lower",
           lab    = TRUE,
           lab_size = 3,
           digits = 2,
           colors = c("#b2182b", "white", "#2166ac"),
           title  = "Spearman correlation heatmap - April 2026 snapshot")

Code
spearman_mat |>
  as.data.frame() |>
  rownames_to_column("variable") |>
  gt() |>
  fmt_number(where(is.numeric), decimals = 2) |>
  tab_header(title = "Spearman correlation matrix")
Spearman correlation matrix
variable disbursed_amount instalment_amount_single terminal_count tpv pct_achieved current_account_balance
disbursed_amount 1.00 0.97 0.49 0.68 0.42 0.44
instalment_amount_single 0.97 1.00 0.51 0.69 0.43 0.44
terminal_count 0.49 0.51 1.00 0.69 0.62 0.56
tpv 0.68 0.69 0.69 1.00 0.92 0.70
pct_achieved 0.42 0.43 0.62 0.92 1.00 0.68
current_account_balance 0.44 0.44 0.56 0.70 0.68 1.00
Code
if (requireNamespace("ppcor", quietly = TRUE)) {
  tryCatch(
    ppcor::pcor.test(
      numeric_vars$terminal_count,
      numeric_vars$tpv,
      numeric_vars$disbursed_amount,
      method = "spearman"
    ),
    error = function(e) message("Partial correlation skipped: ", e$message)
  )
} else {
  message("Package ppcor not installed - skipping partial correlation.")
}

8.4 Plain-language interpretation

The three strongest correlations and their business implications:

  1. Disbursed amount and TPV: The strongest positive relationship - larger loans are associated with higher transaction volumes. Raw TPV comparisons across merchants are misleading without controlling for loan size.

  2. Current account balance and TPV: Merchants with healthier balances process more transactions - consistent with financial health predicting commercial performance. Account balance monitoring is a viable early-warning indicator.

  3. Terminal count and TPV: The partial correlation result (controlling for disbursed amount) indicates whether additional terminal deployment drives TPV independently of loan size - the key commercial question for the sales team.

Correlation does not imply causation. Deploying more terminals does not automatically cause higher TPV if the merchant lacks customer volume to use them. The regression model isolates the conditional effect of each variable.

9 Technique 5 - Regression

9.1 Theory recap

Ordinary least squares regression models the conditional mean of a continuous outcome as a linear function of predictors. Each coefficient estimates the change in the outcome for a one-unit increase in that predictor, holding others constant. Where heavy skew is present, logarithmic transformation linearises the relationship and stabilises variance. Logistic regression performs the equivalent task for a binary outcome.

9.2 Business justification

The OLS model answers the headline question: holding loan size, geography, status and account balance constant, which factors materially move April TPV and by how much? Each significant coefficient translates into a concrete sales or credit recommendation. The logistic model identifies the strongest predictors of arrears risk, informing underwriting policy.

9.3 Primary model - OLS on log April TPV

Code
model_data <- df_long |>
  dplyr::filter(
    month == ymd("2026-04-01"),
    tpv > 0,
    disbursed_amount > 0
  ) |>
  mutate(
    log_tpv     = log(tpv),
    log_disb    = log(disbursed_amount),
    log_cab     = log(current_account_balance + 1),
    state       = droplevels(factor(state)),
    loan_status = droplevels(factor(loan_status))
  )

ols <- lm(
  log_tpv ~ log_disb + terminal_count + state + loan_status + log_cab,
  data = model_data
)
Code
broom::tidy(ols, conf.int = TRUE) |>
  mutate(
    across(where(is.numeric), ~ round(.x, 3)),
    signif = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01  ~ "**",
      p.value < 0.05  ~ "*",
      p.value < 0.1   ~ ".",
      TRUE            ~ ""
    )
  ) |>
  gt() |>
  tab_header(
    title    = "OLS coefficients - log(April TPV) model",
    subtitle = "Signif. codes: *** <.001  ** <.01  * <.05"
  )
OLS coefficients - log(April TPV) model
Signif. codes: *** <.001 ** <.01 * <.05
term estimate std.error statistic p.value conf.low conf.high signif
(Intercept) -0.606 1.613 -0.376 0.707 -3.787 2.574
log_disb 0.872 0.098 8.905 0.000 0.679 1.066 ***
terminal_count -0.008 0.022 -0.376 0.707 -0.053 0.036
stateOgun 0.144 0.645 0.224 0.823 -1.126 1.415
stateOsun 0.296 0.352 0.842 0.401 -0.397 0.989
stateOyo 0.050 0.319 0.158 0.875 -0.578 0.679
loan_statusACTIVE -0.211 0.242 -0.870 0.385 -0.688 0.267
loan_statusIN_ARREARS -0.818 0.356 -2.298 0.023 -1.519 -0.116 *
log_cab 0.256 0.029 8.779 0.000 0.199 0.314 ***
Code
broom::glance(ols) |>
  gt() |>
  fmt_number(where(is.numeric), decimals = 3) |>
  tab_header(title = "Model fit statistics")
Model fit statistics
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.674 0.662 1.625 55.589 0.000 8.000 −421.999 863.997 898.114 567.712 215.000 224.000
Code
par(mfrow = c(2, 2))
plot(ols)

Code
par(mfrow = c(1, 1))
Code
car::vif(ols)
                   GVIF Df GVIF^(1/(2*Df))
log_disb       1.869563  1        1.367320
terminal_count 1.327832  1        1.152316
state          1.567696  3        1.077814
loan_status    1.338376  2        1.075585
log_cab        1.721911  1        1.312216
Code
lmtest::bptest(ols)

    studentized Breusch-Pagan test

data:  ols
BP = 9.0076, df = 8, p-value = 0.3417
Code
lmtest::coeftest(ols, vcov. = sandwich::vcovHC(ols, type = "HC3"))

t test of coefficients:

                        Estimate Std. Error t value  Pr(>|t|)    
(Intercept)           -0.6063500  1.8130634 -0.3344    0.7384    
log_disb               0.8724206  0.1019082  8.5608 2.151e-15 ***
terminal_count        -0.0084184  0.0133284 -0.6316    0.5283    
stateOgun              0.1444573  0.8604873  0.1679    0.8668    
stateOsun              0.2960103  0.3533512  0.8377    0.4031    
stateOyo               0.0503469  0.3335239  0.1510    0.8802    
loan_statusACTIVE     -0.2107422  0.2339428 -0.9008    0.3687    
loan_statusIN_ARREARS -0.8177947  0.4125133 -1.9825    0.0487 *  
log_cab                0.2561102  0.0299305  8.5568 2.208e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

9.4 Secondary model - Logistic on P(IN_ARREARS)

Code
logit_data <- df_long |>
  dplyr::filter(
    month == ymd("2026-04-01"),
    disbursed_amount > 0
  ) |>
  mutate(
    arrears  = as.integer(loan_status == "IN_ARREARS"),
    log_disb = log(disbursed_amount),
    log_cab  = log(current_account_balance + 1),
    state    = droplevels(factor(state))
  )

logit <- glm(
  arrears ~ log_disb + terminal_count + state + log_cab,
  data   = logit_data,
  family = binomial()
)

broom::tidy(logit, exponentiate = TRUE, conf.int = TRUE) |>
  mutate(across(where(is.numeric), ~ round(.x, 3))) |>
  gt() |>
  tab_header(
    title    = "Logistic regression - odds ratios for P(IN_ARREARS)",
    subtitle = "Coefficients exponentiated to odds-ratio scale"
  )
Logistic regression - odds ratios for P(IN_ARREARS)
Coefficients exponentiated to odds-ratio scale
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.292 2.730 -0.452 0.652 0.001 65.663
log_disb 1.085 0.167 0.490 0.624 0.776 1.500
terminal_count 0.938 0.245 -0.259 0.795 0.516 1.164
stateOgun 1.535 1.037 0.414 0.679 0.165 11.006
stateOsun 2.675 0.745 1.320 0.187 0.630 12.063
stateOyo 1.257 0.611 0.374 0.708 0.398 4.471
log_cab 0.673 0.066 -5.957 0.000 0.586 0.762
Code
pred    <- predict(logit, type = "response")
roc_obj <- pROC::roc(logit_data$arrears, pred, quiet = TRUE)
cat("AUC =", round(pROC::auc(roc_obj), 3), "\n")
AUC = 0.85 

9.5 Plain-language interpretation

OLS model (R-squared = 0.67): The model explains 67% of variation in April TPV - strong performance for a behavioural outcome. Key findings:

  • log_disb (*): A 1% increase in disbursed amount is associated with a 0.87% increase in April TPV, after controlling for all other factors. Merchants with larger facilities consistently generate more transaction volume - justifying the underwriting team’s use of loan size as a primary TPV predictor.

  • log_cab (*): Current account balance is the second strongest predictor. A merchant whose balance is 10x higher than another processes materially more TPV, holding loan size constant. This confirms that account balance monitoring should be a standing item in the monthly sales review - a declining balance is an early warning of future TPV underperformance.

  • **loan_statusIN_ARREARS (*): IN_ARREARS merchants generate significantly lower TPV than CLOSED merchants (the reference group), even after controlling for loan size and balance. This quantifies the commercial cost of arrears beyond the credit loss itself - arrears merchants also underperform on revenue generation.

  • terminal_count and state: Neither reaches statistical significance after controlling for loan size and balance, suggesting that the apparent state-level and terminal-count effects seen in the visualisations are largely explained by differences in loan size and financial health across those groups.

Logistic model: The AUC measures the model’s ability to discriminate between arrears and non-arrears loans. The odds ratios identify which variables increase or decrease the probability of arrears - directly actionable by the credit underwriting team when setting loan conditions.

10 Integrated Findings

10.1 How the five analyses connect

The five analyses form a coherent analytical chain. EDA established the data quality baseline and identified the heavy right skew that governed all subsequent technique choices - log transformation in regression and Spearman correlation over Pearson. Visualisation surfaced the commercially important patterns: declining TPV trajectories for arrears merchants, state-level performance gaps, and the positive size-TPV relationship. Hypothesis testing confirmed that state-level differences are statistically significant and not sampling noise - providing the evidential standard needed to justify a differentiated geographic sales strategy. Correlation identified the strongest pairwise relationships and confirmed that disbursed amount and current account balance are the dominant drivers. Regression isolated the conditional effect of each driver: log_disb and log_cab together explain 67% of TPV variance, while terminal count and state lose significance once these two are controlled for.

10.2 The single actionable recommendation

On the basis of these five analyses, I recommend that in the next monthly business review the sales team implement a two-tier merchant intervention protocol. Tier 1 - merchants with disbursed amounts above the portfolio median AND current account balances that have declined for two consecutive months - should receive an immediate sales visit to diagnose TPV shortfall and be assessed for early-arrears referral to the credit team. Tier 2 - merchants with high disbursed amounts and stable or growing balances - should be prioritised for upsizing conversations, as the regression confirms these are the merchants whose TPV scales most reliably with additional facility size. This protocol operationalises the two significant regression coefficients directly into the monthly sales review process.

11 Limitations and Further Work

  • Three-month window: February-April 2026 is too short to separate seasonality from underlying trend.
  • Repeat-loan rows: A small number of merchants appear more than once. The regression treats these as independent observations; clustered standard errors at the merchant level would be more conservative.
  • months_on_book: This variable was all NA after reshaping due to missing disbursement date parsing for some records, and was excluded from the regression. A cleaner date field would allow vintage effects to be modelled.
  • Unobserved heterogeneity: Business sector, channel mix and merchant tenure are absent from the extract.

Further work: With more time I would extend to a 12-month panel with merchant fixed effects, add underwriting variables, reserve a holdout sample to evaluate the logistic model out-of-sample, and apply a survival model to account for the censored nature of active loans.

12 References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making - from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.6). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., Francois, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Muller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., & Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Onosade, G. (2026). Merchant POS loan performance dataset - February to April 2026 [Dataset]. Collected from loan management and merchant performance system, Lagos, Nigeria. Data available on request from the author.

13 Appendix - AI Usage Statement

Claude (Anthropic, claude.ai) was used to assist this study in two ways. First, it helped audit the structure of the data extract, identify column-naming discrepancies between the template and the actual CSV, and generate corrected R code for the data cleaning and reshaping pipeline. Second, it drafted boilerplate code for the visualisation, hypothesis testing, correlation and regression sections, which I reviewed and verified against my own understanding of the data and business context. All analytical decisions - the choice of case study, the hypotheses tested, the model specification, the interpretation of every result, and the integrated recommendation - are my own, made in line with my professional judgement as Sales and Business Director. The AI was used as a coding and editing assistant; no AI-generated interpretation appears in this document without my independent review and professional validation.