Predicting Investor Redemption and Segmenting the Book: A Machine Learning Case Study from Nigerian Asset Management

Author

Segun Iyanda

Published

May 18, 2026


1. Executive Summary

This case study applies five machine learning and statistical techniques to investor data drawn from Moneytor V10, the portfolio management system of a Nigerian life insurance and asset management organisation. Using a cross-sectional dataset of 9,851 investor accounts and a monthly time-series spanning 2016–2026, the analysis addresses two core business problems: identifying which investors are at risk of redeeming (liquidating) their fund units, and uncovering natural investor segments to support targeted relationship management.

A Random Forest classifier achieved an out-of-sample AUC of approximately 0.92, substantially outperforming logistic regression (AUC ≈ 0.79). SHAP analysis confirmed that prior redemption behaviour (TotalRedeemed), account inactivity (DaysSinceLastTxn), and asset value are the dominant drivers of redemption risk. K-Means clustering (k = 4) revealed four economically meaningful investor archetypes — High-Value Active, Emerging Accumulators, Dormant Low-Balance, and Institutional Contributors — whose distinct needs should shape product design and servicing intensity. PCA validated that two latent dimensions (transaction activity and portfolio size) explain the majority of variation. Finally, ARIMA time-series forecasting projects a continued upward trend in monthly subscriptions over the next three months, with wide prediction intervals that reflect post-pandemic volatility. Together, these findings support a data-driven retention strategy and a segmented growth plan for the distribution team.


2. Professional Disclosure

Job Title: Head Information Technology, Asset Management Division

The author holds a head information technology role within an asset management firm regulated by the Securities and Exchange Commission (SEC) Nigeria. In this capacity, five decision contexts motivated the techniques employed in this study:

Classification (redemption prediction): The fund operations team tracks redemption risk daily. A predictive model allows the relationship management desk to flag accounts with a high probability of redemption in the coming quarter, enabling proactive retention calls and targeted reinvestment incentives before units are liquidated.

SHAP explainability: Nigeria’s investment advisory regulations require that advice be explainable and in the client’s best interest. SHAP values allow the compliance team to articulate, for any flagged account, exactly which factors drove the risk score — satisfying audit requirements and supporting adviser coaching.

Customer segmentation (K-Means): The distribution team allocates relationship managers across roughly ten thousand accounts. Segment labels derived from clustering provide a principled basis for tiering service intensity: institutional and high-value accounts receive dedicated RMs, while dormant low-balance accounts are served through digital channels and automated nudges.

Dimensionality reduction (PCA): The risk committee uses a condensed scorecard, not raw feature tables, when reviewing the portfolio book quarterly. PCA-derived principal component scores reduce 25+ features to two interpretable axes (activity and wealth), making board-level reporting concise and reproducible.

Time-series forecasting (ARIMA): Treasury operations must pre-position liquidity to meet net redemptions each month. A three-month subscription forecast feeds the firm’s asset-liability management (ALM) model, reducing the risk of forced asset sales to meet unexpected redemption waves.


3. Data Collection & Sampling

Source: IBS Solution - Moneytor V10 — a proprietary portfolio management and unit-registry system used by the organisation for recording investor transactions, valuations, and account lifecycle events.

Data extraction method: A structured SQL extract was executed against the production database replica (read-only), producing two flat files: a cross-sectional investor snapshot (classification_dataset.csv) and a monthly aggregated transaction summary (timeseries_dataset.csv). No real-time API calls were made; the extract represents a point-in-time snapshot.

Sampling frame and population: The classification dataset covers all active and lapsed accounts opened between the system inception date and December 2025 — 9,851 investor records after removing test/internal accounts. No sub-sampling was applied; this is effectively a census of the live book. The time-series dataset contains one row per calendar month from April 2016 to March 2026 (approximately 120 months), capturing the full operational history of the fund.

Time period: Cross-sectional snapshot as at 31 December 2025; time-series from April 2016 to March 2026.

Ethical statement: All personally identifiable information (PII) — including investor names, national identity numbers, bank account details, and contact information — was removed at source before the SQL extract was produced. CustID values are surrogate keys generated by the system and cannot be re-linked to natural persons without access to the production key-mapping table, which was not extracted. Gender and MaritalStatus fields, where present, are used solely as analytical features and are not used to make individual-level credit or investment decisions that could constitute unlawful discrimination. This study complies with Nigeria’s Data Protection Act 2023 and the organisation’s internal data governance policy. No data has been shared externally or stored outside encrypted institutional storage.


4. Data Description & Exploratory Analysis

4.1 Load and Inspect Data

Code
# ── Load datasets ─────────────────────────────────────────────────────────────
clf_raw <- read_csv("classification_dataset.csv",
                    na = c("", "NA", "NULL", "N/A", " "),
                    show_col_types = FALSE)

ts_raw  <- read_csv("timeseries_dataset.csv",
                    na = c("", "NA", "NULL", "N/A", " "),
                    show_col_types = FALSE)

# Rename to match spec names for clarity
names(clf_raw)[1:29] <- c(
  "CustID","Gender","MaritalStatus","EmploymentType","AnnualIncome",
  "AgeYears","ID_State","ID_Occupation","ID_RiskLevel","SensitivityLevel",
  "ID_PortfolioContributorType","ReInvestDividend","NoOfUnits","InitalAmount",
  "AccountAgeMonths","TxnCount","TotalSubscribed","TotalRedeemed",
  "AvgTxnAmount","DaysSinceLastTxn","TxnSpanMonths","DistinctTxnTypes",
  "TotalAssetValue","TotalCostOfAsset","TotalGainLoss","AvgMarketPrice",
  "DistinctAssetClasses","ReturnOnCost","HasRedeemed"
)

cat("Classification dataset:", nrow(clf_raw), "rows,", ncol(clf_raw), "columns\n")
Classification dataset: 9852 rows, 29 columns
Code
cat("Time-series dataset   :", nrow(ts_raw),  "rows,", ncol(ts_raw),  "columns\n")
Time-series dataset   : 117 rows, 8 columns
Code
# Dimensions table
tibble(
  Dataset = c("Classification", "Time-Series"),
  Rows    = c(nrow(clf_raw), nrow(ts_raw)),
  Columns = c(ncol(clf_raw), ncol(ts_raw))
) |>
  kbl(caption = "Dataset Dimensions", booktabs = TRUE) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)
Dataset Dimensions
Dataset Rows Columns
Classification 9852 29
Time-Series 117 8
Code
# ── Summary statistics for numeric columns ────────────────────────────────────
num_vars <- clf_raw |>
  select(where(is.numeric)) |>
  select(-HasRedeemed)

summary_tbl <- num_vars |>
  pivot_longer(everything(), names_to = "Variable") |>
  group_by(Variable) |>
  summarise(
    N       = sum(!is.na(value)),
    Missing = sum(is.na(value)),
    Mean    = mean(value, na.rm = TRUE),
    SD      = sd(value,   na.rm = TRUE),
    Min     = min(value,  na.rm = TRUE),
    Median  = median(value, na.rm = TRUE),
    Max     = max(value,  na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(across(where(is.double), \(x) round(x, 2)))

summary_tbl |>
  kbl(caption = "Summary Statistics — Numeric Features (Classification Dataset)",
      booktabs = TRUE, format.args = list(big.mark = ",")) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                font_size = 11) |>
  scroll_box(height = "380px")
Summary Statistics — Numeric Features (Classification Dataset)
Variable N Missing Mean SD Min Median Max
AccountAgeMonths 9,852 0 23.09 2.927000e+01 0 13.00 2.560000e+02
AgeYears 198 9,654 45.74 1.317000e+01 6 44.00 7.400000e+01
AnnualIncome 6,289 3,563 0.00 0.000000e+00 0 0.00 0.000000e+00
AvgMarketPrice 9,852 0 0.40 5.950000e+00 0 0.00 1.870000e+02
AvgTxnAmount 9,852 0 546,093.29 5.615258e+06 -128,000,000 0.00 3.519926e+08
DaysSinceLastTxn 8,486 1,366 247.06 5.092400e+02 0 47.00 2.926000e+03
DistinctAssetClasses 9,852 0 0.01 1.200000e-01 0 0.00 6.000000e+00
DistinctTxnTypes 9,852 0 1.33 7.800000e-01 0 1.00 4.000000e+00
ID_Occupation 6,289 3,563 0.00 3.000000e-02 0 0.00 2.000000e+00
ID_PortfolioContributorType 754 9,098 0.51 5.200000e-01 0 0.00 2.000000e+00
ID_RiskLevel 5,626 4,226 0.00 0.000000e+00 0 0.00 0.000000e+00
InitalAmount 6,289 3,563 3,773,489.19 7.287028e+07 0 0.00 4.793517e+09
NoOfUnits 6,289 3,563 0.00 0.000000e+00 0 0.00 0.000000e+00
ReInvestDividend 6,289 3,563 0.64 4.800000e-01 0 1.00 1.000000e+00
ReturnOnCost 74 9,778 1.13 1.400000e+00 0 1.03 1.294000e+01
TotalAssetValue 9,852 0 2,704,149,613.04 1.118909e+11 0 0.00 9.840240e+12
TotalCostOfAsset 9,852 0 2,728,668,237.15 1.133704e+11 0 0.00 9.989140e+12
TotalGainLoss 9,852 0 -46,045,789.67 7.431441e+09 -722,445,000,000 0.00 5.410906e+10
TotalRedeemed 9,852 0 -69,205,913.85 1.237106e+09 -82,319,092,311 0.00 0.000000e+00
TotalSubscribed 9,852 0 77,072,418.06 1.520187e+09 0 207,579.93 1.035000e+11
TxnCount 9,852 0 13.61 2.487000e+01 0 6.00 9.570000e+02
TxnSpanMonths 8,486 1,366 12.87 1.564000e+01 0 8.00 9.800000e+01
Code
# ── Missing value analysis ────────────────────────────────────────────────────
missing_tbl <- clf_raw |>
  summarise(across(everything(),
                   list(n_miss = \(x) sum(is.na(x)),
                        pct    = \(x) round(100 * mean(is.na(x)), 1)))) |>
  pivot_longer(everything(),
               names_to  = c("Variable", ".value"),
               names_sep = "_(?=[^_]+$)") |>
  filter(miss > 0) |>
  arrange(desc(miss))

missing_tbl |>
  kbl(caption = "Missing Values by Feature", booktabs = TRUE,
      col.names = c("Variable", "N Missing", "% Missing")) |>
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = FALSE) |>
  row_spec(which(missing_tbl$pct > 20), color = "white", background = "#D55E00")
Missing Values by Feature
Variable N Missing % Missing
Gender_n 9852 NA
MaritalStatus_n 9852 NA
EmploymentType_n 9852 NA
ReturnOnCost_n 9778 NA
AgeYears_n 9654 NA
ID_State_n 9226 NA
ID_PortfolioContributorType_n 9098 NA
SensitivityLevel_n 8412 NA
ID_RiskLevel_n 4226 NA
AnnualIncome_n 3563 NA
ID_Occupation_n 3563 NA
ReInvestDividend_n 3563 NA
NoOfUnits_n 3563 NA
InitalAmount_n 3563 NA
DaysSinceLastTxn_n 1366 NA
TxnSpanMonths_n 1366 NA
Code
# ── Target variable distribution ──────────────────────────────────────────────
clf_raw |>
  count(HasRedeemed) |>
  mutate(
    Label = if_else(HasRedeemed == 1, "Has Redeemed", "Has Not Redeemed"),
    Pct   = round(100 * n / sum(n), 1)
  ) |>
  ggplot(aes(x = Label, y = n, fill = Label)) +
  geom_col(width = 0.6, colour = "white") +
  geom_text(aes(label = paste0(Pct, "%\n(n = ", scales::comma(n), ")")),
            vjust = -0.3, size = 3.8) +
  scale_fill_manual(values = LBS_COLS[1:2]) +
  scale_y_continuous(labels = scales::comma, expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Target Variable Distribution",
       subtitle = "HasRedeemed — 36% positive rate indicates moderate class imbalance",
       x = NULL, y = "Count", fill = NULL) +
  theme_lbs(legend.position = "none")

Code
# ── Key numeric distributions ─────────────────────────────────────────────────
plot_vars <- c("AnnualIncome","TotalAssetValue","TxnCount",
               "DaysSinceLastTxn","AccountAgeMonths","ReturnOnCost")

clf_raw |>
  select(all_of(plot_vars), HasRedeemed) |>
  mutate(HasRedeemed = factor(HasRedeemed,
                              labels = c("Not Redeemed","Redeemed"))) |>
  pivot_longer(-HasRedeemed, names_to = "Variable") |>
  mutate(value = log1p(abs(value))) |>                  # log-scale for skew
  ggplot(aes(x = value, fill = HasRedeemed)) +
  geom_density(alpha = 0.55, colour = NA) +
  facet_wrap(~Variable, scales = "free", ncol = 3) +
  scale_fill_manual(values = LBS_COLS[1:2]) +
  labs(title = "Feature Distributions by Redemption Status (log1p scale)",
       x = "log1p(value)", y = "Density", fill = NULL) +
  theme_lbs()

Code
# ── Categorical feature profiles ──────────────────────────────────────────────
cat_vars <- c("Gender","MaritalStatus","EmploymentType","SensitivityLevel",
              "ReInvestDividend")

cat_plots <- map(cat_vars, \(v) {
  clf_raw |>
    filter(!is.na(.data[[v]])) |>
    mutate(HasRedeemed = factor(HasRedeemed,
                                levels = c(0, 1),
                                labels = c("Not Redeemed","Redeemed"))) |>
    count(.data[[v]], HasRedeemed) |>
    group_by(.data[[v]]) |>
    mutate(pct = n / sum(n)) |>
    ungroup() |>
    ggplot(aes(x = .data[[v]], y = pct, fill = HasRedeemed)) +
    geom_col(position = "fill", colour = "white", width = 0.7) +
    scale_y_continuous(labels = percent_format()) +
    scale_fill_manual(values = LBS_COLS[1:2]) +
    labs(title = v, x = NULL, y = NULL, fill = NULL) +
    theme_lbs(axis.text.x = element_text(angle = 30, hjust = 1),
              legend.position = "none")
})

wrap_plots(cat_plots, ncol = 3) +
  plot_annotation(
    title = "Redemption Rate by Categorical Feature",
    theme = theme(plot.title = element_text(face = "bold", size = 14))
  )

Code
# ── Correlation heatmap ───────────────────────────────────────────────────────
corr_mat <- clf_raw |>
  select(where(is.numeric)) |>
  cor(use = "pairwise.complete.obs")

# Plot top correlated pairs with HasRedeemed
corr_df <- corr_mat |>
  as.data.frame() |>
  rownames_to_column("Var1") |>
  pivot_longer(-Var1, names_to = "Var2", values_to = "r") |>
  filter(Var1 != Var2)

corr_df |>
  filter(Var2 == "HasRedeemed") |>
  arrange(desc(abs(r))) |>
  slice_head(n = 15) |>
  ggplot(aes(x = reorder(Var1, abs(r)), y = r,
             fill = r > 0)) +
  geom_col(width = 0.7, colour = "white") +
  coord_flip() +
  scale_fill_manual(values = LBS_COLS[1:2],
                    labels = c("Negative","Positive")) +
  labs(title = "Top 15 Features Correlated with HasRedeemed",
       x = NULL, y = "Pearson r", fill = "Direction") +
  theme_lbs()

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

LBS_COLS = ["#0072B2","#E69F00","#009E73","#CC79A7",
            "#56B4E9","#D55E00","#F0E442","#999999"]

clf_raw = pd.read_csv("classification_dataset.csv", na_values=["","NA","NULL","N/A"," "])
ts_raw  = pd.read_csv("timeseries_dataset.csv", na_values=["","NA","NULL","N/A"," "])

clf_raw.columns = [
    "CustID","Gender","MaritalStatus","EmploymentType","AnnualIncome",
    "AgeYears","ID_State","ID_Occupation","ID_RiskLevel","SensitivityLevel",
    "ID_PortfolioContributorType","ReInvestDividend","NoOfUnits","InitalAmount",
    "AccountAgeMonths","TxnCount","TotalSubscribed","TotalRedeemed",
    "AvgTxnAmount","DaysSinceLastTxn","TxnSpanMonths","DistinctTxnTypes",
    "TotalAssetValue","TotalCostOfAsset","TotalGainLoss","AvgMarketPrice",
    "DistinctAssetClasses","ReturnOnCost","HasRedeemed"
]

print(f"Classification : {clf_raw.shape[0]:,} rows × {clf_raw.shape[1]} columns")
Classification : 9,852 rows × 29 columns
Code
print(f"Time-series    : {ts_raw.shape[0]:,} rows × {ts_raw.shape[1]} columns")
Time-series    : 117 rows × 8 columns
Code
# Summary stats
num_cols = clf_raw.select_dtypes(include="number").drop(columns=["HasRedeemed"],
                                                         errors="ignore")
(num_cols.describe()
         .round(2)
         .T
         .style
         .background_gradient(axis=1, cmap="Blues")
         .set_caption("Summary Statistics — Numeric Features")
         .format("{:,.2f}"))
Table 1: Summary Statistics — Numeric Features
  count mean std min 25% 50% 75% max
Gender 0.00 nan nan nan nan nan nan nan
MaritalStatus 0.00 nan nan nan nan nan nan nan
AnnualIncome 6,289.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
AgeYears 198.00 45.74 13.17 6.00 38.00 44.00 55.00 74.00
ID_Occupation 6,289.00 0.00 0.03 0.00 0.00 0.00 0.00 2.00
ID_RiskLevel 5,626.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
ID_PortfolioContributorType 754.00 0.51 0.52 0.00 0.00 0.00 1.00 2.00
ReInvestDividend 6,289.00 0.64 0.48 0.00 0.00 1.00 1.00 1.00
NoOfUnits 6,289.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
InitalAmount 6,289.00 3,773,489.19 72,870,278.55 0.00 0.00 0.00 100,000.00 4,793,517,363.00
AccountAgeMonths 9,852.00 23.09 29.27 0.00 5.00 13.00 24.00 256.00
TxnCount 9,852.00 13.61 24.87 0.00 2.00 6.00 19.00 957.00
TotalSubscribed 9,852.00 77,072,418.06 1,520,186,939.45 0.00 0.00 207,579.93 3,716,117.66 103,500,000,000.00
TotalRedeemed 9,852.00 -69,205,913.85 1,237,105,832.10 -82,319,092,311.00 -849,878.70 0.00 0.00 0.00
AvgTxnAmount 9,852.00 546,093.29 5,615,257.64 -128,000,000.00 0.00 0.00 50,000.00 351,992,568.10
DaysSinceLastTxn 8,486.00 247.06 509.24 0.00 20.00 47.00 133.00 2,926.00
TxnSpanMonths 8,486.00 12.87 15.64 0.00 2.00 8.00 15.00 98.00
DistinctTxnTypes 9,852.00 1.33 0.78 0.00 1.00 1.00 2.00 4.00
TotalAssetValue 9,852.00 2,704,149,613.04 111,890,946,127.18 0.00 0.00 0.00 0.00 9,840,240,000,000.00
TotalCostOfAsset 9,852.00 2,728,668,237.16 113,370,350,538.43 0.00 0.00 0.00 0.00 9,989,140,000,000.00
TotalGainLoss 9,852.00 -46,045,789.67 7,431,441,484.63 -722,445,000,000.00 0.00 0.00 0.00 54,109,064,493.00
AvgMarketPrice 9,852.00 0.40 5.95 0.00 0.00 0.00 0.00 187.00
DistinctAssetClasses 9,852.00 0.01 0.12 0.00 0.00 0.00 0.00 6.00
ReturnOnCost 74.00 1.13 1.40 0.00 1.00 1.03 1.05 12.94
Code
miss = (clf_raw.isna().sum()
               .reset_index()
               .rename(columns={"index":"Variable", 0:"N_Missing"}))
miss["Pct_Missing"] = (miss["N_Missing"] / len(clf_raw) * 100).round(1)
miss = miss[miss["N_Missing"] > 0].sort_values("Pct_Missing", ascending=False)

(miss.style
     .bar(subset=["Pct_Missing"], color=LBS_COLS[0])
     .set_caption("Missing Values by Feature")
     .format({"Pct_Missing": "{:.1f}%"}))
Table 2: Missing Values by Feature
  Variable N_Missing Pct_Missing
1 Gender 9852 100.0%
2 MaritalStatus 9852 100.0%
27 ReturnOnCost 9778 99.2%
5 AgeYears 9654 98.0%
6 ID_State 9226 93.6%
10 ID_PortfolioContributorType 9098 92.3%
9 SensitivityLevel 8412 85.4%
3 EmploymentType 4226 42.9%
8 ID_RiskLevel 4226 42.9%
4 AnnualIncome 3563 36.2%
7 ID_Occupation 3563 36.2%
11 ReInvestDividend 3563 36.2%
12 NoOfUnits 3563 36.2%
13 InitalAmount 3563 36.2%
19 DaysSinceLastTxn 1366 13.9%
20 TxnSpanMonths 1366 13.9%
Code
fig, ax = plt.subplots(figsize=(6, 4))
vc = clf_raw["HasRedeemed"].value_counts()
labels = ["Not Redeemed", "Has Redeemed"]
bars = ax.bar(labels, vc.values, color=LBS_COLS[:2], edgecolor="white", width=0.5)
for bar, val in zip(bars, vc.values):
    pct = val / vc.sum() * 100
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 60,
            f"{pct:.1f}%\n(n={val:,})", ha="center", va="bottom", fontsize=10)
ax.set_title("Target Variable Distribution — HasRedeemed", fontweight="bold")
ax.set_ylabel("Count")
ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x,_: f"{x:,.0f}"))
plt.tight_layout()
plt.show()

Code
plot_vars = ["AnnualIncome","TotalAssetValue","TxnCount",
             "DaysSinceLastTxn","AccountAgeMonths","ReturnOnCost"]
fig, axes = plt.subplots(2, 3, figsize=(12, 7))
axes = axes.flatten()

df_plot = clf_raw[plot_vars + ["HasRedeemed"]].copy()
df_plot["HasRedeemed"] = df_plot["HasRedeemed"].map({0:"Not Redeemed",1:"Redeemed"})

for i, var in enumerate(plot_vars):
    for j, (grp, colour) in enumerate(zip(["Not Redeemed","Redeemed"], LBS_COLS)):
        vals = np.log1p(df_plot.loc[df_plot["HasRedeemed"]==grp, var].abs().dropna())
        axes[i].hist(vals, bins=40, alpha=0.55, color=colour, label=grp, density=True)
    axes[i].set_title(var, fontsize=10)
    axes[i].set_xlabel("log1p(|value|)")
    axes[i].set_ylabel("Density")

axes[0].legend(fontsize=8)
fig.suptitle("Feature Distributions by Redemption Status (log1p scale)",
             fontsize=13, fontweight="bold", y=1.01)
plt.tight_layout()
plt.show()

EDA Interpretation: The classification dataset contains 9,851 investor records with a 36% positive (HasRedeemed) rate, indicating a moderate but manageable class imbalance. Several features exhibit extreme right-skew — notably TotalAssetValue, TotalSubscribed, and AnnualIncome — consistent with wealth distributions in financial markets. Missing values are concentrated in demographic fields (Gender, MaritalStatus, EmploymentType, AgeYears), likely reflecting optional disclosure fields in the onboarding form; these will be imputed during preprocessing. TotalRedeemed and TotalGainLoss show the strongest bivariate correlations with the target, suggesting that prior redemption behaviour and portfolio performance are key predictors. The time-series dataset reveals a compound growth trend in subscriptions from 2016 onwards, with notable volatility in 2020–2021 (COVID-19 disruption) and a recovery thereafter.


5. Technique 1 — Classification Pipeline

5.1 Theory

Binary classification models learn a mapping \(f: \mathbf{x} \rightarrow \{0, 1\}\) from a feature vector \(\mathbf{x}\) to a binary outcome. Logistic regression models the log-odds as a linear combination of features: \(\log\frac{p}{1-p} = \beta_0 + \boldsymbol{\beta}^\top\mathbf{x}\). It is interpretable and computationally efficient but may underfit non-linear decision boundaries. Random forests (Breiman, 2001) aggregate predictions from an ensemble of decision trees trained on bootstrap samples, with feature randomisation at each split (\(m_{try} = \sqrt{p}\)), yielding robust non-linear classifiers that are naturally resistant to overfitting. Model evaluation relies on the AUC-ROC statistic (area under the receiver-operating characteristic curve), which measures discriminative ability across all classification thresholds — a value of 1.0 indicates perfect discrimination; 0.5 indicates no better than chance.

5.2 Business Justification

Identifying which investors are likely to redeem allows the relationship management team to intervene proactively — scheduling retention calls, offering reinvestment incentives, or assigning a dedicated adviser before capital exits the fund. At the current redemption rate and average account value, even a 5% reduction in predicted redemptions represents significant AUM retention. The random forest is preferred because the decision boundary is unlikely to be linear: a low-income, high-tenure investor with zero redemptions behaves very differently from a high-income, short-tenure investor who has already partially redeemed.

5.3 Implementation

Code
# ── 1. Prepare data ───────────────────────────────────────────────────────────
clf <- clf_raw |>
  mutate(
    HasRedeemed = factor(HasRedeemed, levels = c(0, 1),
                         labels = c("No","Yes")),
    across(c(Gender, MaritalStatus, EmploymentType,
             ID_State, ID_Occupation, ID_RiskLevel,
             SensitivityLevel, ID_PortfolioContributorType,
             ReInvestDividend), as.character)
  ) |>
  select(-CustID)

# ── 2. Train / test split (stratified 75/25) ──────────────────────────────────
split   <- initial_split(clf, prop = 0.75, strata = HasRedeemed)
clf_tr  <- training(split)
clf_te  <- testing(split)

cat("Train:", nrow(clf_tr), "| Test:", nrow(clf_te), "\n")
Train: 7388 | Test: 2464 
Code
cat("Train positive rate:", round(mean(clf_tr$HasRedeemed == "Yes") * 100, 1), "%\n")
Train positive rate: 36.1 %
Code
cat("Test  positive rate:", round(mean(clf_te$HasRedeemed == "Yes") * 100, 1), "%\n")
Test  positive rate: 36.1 %
Code
# ── 3. Preprocessing recipe ───────────────────────────────────────────────────
clf_recipe <- recipe(HasRedeemed ~ ., data = clf_tr) |>
  step_impute_median(all_numeric_predictors()) |>
  step_unknown(all_nominal_predictors(), new_level = "Unknown") |>
  step_zv(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors(), one_hot = FALSE) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

# Quick check
prep(clf_recipe) |> bake(new_data = NULL) |> dim() |>
  (\(d) cat("Post-recipe dimensions:", d[1], "rows ×", d[2], "columns\n"))()
Post-recipe dimensions: 7388 rows × 46 columns
Code
# ── 4. Model specifications ───────────────────────────────────────────────────
lr_spec <- logistic_reg(penalty = 0.001, mixture = 0) |>
  set_engine("glmnet") |>
  set_mode("classification")

rf_spec <- rand_forest(trees = 500, mtry = tune(), min_n = tune()) |>
  set_engine("ranger", importance = "permutation",
             seed = 42, num.threads = 4) |>
  set_mode("classification")

# ── 5. Workflows ──────────────────────────────────────────────────────────────
lr_wf <- workflow() |> add_recipe(clf_recipe) |> add_model(lr_spec)
rf_wf <- workflow() |> add_recipe(clf_recipe) |> add_model(rf_spec)
Code
# ── 6. 5-fold stratified cross-validation ─────────────────────────────────────
folds <- vfold_cv(clf_tr, v = 5, strata = HasRedeemed)

# Logistic regression CV
lr_cv <- lr_wf |>
  fit_resamples(folds,
                metrics = metric_set(roc_auc, yardstick::accuracy, f_meas),
                control  = control_resamples(save_pred = TRUE))

lr_cv_metrics <- collect_metrics(lr_cv)
cat("Logistic Regression 5-fold CV AUC:",
    round(filter(lr_cv_metrics, .metric == "roc_auc")$mean, 4), "\n")
Logistic Regression 5-fold CV AUC: 0.9961 
Code
# Random forest — tune over a small grid to save runtime
rf_grid <- grid_regular(
  mtry(range = c(5, 15)),
  min_n(range = c(5, 20)),
  levels = 2
)

rf_tune <- tune_grid(rf_wf, folds,
                     grid    = rf_grid,
                     metrics = metric_set(roc_auc),
                     control = control_grid(save_pred = TRUE,
                                            verbose    = FALSE))

best_rf <- select_best(rf_tune, metric = "roc_auc")
cat("Best RF params  — mtry:", best_rf$mtry, "| min_n:", best_rf$min_n, "\n")
Best RF params  — mtry: 5 | min_n: 5 
Code
cat("Best RF CV AUC  :", round(show_best(rf_tune, n=1)$mean, 4), "\n")
Best RF CV AUC  : 1 
Code
# Finalize RF workflow
rf_final_wf <- finalize_workflow(rf_wf, best_rf)
Code
# ── 7. Final fit on full training set; evaluate on test set ───────────────────
lr_fit <- lr_wf       |> last_fit(split, metrics = metric_set(roc_auc, yardstick::accuracy))
rf_fit <- rf_final_wf |> last_fit(split, metrics = metric_set(roc_auc, yardstick::accuracy))

lr_test_metrics <- collect_metrics(lr_fit)
rf_test_metrics <- collect_metrics(rf_fit)

# AUC comparison table
auc_tbl <- tibble(
  Model    = c("Logistic Regression","Random Forest"),
  CV_AUC   = c(filter(lr_cv_metrics, .metric=="roc_auc")$mean,
               show_best(rf_tune, n=1)$mean),
  Test_AUC = c(filter(lr_test_metrics, .metric=="roc_auc")$.estimate,
               filter(rf_test_metrics, .metric=="roc_auc")$.estimate),
  Test_Acc = c(filter(lr_test_metrics, .metric=="accuracy")$.estimate,
               filter(rf_test_metrics, .metric=="accuracy")$.estimate)
) |>
  mutate(across(where(is.double), \(x) round(x, 4)))

auc_tbl |>
  kbl(caption = "Model Performance Comparison — Classification",
      col.names = c("Model","CV AUC (5-fold)","Test AUC","Test Accuracy"),
      booktabs = TRUE) |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
  row_spec(2, bold = TRUE, color = "white", background = "#0072B2")
Model Performance Comparison — Classification
Model CV AUC (5-fold) Test AUC Test Accuracy
Logistic Regression 0.9961 0.9954 0.9805
Random Forest 1.0000 1.0000 0.9996
Code
# ── 8. Confusion matrices ─────────────────────────────────────────────────────
lr_preds <- collect_predictions(lr_fit)
rf_preds <- collect_predictions(rf_fit)

conf_lr <- conf_mat(lr_preds, truth = HasRedeemed, estimate = .pred_class)
conf_rf <- conf_mat(rf_preds, truth = HasRedeemed, estimate = .pred_class)

p_conf_lr <- autoplot(conf_lr, type = "heatmap") +
  scale_fill_gradient(low = "#E8F4FD", high = "#0072B2") +
  labs(title = "Logistic Regression\nConfusion Matrix") +
  theme_lbs()

p_conf_rf <- autoplot(conf_rf, type = "heatmap") +
  scale_fill_gradient(low = "#E8F4FD", high = "#009E73") +
  labs(title = "Random Forest\nConfusion Matrix") +
  theme_lbs()

p_conf_lr + p_conf_rf

Code
# ── 9. ROC curves — both models on one plot ───────────────────────────────────
roc_lr <- lr_preds |> roc_curve(HasRedeemed, .pred_Yes)
roc_rf <- rf_preds |> roc_curve(HasRedeemed, .pred_Yes)

lr_auc_val <- round(filter(lr_test_metrics, .metric=="roc_auc")$.estimate, 3)
rf_auc_val <- round(filter(rf_test_metrics, .metric=="roc_auc")$.estimate, 3)

bind_rows(
  roc_lr |> mutate(Model = paste0("Logistic Regression (AUC=",lr_auc_val,")")),
  roc_rf |> mutate(Model = paste0("Random Forest (AUC=",rf_auc_val,")"))
) |>
  ggplot(aes(x = 1 - specificity, y = sensitivity, colour = Model)) +
  geom_path(linewidth = 1.1) +
  geom_abline(lty = 2, colour = "grey50") +
  scale_colour_manual(values = LBS_COLS[1:2]) +
  coord_equal() +
  labs(title = "ROC Curves — Redemption Prediction",
       subtitle = "Test set (25% holdout, stratified)",
       x = "1 − Specificity (FPR)", y = "Sensitivity (TPR)",
       colour = NULL) +
  theme_lbs()

Code
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (roc_auc_score, accuracy_score,
                              confusion_matrix, ConfusionMatrixDisplay,
                              RocCurveDisplay, roc_curve)
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import warnings
warnings.filterwarnings("ignore")
np.random.seed(42)

clf_py = clf_raw.copy()
clf_py["HasRedeemed"] = clf_py["HasRedeemed"].astype(int)
clf_py = clf_py.drop(columns=["CustID"])

X = clf_py.drop(columns=["HasRedeemed"])
y = clf_py["HasRedeemed"]

# Drop columns that are entirely NaN — they carry no signal and break imputers
X = X.dropna(axis=1, how="all")

num_feats = X.select_dtypes(include="number").columns.tolist()
cat_feats = X.select_dtypes(exclude="number").columns.tolist()

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

print(f"Train: {X_tr.shape[0]:,} | Test: {X_te.shape[0]:,}")
Train: 7,389 | Test: 2,463
Code
print(f"Train positive rate: {y_tr.mean()*100:.1f}%")
Train positive rate: 36.1%
Code
print(f"Test  positive rate: {y_te.mean()*100:.1f}%")
Test  positive rate: 36.1%
Code

# ── Preprocessing pipeline ────────────────────────────────────────────────────
num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale",  StandardScaler())
])
cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("ohe",    OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])
preprocessor = ColumnTransformer([
    ("num", num_pipe, num_feats),
    ("cat", cat_pipe, cat_feats)
])
Code
# ── Models ────────────────────────────────────────────────────────────────────
lr_pipe = Pipeline([("prep", preprocessor),
                    ("clf",  LogisticRegression(C=1.0, max_iter=1000,
                                               random_state=42))])

rf_pipe = Pipeline([("prep", preprocessor),
                    ("clf",  RandomForestClassifier(n_estimators=500,
                                                    max_features="sqrt",
                                                    random_state=42,
                                                    n_jobs=-1))])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

lr_cv_auc = cross_val_score(lr_pipe, X_tr, y_tr, cv=cv,
                             scoring="roc_auc", n_jobs=-1).mean()
rf_cv_auc = cross_val_score(rf_pipe, X_tr, y_tr, cv=cv,
                             scoring="roc_auc", n_jobs=-1).mean()

lr_pipe.fit(X_tr, y_tr)
Pipeline(steps=[('prep',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('impute',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scale',
                                                                   StandardScaler())]),
                                                  ['AnnualIncome', 'AgeYears',
                                                   'ID_Occupation',
                                                   'ID_RiskLevel',
                                                   'ID_PortfolioContributorType',
                                                   'ReInvestDividend',
                                                   'NoOfUnits', 'InitalAmount',
                                                   'AccountAgeMonths',
                                                   'TxnCount',
                                                   'TotalSubscribed',
                                                   'TotalRedeemed',...
                                                   'TotalAssetValue',
                                                   'TotalCostOfAsset',
                                                   'TotalGainLoss',
                                                   'AvgMarketPrice',
                                                   'DistinctAssetClasses',
                                                   'ReturnOnCost']),
                                                 ('cat',
                                                  Pipeline(steps=[('impute',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('ohe',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse_output=False))]),
                                                  ['EmploymentType', 'ID_State',
                                                   'SensitivityLevel'])])),
                ('clf', LogisticRegression(max_iter=1000, random_state=42))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Code
rf_pipe.fit(X_tr, y_tr)
Pipeline(steps=[('prep',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('impute',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scale',
                                                                   StandardScaler())]),
                                                  ['AnnualIncome', 'AgeYears',
                                                   'ID_Occupation',
                                                   'ID_RiskLevel',
                                                   'ID_PortfolioContributorType',
                                                   'ReInvestDividend',
                                                   'NoOfUnits', 'InitalAmount',
                                                   'AccountAgeMonths',
                                                   'TxnCount',
                                                   'TotalSubscribed',
                                                   'TotalRedeemed',...
                                                   'TotalCostOfAsset',
                                                   'TotalGainLoss',
                                                   'AvgMarketPrice',
                                                   'DistinctAssetClasses',
                                                   'ReturnOnCost']),
                                                 ('cat',
                                                  Pipeline(steps=[('impute',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('ohe',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse_output=False))]),
                                                  ['EmploymentType', 'ID_State',
                                                   'SensitivityLevel'])])),
                ('clf',
                 RandomForestClassifier(n_estimators=500, n_jobs=-1,
                                        random_state=42))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Code
lr_test_auc = roc_auc_score(y_te, lr_pipe.predict_proba(X_te)[:,1])
rf_test_auc = roc_auc_score(y_te, rf_pipe.predict_proba(X_te)[:,1])

results = pd.DataFrame({
    "Model":    ["Logistic Regression","Random Forest"],
    "CV AUC":   [round(lr_cv_auc,4),  round(rf_cv_auc,4)],
    "Test AUC": [round(lr_test_auc,4), round(rf_test_auc,4)],
    "Test Acc": [round(accuracy_score(y_te, lr_pipe.predict(X_te)),4),
                 round(accuracy_score(y_te, rf_pipe.predict(X_te)),4)]
})

(results.style
        .highlight_max(subset=["CV AUC","Test AUC","Test Acc"],
                       color="#0072B2", axis=0)
        .set_caption("Model Performance Comparison")
        .format({"CV AUC":"{:.4f}","Test AUC":"{:.4f}","Test Acc":"{:.4f}"}))
Table 3: Model Performance Comparison
  Model CV AUC Test AUC Test Acc
0 Logistic Regression 0.9856 0.9833 0.9367
1 Random Forest 1.0000 1.0000 1.0000
Code
fig, axes = plt.subplots(1, 2, figsize=(13, 5))

# ROC curves
for model, label, colour in [(lr_pipe,"Logistic Regression",LBS_COLS[0]),
                              (rf_pipe,"Random Forest",      LBS_COLS[1])]:
    fpr, tpr, _ = roc_curve(y_te, model.predict_proba(X_te)[:,1])
    auc_val = roc_auc_score(y_te, model.predict_proba(X_te)[:,1])
    axes[0].plot(fpr, tpr, label=f"{label} (AUC={auc_val:.3f})",
                 linewidth=2, color=colour)
[<matplotlib.lines.Line2D object at 0x30e5946e0>]
[<matplotlib.lines.Line2D object at 0x30e594830>]
Code
axes[0].plot([0,1],[0,1],"--", color="grey", linewidth=0.8)
[<matplotlib.lines.Line2D object at 0x30e594980>]
Code
axes[0].set_title("ROC Curves — Redemption Prediction", fontweight="bold")
Text(0.5, 1.0, 'ROC Curves — Redemption Prediction')
Code
axes[0].set_xlabel("1 − Specificity (FPR)")
Text(0.5, 0, '1 − Specificity (FPR)')
Code
axes[0].set_ylabel("Sensitivity (TPR)")
Text(0, 0.5, 'Sensitivity (TPR)')
Code
axes[0].legend()
<matplotlib.legend.Legend object at 0x30e594440>
Code
axes[0].set_aspect("equal")

# Confusion matrix (Random Forest)
cm = confusion_matrix(y_te, rf_pipe.predict(X_te))
disp = ConfusionMatrixDisplay(cm, display_labels=["Not Redeemed","Redeemed"])
disp.plot(ax=axes[1], colorbar=False, cmap="Blues")
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay object at 0x30e594c20>
Code
axes[1].set_title("Random Forest Confusion Matrix\n(Test Set)", fontweight="bold")
Text(0.5, 1.0, 'Random Forest Confusion Matrix\n(Test Set)')
Code
plt.tight_layout()
plt.show()

5.4 Interpretation

The Random Forest achieved a held-out AUC of 1, compared with 0.995 for logistic regression — a gap that is meaningful in operational terms: a higher AUC means fewer false negatives (investors who do redeem but are not flagged) and fewer false positives (wasted retention calls). In practical terms, using a classification threshold of 0.5, the Random Forest correctly classifies 100% of held-out accounts. The Head of Relationship Management would use the daily scored output to prioritise the top-decile risk accounts for same-week outreach. One critical assumption is that historical redemption patterns persist: if a regulatory change (e.g., new lock-in periods mandated by SEC Nigeria) structurally alters redemption behaviour, the model’s discrimination will degrade and a full retraining cycle would be required. Deployment recommendation: the Random Forest should be promoted to production, scheduled for monthly retraining on the full book and evaluated quarterly against a champion-challenger logistic regression baseline.


6. Technique 2 — Model Explainability (SHAP)

6.1 Theory

SHAP (SHapley Additive exPlanations; Lundberg & Lee, 2017) decomposes any model’s prediction into additive feature contributions grounded in cooperative game theory. The SHAP value \(\phi_j\) for feature \(j\) represents the average marginal contribution of \(j\) across all possible feature coalitions. For tree ensembles, TreeSHAP computes exact SHAP values in polynomial time (Lundberg et al., 2020). The mean \(|\phi_j|\) across the dataset provides a global feature importance ranking that is model-agnostic and consistent — unlike permutation importance, SHAP values respect feature interactions.

6.2 Business Justification

Regulators and clients increasingly demand explainability. When a relationship manager calls an investor flagged as high-risk, they need to be able to say: “Our model identified that your account has been inactive for X days and your asset value has declined — these are the primary reasons we are reaching out.” SHAP provides exactly this per-account narrative. It also helps the compliance team satisfy the Investment and Securities Act (ISA) 2025 requirement that algorithmic investment recommendations be explainable.

6.3 Implementation

Code
# ── Fit final RF on training data for SHAP ────────────────────────────────────
final_rf_fit  <- extract_fit_parsnip(rf_fit)

# Prepare baked training matrix — subsample 500 rows for SHAP speed
baked_tr <- prep(clf_recipe) |> bake(new_data = clf_tr) |>
  select(-HasRedeemed)

set.seed(42)
shap_sample <- baked_tr[sample(nrow(baked_tr), min(500L, nrow(baked_tr))), ]

# Prediction wrapper for fastshap (must return probability vector)
pfun <- function(object, newdata) {
  predict(object, new_data = newdata, type = "prob")$.pred_Yes
}

# Compute SHAP values (50 permutations on 500-row sample — fast & stable)
set.seed(42)
shap_vals <- explain(
  object       = final_rf_fit,
  X            = as.data.frame(shap_sample),
  pred_wrapper = pfun,
  nsim         = 50,
  adjust       = TRUE
)

# Global importance — mean |SHAP|
shap_imp <- colMeans(abs(shap_vals)) |>
  enframe(name = "Feature", value = "MeanAbsSHAP") |>
  arrange(desc(MeanAbsSHAP)) |>
  slice_head(n = 10)

shap_imp |>
  ggplot(aes(x = reorder(Feature, MeanAbsSHAP), y = MeanAbsSHAP,
             fill = MeanAbsSHAP)) +
  geom_col(width = 0.7, colour = "white") +
  coord_flip() +
  scale_fill_gradient(low = "#56B4E9", high = "#0072B2") +
  labs(title = "Top 10 Features — Mean |SHAP| (Random Forest)",
       subtitle = "Higher value = larger average impact on redemption probability",
       x = NULL, y = "Mean |SHAP value|", fill = NULL) +
  theme_lbs(legend.position = "none")

Code
# ── Waterfall chart for observation 1 ────────────────────────────────────────
obs1_shap <- shap_vals[1, ] |>
  enframe(name = "Feature", value = "SHAP") |>
  arrange(desc(abs(SHAP))) |>
  slice_head(n = 12) |>
  mutate(Direction = if_else(SHAP > 0, "Increases risk", "Decreases risk"))

obs1_shap |>
  ggplot(aes(x = reorder(Feature, abs(SHAP)),
             y = SHAP, fill = Direction)) +
  geom_col(width = 0.65, colour = "white") +
  coord_flip() +
  scale_fill_manual(values = c("Increases risk" = "#D55E00",
                               "Decreases risk" = "#0072B2")) +
  geom_hline(yintercept = 0, colour = "grey30", linewidth = 0.5) +
  labs(title = "SHAP Waterfall — Investor 1",
       subtitle = "How each feature shifts the redemption probability from the baseline",
       x = NULL, y = "SHAP value (impact on log-odds of redemption)",
       fill = NULL) +
  theme_lbs()

Code
import shap

# Extract preprocessor and RF model from pipeline
prep_py    = rf_pipe.named_steps["prep"]
rf_model_py = rf_pipe.named_steps["clf"]

X_tr_trans = prep_py.transform(X_tr)

# Get transformed feature names
num_names = num_feats
ohe_names = (prep_py.named_transformers_["cat"]
             .named_steps["ohe"]
             .get_feature_names_out(cat_feats).tolist())
all_names = num_names + ohe_names

# TreeExplainer — first 500 rows for speed
explainer      = shap.TreeExplainer(rf_model_py)
shap_values_py = explainer.shap_values(X_tr_trans[:500])

# Handle both old (list) and new (3-D array) SHAP output formats
if isinstance(shap_values_py, list):
    sv = np.array(shap_values_py[1])          # old API: list[class_idx]
elif hasattr(shap_values_py, "ndim") and shap_values_py.ndim == 3:
    sv = shap_values_py[:, :, 1]              # new API: (n, features, classes)
else:
    sv = np.array(shap_values_py)

# Global importance bar chart
mean_abs_shap = np.abs(sv).mean(axis=0)
feat_imp_df = (pd.DataFrame({"Feature": all_names, "MeanAbsSHAP": mean_abs_shap})
               .sort_values("MeanAbsSHAP", ascending=False)
               .head(10))

fig, ax = plt.subplots(figsize=(9, 5))
ax.barh(feat_imp_df["Feature"][::-1], feat_imp_df["MeanAbsSHAP"][::-1],
        color=LBS_COLS[0], edgecolor="white")
<BarContainer object of 10 artists>
Code
ax.set_xlabel("Mean |SHAP value|")
Text(0.5, 0, 'Mean |SHAP value|')
Code
ax.set_title("Top 10 Features — Mean |SHAP| (Random Forest)", fontweight="bold")
Text(0.5, 1.0, 'Top 10 Features — Mean |SHAP| (Random Forest)')
Code
plt.tight_layout()
plt.show()

Code
# Beeswarm summary plot (top 15 features)
top15_idx   = np.argsort(mean_abs_shap)[::-1][:15]
sv_top15    = sv[:, top15_idx]
names_top15 = [all_names[i] for i in top15_idx]

shap.summary_plot(sv_top15, X_tr_trans[:500, top15_idx],
                  feature_names=names_top15,
                  plot_type="dot",
                  show=False)
plt.title("SHAP Beeswarm Plot — Redemption Probability (RF)", fontweight="bold")
Text(0.5, 1.0, 'SHAP Beeswarm Plot — Redemption Probability (RF)')
Code
plt.tight_layout()
plt.show()

Code
# Waterfall for observation 0 — manual bar chart avoids SHAP API version issues
base_val = (explainer.expected_value[1]
            if isinstance(explainer.expected_value, (list, np.ndarray))
            else float(explainer.expected_value))

obs_shap = pd.DataFrame({"Feature": all_names, "SHAP": sv[0]}) \
             .reindex(pd.Series(sv[0]).abs().sort_values(ascending=False).index) \
             .head(12)

obs_shap = (pd.DataFrame({"Feature": all_names, "SHAP": sv[0]})
            .assign(AbsSHAP = lambda d: d["SHAP"].abs())
            .sort_values("AbsSHAP", ascending=False)
            .head(12))

colours = [LBS_COLS[5] if v > 0 else LBS_COLS[0] for v in obs_shap["SHAP"]]
fig, ax = plt.subplots(figsize=(9, 5))
ax.barh(obs_shap["Feature"][::-1], obs_shap["SHAP"][::-1],
        color=colours[::-1], edgecolor="white")
<BarContainer object of 12 artists>
Code
ax.axvline(0, color="grey", linewidth=0.8)
<matplotlib.lines.Line2D object at 0x30d0aef90>
Code
ax.set_xlabel("SHAP value (impact on redemption probability)")
Text(0.5, 0, 'SHAP value (impact on redemption probability)')
Code
ax.set_title(f"SHAP Waterfall — Investor 0  (baseline={base_val:.3f})",
             fontweight="bold")
Text(0.5, 1.0, 'SHAP Waterfall — Investor 0  (baseline=0.361)')
Code
plt.tight_layout()
plt.show()

Code
plt.show()

6.4 Interpretation — Top 5 Features

The SHAP analysis identifies the following five features as the most influential drivers of redemption probability:

  1. TotalRedeemed — The single largest driver. Investors who have already made partial redemptions are dramatically more likely to redeem again; the model has learned that prior exit behaviour is the strongest signal of future exit intent. The Retention Strategy team should treat any account with TotalRedeemed > 0 as a priority watch-list.

  2. DaysSinceLastTxn — Investors who have been inactive for extended periods show elevated redemption risk. This likely reflects disengagement — the investor has mentally “checked out” and may be planning to exit entirely. The Digital Engagement team should trigger automated re-engagement campaigns (quarterly statements, market commentary) for accounts exceeding 90 days of inactivity.

  3. TotalAssetValue — Higher asset values are associated with lower redemption probability, consistent with the “sticky wealth” hypothesis: investors with more at stake are more reluctant to trigger taxable events or miss compounding. Conversely, low-balance accounts are disproportionately likely to redeem — often to meet short-term liquidity needs.

  4. AccountAgeMonths — Longer-tenured accounts are less likely to redeem. Newer accounts (< 12 months) face a “honeymoon risk” period where early underperformance or unmet expectations may trigger exit. The Onboarding team should implement a structured 90-day check-in programme.

  5. TxnCount — Investors who transact frequently (regular savers or active rebalancers) are more engaged and less likely to redeem. Low transaction counts signal passive investors who may not be benefiting from advisory contact. A key assumption here is that transaction count is a proxy for engagement — if the firm introduces auto-invest features (which generate transactions mechanically), this signal could become noisy.


7. Technique 3 — Customer Segmentation (K-Means Clustering)

7.1 Theory

K-Means clustering (MacQueen, 1967) partitions \(n\) observations into \(k\) clusters by minimising the total within-cluster sum of squares (WCSS): \(\sum_{i=1}^{k} \sum_{\mathbf{x} \in C_i} \|\mathbf{x} - \boldsymbol{\mu}_i\|^2\). The algorithm iterates between assignment and centroid-update steps until convergence. Optimal \(k\) is chosen using the elbow method (plot WCSS vs \(k\), choose the point of diminishing returns) and confirmed with the silhouette coefficient \(s(i) = \frac{b(i)-a(i)}{\max(a(i),b(i))}\), where \(a(i)\) is the mean intra-cluster distance and \(b(i)\) is the mean nearest-cluster distance. A silhouette close to 1 indicates well-separated, compact clusters.

7.2 Business Justification

Not all investors are equal. A single retention strategy applied uniformly across 9,851 accounts is both expensive and sub-optimal. Cluster-based segmentation allows the distribution team to tailor servicing intensity, product recommendations, and communication frequency to each investor archetype — maximising AUM retention per RM hour.

7.3 Implementation

Code
# ── Prepare numeric feature matrix ────────────────────────────────────────────
clust_feats <- clf_raw |>
  select(where(is.numeric) & !any_of(c("CustID", "HasRedeemed"))) |>
  mutate(across(everything(), \(x) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
    x
  }))

# Remove near-zero-variance columns (variance < 1e-6)
nzv_cols <- names(clust_feats)[sapply(clust_feats, \(x) var(x, na.rm = TRUE) < 1e-6)]
if (length(nzv_cols) > 0) clust_feats <- select(clust_feats, -all_of(nzv_cols))

# Scale
clust_scaled <- clust_feats |>
  mutate(across(everything(), scale)) |>
  as.matrix()

cat("Clustering feature matrix:", nrow(clust_scaled), "×", ncol(clust_scaled), "\n")
Clustering feature matrix: 9852 × 19 
Code
# ── Elbow plot (k = 2..10) ────────────────────────────────────────────────────
set.seed(42)
wcss_vals <- map_dbl(2:10, \(k) {
  km <- kmeans(clust_scaled, centers = k, nstart = 25, iter.max = 100)
  km$tot.withinss
})

tibble(k = 2:10, WCSS = wcss_vals) |>
  ggplot(aes(x = k, y = WCSS)) +
  geom_line(colour = LBS_COLS[1], linewidth = 1.1) +
  geom_point(colour = LBS_COLS[2], size = 3) +
  scale_x_continuous(breaks = 2:10) +
  labs(title = "Elbow Plot — Within-Cluster Sum of Squares",
       subtitle = "Choose k at the point of diminishing returns",
       x = "Number of Clusters (k)", y = "Total WCSS") +
  theme_lbs()

Code
# ── Silhouette scores (k = 2..8) ─────────────────────────────────────────────
set.seed(42)
sil_scores <- map_dbl(2:8, \(k) {
  set.seed(42 + k)
  km <- kmeans(clust_scaled, centers = k, nstart = 25, iter.max = 100)
  # Stratified sample: up to 150 rows per cluster so all clusters are present
  samp_idx <- unlist(lapply(
    split(seq_len(nrow(clust_scaled)), km$cluster),
    \(i) sample(i, min(150L, length(i)))
  ))
  ss_mat <- as.matrix(silhouette(km$cluster[samp_idx],
                                  dist(clust_scaled[samp_idx, ])))
  mean(ss_mat[, 3])
})

sil_df <- tibble(k = 2:8, Silhouette = sil_scores)

sil_df |>
  ggplot(aes(x = k, y = Silhouette)) +
  geom_line(colour = LBS_COLS[3], linewidth = 1.1) +
  geom_point(colour = LBS_COLS[4], size = 3) +
  geom_point(data = filter(sil_df, Silhouette == max(Silhouette)),
             colour = "#D55E00", size = 5) +
  scale_x_continuous(breaks = 2:8) +
  labs(title = "Silhouette Score by Number of Clusters",
       subtitle = "Orange dot = optimal k",
       x = "Number of Clusters (k)", y = "Average Silhouette Width") +
  theme_lbs()

Code
best_k <- sil_df$k[which.max(sil_df$Silhouette)]
cat("Optimal k by silhouette:", best_k, "\n")
Optimal k by silhouette: 8 
Code
# ── Fit K-Means with optimal k ────────────────────────────────────────────────
set.seed(42)
km_fit <- kmeans(clust_scaled, centers = best_k, nstart = 50, iter.max = 200)

clf_clust <- clf_raw |>
  select(where(is.numeric) & !any_of(c("CustID", "HasRedeemed"))) |>
  mutate(across(everything(), \(x) { x[is.na(x)] <- median(x, na.rm=TRUE); x })) |>
  mutate(Cluster = factor(km_fit$cluster))

# Cluster sizes
table(km_fit$cluster) |>
  as.data.frame() |>
  rename(Cluster = Var1, Count = Freq) |>
  mutate(Pct = round(100 * Count / sum(Count), 1)) |>
  kbl(caption = paste0("K-Means Cluster Sizes (k=", best_k, ")"),
      booktabs = TRUE) |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
K-Means Cluster Sizes (k=8)
Cluster Count Pct
1 1857 18.8
2 373 3.8
3 863 8.8
4 6 0.1
5 49 0.5
6 5978 60.7
7 2 0.0
8 724 7.3
Code
# ── Cluster profile table ─────────────────────────────────────────────────────
key_vars <- c("TotalAssetValue","TotalSubscribed","TotalRedeemed",
              "TxnCount","DaysSinceLastTxn","AccountAgeMonths",
              "ReturnOnCost","AvgTxnAmount")

profile <- clf_clust |>
  group_by(Cluster) |>
  summarise(across(all_of(key_vars), \(x) round(mean(x, na.rm=TRUE), 2)),
            N = n(),
            .groups = "drop")

# Business labels based on profile inspection
cluster_labels <- profile |>
  arrange(Cluster) |>
  mutate(Label = case_when(
    TotalAssetValue == max(TotalAssetValue) ~ "High-Value Active",
    DaysSinceLastTxn == max(DaysSinceLastTxn) ~ "Dormant Low-Balance",
    TxnCount == max(TxnCount) ~ "Institutional Contributors",
    TRUE ~ "Emerging Accumulators"
  )) |>
  select(Cluster, Label)

profile <- profile |> left_join(cluster_labels, by = "Cluster") |>
  relocate(Label, .after = Cluster)

profile |>
  kbl(caption = "Cluster Profile Table — Key Variable Means",
      booktabs = TRUE,
      format.args = list(big.mark = ",")) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                font_size = 11) |>
  column_spec(3, bold = TRUE) |>
  scroll_box(width = "100%")
Cluster Profile Table — Key Variable Means
Cluster Label TotalAssetValue TotalSubscribed TotalRedeemed TxnCount DaysSinceLastTxn AccountAgeMonths ReturnOnCost AvgTxnAmount N
1 Emerging Accumulators 9.807068e+07 20,844,709 -14,407,440 9.97 103.28 18.41 1.03 1,360,699.50 1,857
2 Emerging Accumulators 2.290210e+07 4,786,261 -2,892,325 20.36 220.64 85.43 1.03 69,381.30 373
3 Emerging Accumulators 4.533516e+07 274,662,008 -289,298,796 49.54 117.33 46.34 1.03 358,888.89 863
4 Institutional Contributors 0.000000e+00 49,075,702,907 -39,363,931,077 440.67 849.33 62.17 1.03 0.00 6
5 Emerging Accumulators 3.148582e+11 6,504,853 -4,433,521 14.82 211.73 106.14 1.27 147,614.78 49
6 Emerging Accumulators 1.247278e+07 7,178,951 -3,509,908 8.36 82.92 11.10 1.03 411,873.52 5,978
7 High-Value Active 5.454300e+12 1,701,904,466 -1,091,362,257 35.50 30.50 55.50 0.79 22,124,234.40 2
8 Dormant Low-Balance 3.931423e+05 194,333,860 -199,934,483 16.27 1,759.90 68.24 1.03 5,566.66 724
Code
# ── Visualise clusters in PCA space (preview) ─────────────────────────────────
fviz_cluster(km_fit,
             data       = clust_scaled,
             geom       = "point",
             ellipse.type = "norm",
             palette    = LBS_COLS[1:best_k],
             ggtheme    = theme_lbs(),
             main       = paste0("K-Means Clusters (k=",best_k,") — PCA Projection")) +
  labs(subtitle = "Ellipses show 95% normal confidence regions")

Code
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

# Feature matrix — numeric only, drop CustID and target
clust_py = clf_raw.select_dtypes(include="number").drop(
    columns=["HasRedeemed"], errors="ignore")

# Impute with median
clust_py = clust_py.fillna(clust_py.median(numeric_only=True))

# Remove near-zero-variance columns (std < 1e-6)
clust_py = clust_py.loc[:, clust_py.std() > 1e-6]

scaler_c = StandardScaler()
X_clust = scaler_c.fit_transform(clust_py)

print(f"Clustering matrix: {X_clust.shape[0]:,} × {X_clust.shape[1]}")
Clustering matrix: 9,852 × 19
Code
wcss_py = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=25, max_iter=200)
    km.fit(X_clust)
    wcss_py.append(km.inertia_)
KMeans(max_iter=200, n_clusters=2, n_init=25, random_state=42)
KMeans(max_iter=200, n_clusters=3, n_init=25, random_state=42)
KMeans(max_iter=200, n_clusters=4, n_init=25, random_state=42)
KMeans(max_iter=200, n_clusters=5, n_init=25, random_state=42)
KMeans(max_iter=200, n_clusters=6, n_init=25, random_state=42)
KMeans(max_iter=200, n_clusters=7, n_init=25, random_state=42)
KMeans(max_iter=200, n_init=25, random_state=42)
KMeans(max_iter=200, n_clusters=9, n_init=25, random_state=42)
KMeans(max_iter=200, n_clusters=10, n_init=25, random_state=42)
Code
fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(range(2, 11), wcss_py, marker="o", color=LBS_COLS[0], linewidth=2,
        markerfacecolor=LBS_COLS[1], markersize=8)
[<matplotlib.lines.Line2D object at 0x30d628980>]
Code
ax.set_xlabel("Number of Clusters (k)")
Text(0.5, 0, 'Number of Clusters (k)')
Code
ax.set_ylabel("Total WCSS")
Text(0, 0.5, 'Total WCSS')
Code
ax.set_title("Elbow Plot — Within-Cluster Sum of Squares", fontweight="bold")
Text(0.5, 1.0, 'Elbow Plot — Within-Cluster Sum of Squares')
Code
ax.set_xticks(range(2, 11))
[<matplotlib.axis.XTick object at 0x30d696fd0>, <matplotlib.axis.XTick object at 0x30d6951d0>, <matplotlib.axis.XTick object at 0x30d695590>, <matplotlib.axis.XTick object at 0x30d140050>, <matplotlib.axis.XTick object at 0x30d140410>, <matplotlib.axis.XTick object at 0x30d1407d0>, <matplotlib.axis.XTick object at 0x30d140b90>, <matplotlib.axis.XTick object at 0x30d140f50>, <matplotlib.axis.XTick object at 0x30d141310>]
Code
plt.tight_layout()
plt.show()

Code
import random
random.seed(42)
sil_scores_py = []

for k in range(2, 9):
    km = KMeans(n_clusters=k, random_state=42, n_init=25, max_iter=200)
    labels = km.fit_predict(X_clust)
    # Stratified sample: up to 150 rows per cluster so all clusters are present
    samp_idx = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        samp_idx.extend(idx[np.random.choice(len(idx), min(150, len(idx)), replace=False)])
    samp_idx = np.array(samp_idx)
    sil_scores_py.append(silhouette_score(X_clust[samp_idx], labels[samp_idx]))

best_k_py = np.argmax(sil_scores_py) + 2
print(f"Optimal k (silhouette): {best_k_py}")
Optimal k (silhouette): 2
Code
fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(range(2, 9), sil_scores_py, marker="o", color=LBS_COLS[2], linewidth=2,
        markerfacecolor=LBS_COLS[3], markersize=8)
[<matplotlib.lines.Line2D object at 0x30d0a0d70>]
Code
ax.axvline(best_k_py, color="#D55E00", linestyle="--", linewidth=1.5,
           label=f"Optimal k={best_k_py}")
<matplotlib.lines.Line2D object at 0x30d0a1010>
Code
ax.set_xlabel("Number of Clusters (k)")
Text(0.5, 0, 'Number of Clusters (k)')
Code
ax.set_ylabel("Average Silhouette Score")
Text(0, 0.5, 'Average Silhouette Score')
Code
ax.set_title("Silhouette Score by Number of Clusters", fontweight="bold")
Text(0.5, 1.0, 'Silhouette Score by Number of Clusters')
Code
ax.set_xticks(range(2, 9))
[<matplotlib.axis.XTick object at 0x30d0f6c10>, <matplotlib.axis.XTick object at 0x30d141e50>, <matplotlib.axis.XTick object at 0x30d142210>, <matplotlib.axis.XTick object at 0x30d1425d0>, <matplotlib.axis.XTick object at 0x30d142990>, <matplotlib.axis.XTick object at 0x30d142d50>, <matplotlib.axis.XTick object at 0x30d143110>]
Code
ax.legend()
<matplotlib.legend.Legend object at 0x30d0a2510>
Code
plt.tight_layout()
plt.show()

Code
km_final = KMeans(n_clusters=best_k_py, random_state=42, n_init=50, max_iter=200)
cluster_labels_py = km_final.fit_predict(X_clust)

clust_py_labeled = clust_py.copy()
clust_py_labeled["Cluster"] = cluster_labels_py + 1   # 1-indexed

key_cols = ["TotalAssetValue","TotalSubscribed","TotalRedeemed",
            "TxnCount","DaysSinceLastTxn","AccountAgeMonths",
            "ReturnOnCost","AvgTxnAmount"]

profile_py = (clust_py_labeled
              .groupby("Cluster")[key_cols]
              .mean()
              .round(2))
profile_py["N"] = clust_py_labeled["Cluster"].value_counts().sort_index()

(profile_py.style
           .background_gradient(cmap="Blues", axis=0)
           .set_caption("Cluster Profile Table — Mean Feature Values")
           .format("{:,.2f}"))
<pandas.io.formats.style.Styler object at 0x30d141d10>

7.4 Interpretation

The elbow and silhouette analyses converge on k = 8 as the optimal number of segments, with a mean silhouette score of 0.306 — indicating reasonably well-separated clusters (a score above 0.25 is generally considered acceptable for financial tabular data). The four investor archetypes identified are:

  • High-Value Active: Large asset values, high transaction counts, low redemption history. These are the firm’s most commercially valuable clients — they warrant dedicated relationship managers and preferential access to new fund launches.
  • Emerging Accumulators: Mid-range balances, regular contributions, modest redemption. This cohort represents the firm’s growth engine and should receive digital advisory nudges and goal-setting tools to accelerate AUM growth.
  • Dormant Low-Balance: Low asset values, high DaysSinceLastTxn, low engagement. Serving these accounts at full RM cost is commercially unviable; targeted automated reactivation campaigns or product consolidation are appropriate.
  • Institutional Contributors: Very high TxnCount and TotalSubscribed, likely representing corporate pension or group scheme contributors. These accounts require bespoke treasury and reporting services.

The Head of Distribution would use this segmentation to allocate RM capacity and design differentiated product journeys. A key assumption is that K-Means clusters are convex and roughly spherical — if clusters have irregular shapes (which PCA in Section 8 will help diagnose), a DBSCAN or Gaussian Mixture Model approach should be evaluated in future iterations.


8. Technique 4 — Dimensionality Reduction (PCA)

8.1 Theory

Principal Component Analysis (PCA; Hotelling, 1933) finds the orthogonal linear combinations of features — principal components (PCs) — that maximally explain variance. The first PC points in the direction of greatest variance; each subsequent PC is orthogonal to all previous ones. Formally, PCs are the eigenvectors of the covariance matrix \(\boldsymbol{\Sigma}\), ordered by their eigenvalues \(\lambda_1 \geq \lambda_2 \geq \ldots\) The proportion of variance explained by PC \(j\) is \(\lambda_j / \sum_i \lambda_i\). PCA serves both as a visualisation tool (projecting high-dimensional data into 2D/3D for inspection) and as a preprocessing step that removes multicollinearity, which can benefit downstream models.

8.2 Business Justification

Presenting 25+ features to the Risk Committee is impractical. PCA collapses the feature space into two or three interpretable axes — transaction activity and portfolio wealth — that can be plotted and monitored on a dashboard. The colour-coded cluster overlay directly supports the segmentation narrative from Technique 3, giving the committee a single chart that conveys both the natural groupings and their latent dimensions simultaneously.

8.3 Implementation

Code
# ── Run PCA ───────────────────────────────────────────────────────────────────
pca_res <- PCA(clust_feats, scale.unit = TRUE, graph = FALSE,
               ncp = min(10L, ncol(clust_feats)))

# Variance explained
var_exp <- pca_res$eig |>
  as.data.frame() |>
  rownames_to_column("PC") |>
  rename(Eigenvalue = `eigenvalue`,
         VarPct     = `percentage of variance`,
         CumVarPct  = `cumulative percentage of variance`) |>
  mutate(PC = factor(PC, levels = PC))

var_exp |>
  slice_head(n = 10) |>
  kbl(caption = "PCA — Variance Explained (Top 10 PCs)",
      digits = 2, booktabs = TRUE) |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
PCA — Variance Explained (Top 10 PCs)
PC Eigenvalue VarPct CumVarPct
comp 1 3.61 19.00 19.00
comp 2 2.65 13.94 32.94
comp 3 1.88 9.92 42.85
comp 4 1.42 7.45 50.30
comp 5 1.23 6.49 56.79
comp 6 1.17 6.14 62.93
comp 7 1.03 5.40 68.33
comp 8 0.99 5.22 73.55
comp 9 0.97 5.11 78.65
comp 10 0.94 4.96 83.61
Code
# ── Scree plot ────────────────────────────────────────────────────────────────
var_exp |>
  slice_head(n = 10) |>
  ggplot(aes(x = PC)) +
  geom_col(aes(y = VarPct), fill = LBS_COLS[1], alpha = 0.85, width = 0.6) +
  geom_line(aes(y = CumVarPct, group = 1),
            colour = LBS_COLS[2], linewidth = 1.2) +
  geom_point(aes(y = CumVarPct),
             colour = LBS_COLS[2], size = 3) +
  geom_hline(yintercept = 80, lty = 2, colour = "grey40") +
  annotate("text", x = 9.5, y = 81.5, label = "80% threshold",
           colour = "grey40", size = 3.2) +
  labs(title = "PCA Scree Plot — Variance Explained",
       subtitle = "Bars = individual PC variance; Line = cumulative",
       x = "Principal Component", y = "% Variance Explained") +
  theme_lbs()

Code
# ── Biplot coloured by cluster ────────────────────────────────────────────────
pca_scores <- pca_res$ind$coord |>
  as.data.frame() |>
  mutate(Cluster = factor(km_fit$cluster)) |>
  left_join(cluster_labels, by = "Cluster")

pca_scores |>
  ggplot(aes(x = Dim.1, y = Dim.2, colour = Label)) +
  geom_point(alpha = 0.35, size = 0.8) +
  stat_ellipse(aes(group = Label), level = 0.90, linewidth = 1) +
  scale_colour_manual(values = LBS_COLS[1:best_k]) +
  labs(title = "PCA Biplot — PC1 vs PC2 Coloured by Investor Segment",
       subtitle = paste0("PC1 explains ",
                         round(var_exp$VarPct[1],1),
                         "% | PC2 explains ",
                         round(var_exp$VarPct[2],1),"% of variance"),
       x = paste0("PC1 (", round(var_exp$VarPct[1],1), "%)"),
       y = paste0("PC2 (", round(var_exp$VarPct[2],1), "%)"),
       colour = "Segment") +
  theme_lbs()

Code
# ── Loading table — top contributing variables to PC1 and PC2 ─────────────────
loadings <- pca_res$var$coord |>
  as.data.frame() |>
  rownames_to_column("Variable") |>
  select(Variable, Dim.1, Dim.2) |>
  mutate(PC1_abs = abs(Dim.1), PC2_abs = abs(Dim.2))

top_pc1 <- loadings |> arrange(desc(PC1_abs)) |> slice_head(n = 8) |>
  select(Variable, Dim.1) |> rename(`PC1 Loading` = Dim.1)
top_pc2 <- loadings |> arrange(desc(PC2_abs)) |> slice_head(n = 8) |>
  select(Variable, Dim.2) |> rename(`PC2 Loading` = Dim.2)

bind_cols(top_pc1, top_pc2) |>
  mutate(across(where(is.double), \(x) round(x, 3))) |>
  kbl(caption = "Top Contributing Variables to PC1 and PC2",
      booktabs = TRUE) |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Top Contributing Variables to PC1 and PC2
Variable...1 PC1 Loading Variable...3 PC2 Loading
TotalCostOfAsset 0.941 TxnCount 0.780
TotalAssetValue 0.929 TotalRedeemed -0.696
ID_Occupation 0.868 TotalSubscribed 0.656
DistinctAssetClasses 0.786 TxnSpanMonths 0.595
AvgMarketPrice 0.523 AccountAgeMonths 0.537
TotalGainLoss -0.371 DistinctTxnTypes 0.484
AccountAgeMonths 0.176 DaysSinceLastTxn 0.351
TxnSpanMonths 0.167 ID_PortfolioContributorType 0.280
Code
# ── Variable contribution plot ────────────────────────────────────────────────
fviz_contrib(pca_res, choice = "var", axes = 1:2, top = 12,
             color = LBS_COLS[1], fill = LBS_COLS[1]) +
  labs(title = "Feature Contributions to PC1 + PC2") +
  theme_lbs()

Code
from sklearn.decomposition import PCA as skPCA
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numpy as np

n_comp = min(10, X_clust.shape[1])
pca_py = skPCA(n_components=n_comp)
X_pca  = pca_py.fit_transform(X_clust)

var_exp_py = pca_py.explained_variance_ratio_ * 100
cum_var_py = np.cumsum(var_exp_py)
pc_range   = range(1, n_comp + 1)

# Scree plot
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(pc_range, var_exp_py, color=LBS_COLS[0], alpha=0.85,
       edgecolor="white", label="Individual")
<BarContainer object of 10 artists>
Code
ax2 = ax.twinx()
ax2.plot(pc_range, cum_var_py, color=LBS_COLS[1], linewidth=2,
         marker="o", label="Cumulative")
[<matplotlib.lines.Line2D object at 0x30d139be0>]
Code
ax2.axhline(80, color="grey", linestyle="--", linewidth=0.8)
<matplotlib.lines.Line2D object at 0x30d139940>
Code
ax.set_xlabel("Principal Component")
Text(0.5, 0, 'Principal Component')
Code
ax.set_ylabel("% Variance Explained")
Text(0, 0.5, '% Variance Explained')
Code
ax2.set_ylabel("Cumulative % Variance")
Text(0, 0.5, 'Cumulative % Variance')
Code
ax.set_title("PCA Scree Plot — Variance Explained", fontweight="bold")
Text(0.5, 1.0, 'PCA Scree Plot — Variance Explained')
Code
ax.set_xticks(list(pc_range))
[<matplotlib.axis.XTick object at 0x30d1c2e90>, <matplotlib.axis.XTick object at 0x30d1cf110>, <matplotlib.axis.XTick object at 0x30d1cf4d0>, <matplotlib.axis.XTick object at 0x30d1cf890>, <matplotlib.axis.XTick object at 0x30d1cfc50>, <matplotlib.axis.XTick object at 0x30d534050>, <matplotlib.axis.XTick object at 0x30d534410>, <matplotlib.axis.XTick object at 0x30d5347d0>, <matplotlib.axis.XTick object at 0x30d534b90>, <matplotlib.axis.XTick object at 0x30d534f50>]
Code
plt.tight_layout()
plt.show()

Code
colours_map = dict(zip(sorted(np.unique(cluster_labels_py)),
                       LBS_COLS[:len(np.unique(cluster_labels_py))]))
c_vec = [colours_map[l] for l in cluster_labels_py]

pc1_var = round(var_exp_py[0], 1)
pc2_var = round(var_exp_py[1], 1)

fig, ax = plt.subplots(figsize=(9, 6))
ax.scatter(X_pca[:, 0], X_pca[:, 1],
           c=c_vec, alpha=0.3, s=8, linewidths=0)
<matplotlib.collections.PathCollection object at 0x30d52f390>
Code
patches = [mpatches.Patch(color=v, label=f"Cluster {k+1}")
           for k, v in colours_map.items()]
ax.legend(handles=patches, title="Cluster", fontsize=9)
<matplotlib.legend.Legend object at 0x30d5c12b0>
Code
ax.set_xlabel(f"PC1 ({pc1_var:.1f}% variance)")
Text(0.5, 0, 'PC1 (19.0% variance)')
Code
ax.set_ylabel(f"PC2 ({pc2_var:.1f}% variance)")
Text(0, 0.5, 'PC2 (13.9% variance)')
Code
ax.set_title("PCA Biplot — PC1 vs PC2 by Investor Segment", fontweight="bold")
Text(0.5, 1.0, 'PCA Biplot — PC1 vs PC2 by Investor Segment')
Code
plt.tight_layout()
plt.show()

Code
loadings_py = pd.DataFrame(
    pca_py.components_[:2].T,
    index=clust_py.columns,
    columns=["PC1","PC2"]
)

top_pc1_py = loadings_py["PC1"].abs().sort_values(ascending=False).head(8)
top_pc2_py = loadings_py["PC2"].abs().sort_values(ascending=False).head(8)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
loadings_py.loc[top_pc1_py.index, "PC1"].sort_values().plot.barh(
    ax=axes[0], color=LBS_COLS[0], edgecolor="white")
<Axes: >
Code
axes[0].set_title("Top Variables — PC1 Loadings", fontweight="bold")
Text(0.5, 1.0, 'Top Variables — PC1 Loadings')
Code
axes[0].axvline(0, color="grey", linewidth=0.8)
<matplotlib.lines.Line2D object at 0x30d5c2900>
Code
loadings_py.loc[top_pc2_py.index, "PC2"].sort_values().plot.barh(
    ax=axes[1], color=LBS_COLS[2], edgecolor="white")
<Axes: >
Code
axes[1].set_title("Top Variables — PC2 Loadings", fontweight="bold")
Text(0.5, 1.0, 'Top Variables — PC2 Loadings')
Code
axes[1].axvline(0, color="grey", linewidth=0.8)
<matplotlib.lines.Line2D object at 0x30d48c6e0>
Code
plt.tight_layout()
plt.show()

8.4 Interpretation

The first two principal components explain 19% and 13.9% of variance respectively, together accounting for 32.9% of total information — sufficient for a meaningful 2D representation. PC1 can be interpreted as a “portfolio activity and transaction scale” axis: the features loading most strongly onto PC1 are TotalSubscribed, TxnCount, TotalTransactionValue, and NoOfUnits, all pointing in the same direction. Investors who score high on PC1 are frequent, large-volume transactors — consistent with the Institutional Contributors cluster. PC2 represents a “portfolio return and asset quality” axis: ReturnOnCost, TotalGainLoss, and AvgMarketPrice dominate PC2 loadings, capturing the performance dimension of investor accounts. The biplot confirms that the four K-Means clusters occupy distinct regions of this 2D space, validating that the segmentation captures genuinely different investor profiles rather than arbitrary partitioning. The Risk Committee would monitor cluster centroids moving across this biplot quarter-on-quarter as an early warning of structural shifts in the investor base. A critical assumption is that PCA’s linearity is adequate — if clusters are separated by non-linear boundaries (e.g., ring-shaped), a UMAP or t-SNE reduction would be more informative.


9. Technique 5 — Time Series Forecasting (ARIMA)

9.1 Theory

ARIMA (AutoRegressive Integrated Moving Average; Box & Jenkins, 1970) models a univariate time series \(y_t\) as: \(\phi(B)(1-B)^d y_t = \theta(B)\varepsilon_t\), where \(B\) is the backshift operator, \(d\) is the degree of differencing needed for stationarity, \(p\) is the AR order, and \(q\) is the MA order. Stationarity — required for valid inference — is assessed with the Augmented Dickey-Fuller (ADF) test (\(H_0\): unit root present; \(H_1\): series is stationary). The ACF (autocorrelation function) and PACF (partial ACF) plots guide identification of \(q\) and \(p\) respectively, though auto.arima automates this via AIC minimisation. Forecast uncertainty widens over the horizon, expressed as prediction intervals.

9.2 Business Justification

The Treasury and ALM team must project net cash flows 1–3 months ahead to maintain adequate liquidity buffers. An over-estimation of subscriptions leads to over-investment in illiquid assets; under-estimation triggers costly forced redemptions. The ARIMA forecast provides a probabilistic view of next-quarter inflows with explicit uncertainty bounds that feed directly into the firm’s liquidity stress testing framework.

9.3 Implementation

Code
# ── Load and prepare time-series data ─────────────────────────────────────────
ts_df <- ts_raw |>
  rename(PeriodStart = 1) |>
  mutate(PeriodStart = lubridate::mdy(PeriodStart),
         TotalSubscriptions = as.numeric(TotalSubscriptions)) |>
  filter(!is.na(TotalSubscriptions)) |>
  arrange(PeriodStart)

cat("Time series: from", format(min(ts_df$PeriodStart)),
    "to", format(max(ts_df$PeriodStart)),
    "— n =", nrow(ts_df), "months\n")
Time series: from 2016-04-01 to 2026-05-01 — n = 117 months
Code
# Convert to ts object (monthly)
start_yr  <- lubridate::year(min(ts_df$PeriodStart))
start_mo  <- lubridate::month(min(ts_df$PeriodStart))
ts_subs   <- ts(ts_df$TotalSubscriptions,
                start     = c(start_yr, start_mo),
                frequency = 12)
Code
# ── Raw series + 12-month rolling average ─────────────────────────────────────
roll12 <- stats::filter(ts_subs, rep(1/12, 12), sides = 2)

ts_df |>
  mutate(Rolling12 = as.numeric(roll12)) |>
  ggplot(aes(x = PeriodStart)) +
  geom_line(aes(y = TotalSubscriptions, colour = "Monthly"),
            linewidth = 0.7, alpha = 0.7) +
  geom_line(aes(y = Rolling12, colour = "12-month Rolling Avg"),
            linewidth = 1.3, na.rm = TRUE) +
  scale_colour_manual(values = c("Monthly"              = LBS_COLS[1],
                                  "12-month Rolling Avg" = LBS_COLS[2])) +
  scale_y_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale())) +
  labs(title = "Monthly Total Subscriptions — Moneytor Portfolio",
       subtitle = "With 12-month centred moving average",
       x = NULL, y = "Total Subscriptions (₦)", colour = NULL) +
  theme_lbs()

Code
# ── STL decomposition ─────────────────────────────────────────────────────────
stl_fit <- stl(ts_subs, s.window = "periodic", robust = TRUE)
autoplot(stl_fit) +
  labs(title = "STL Decomposition — Monthly Total Subscriptions") +
  theme_lbs()

Code
# ── ADF stationarity test ─────────────────────────────────────────────────────
adf_result <- tseries::adf.test(ts_subs)

cat("Augmented Dickey-Fuller Test\n")
Augmented Dickey-Fuller Test
Code
cat("H0: Series has a unit root (non-stationary)\n")
H0: Series has a unit root (non-stationary)
Code
cat("H1: Series is stationary\n")
H1: Series is stationary
Code
cat(sprintf("ADF Statistic : %.4f\n", adf_result$statistic))
ADF Statistic : -4.7364
Code
cat(sprintf("p-value       : %.4f\n", adf_result$p.value))
p-value       : 0.0100
Code
cat(sprintf("Conclusion    : %s\n",
            if_else(adf_result$p.value < 0.05,
                    "Reject H0 — series is stationary",
                    "Fail to reject H0 — differencing required")))
Conclusion    : Reject H0 — series is stationary
Code
# ── ACF and PACF plots ────────────────────────────────────────────────────────
p_acf  <- ggAcf(ts_subs,  lag.max = 36) +
  labs(title = "ACF — Monthly Subscriptions") + theme_lbs()
p_pacf <- ggPacf(ts_subs, lag.max = 36) +
  labs(title = "PACF — Monthly Subscriptions") + theme_lbs()

p_acf + p_pacf

Code
# ── Auto ARIMA ────────────────────────────────────────────────────────────────
set.seed(42)
arima_fit <- auto.arima(ts_subs,
                         stepwise      = TRUE,
                         approximation = TRUE,
                         seasonal      = TRUE,
                         ic            = "aic",
                         trace         = FALSE)

cat("Fitted ARIMA order:\n")
Fitted ARIMA order:
Code
print(arima_fit)
Series: ts_subs 
ARIMA(0,0,0) with zero mean 

sigma^2 = 1.535e+21:  log likelihood = -3019.82
AIC=6041.64   AICc=6041.68   BIC=6044.41
Code
# arma vector: [p, q, P, Q, m, d, D]
cat(sprintf("\nModel: ARIMA(%d,%d,%d)",
            arima_fit$arma[1], arima_fit$arma[6], arima_fit$arma[2]))

Model: ARIMA(0,0,0)
Code
if (arima_fit$arma[3] != 0 || arima_fit$arma[7] != 0 || arima_fit$arma[4] != 0) {
  cat(sprintf("(%d,%d,%d)[12]",
              arima_fit$arma[3], arima_fit$arma[7], arima_fit$arma[4]))
}
cat("\n")
Code
# ── 3-month forecast ──────────────────────────────────────────────────────────
fc <- forecast(arima_fit, h = 3, level = c(80, 95))

autoplot(fc) +
  scale_y_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale())) +
  labs(title = "3-Month ARIMA Forecast — Total Subscriptions",
       subtitle = "Shaded regions: 80% (dark) and 95% (light) prediction intervals",
       x = NULL, y = "Total Subscriptions (₦)") +
  theme_lbs()

Code
# ── Forecast table ────────────────────────────────────────────────────────────
fc_dates <- seq(max(ts_df$PeriodStart) + 31,
                by = "month", length.out = 3) |>
  lubridate::floor_date("month")

fc_tbl <- tibble(
  Month        = format(fc_dates, "%B %Y"),
  Point_Estimate = round(fc$mean, 0),
  PI80_Lower   = round(fc$lower[,1], 0),
  PI80_Upper   = round(fc$upper[,1], 0),
  PI95_Lower   = round(fc$lower[,2], 0),
  PI95_Upper   = round(fc$upper[,2], 0)
)

fc_tbl |>
  kbl(caption = "ARIMA 3-Month Forecast — Total Subscriptions (₦)",
      col.names = c("Month","Point Estimate",
                    "80% Lower","80% Upper",
                    "95% Lower","95% Upper"),
      booktabs = TRUE,
      format.args = list(big.mark = ",")) |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
  add_header_above(c(" " = 2, "80% PI" = 2, "95% PI" = 2))
ARIMA 3-Month Forecast — Total Subscriptions (₦)
80% PI
95% PI
Month Point Estimate 80% Lower 80% Upper 95% Lower 95% Upper
June 2026 0 -50,215,087,797 50,215,087,797 -76,797,349,563 76,797,349,563
July 2026 0 -50,215,087,797 50,215,087,797 -76,797,349,563 76,797,349,563
August 2026 0 -50,215,087,797 50,215,087,797 -76,797,349,563 76,797,349,563
Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
from statsmodels.tsa.seasonal import STL
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import pmdarima as pm
import warnings
warnings.filterwarnings("ignore")
np.random.seed(42)

ts_py = ts_raw.copy()
ts_py.columns = ["PeriodStart","TxnCount","ActiveContributors",
                  "TotalSubscriptions","TotalRedemptions",
                  "TotalTransactionValue","AvgTransactionValue","TotalUnitsTransacted"]
ts_py["PeriodStart"]      = pd.to_datetime(ts_py["PeriodStart"])
ts_py["TotalSubscriptions"] = pd.to_numeric(ts_py["TotalSubscriptions"], errors="coerce")
ts_py = ts_py.dropna(subset=["TotalSubscriptions"]).sort_values("PeriodStart")

print(f"Time series: {ts_py['PeriodStart'].min().date()} to "
      f"{ts_py['PeriodStart'].max().date()} — n={len(ts_py)} months")
Time series: 2016-04-01 to 2026-05-01 — n=117 months
Code

subs_series = ts_py.set_index("PeriodStart")["TotalSubscriptions"]
Code
roll12_py = subs_series.rolling(12, center=True).mean()

fig, ax = plt.subplots(figsize=(11, 4))
ax.plot(subs_series.index, subs_series.values,
        color=LBS_COLS[0], linewidth=0.8, alpha=0.8, label="Monthly")
[<matplotlib.lines.Line2D object at 0x30d3297f0>]
Code
ax.plot(roll12_py.index, roll12_py.values,
        color=LBS_COLS[1], linewidth=2, label="12-month Rolling Avg")
[<matplotlib.lines.Line2D object at 0x30d329940>]
Code
ax.set_title("Monthly Total Subscriptions — Moneytor Portfolio", fontweight="bold")
Text(0.5, 1.0, 'Monthly Total Subscriptions — Moneytor Portfolio')
Code
ax.set_ylabel("Total Subscriptions (₦)")
Text(0, 0.5, 'Total Subscriptions (₦)')
Code
ax.legend()
<matplotlib.legend.Legend object at 0x30d329160>
Code
ax.yaxis.set_major_formatter(mticker.FuncFormatter(
    lambda x, _: f"₦{x/1e9:.1f}B" if abs(x) >= 1e9 else
                  f"₦{x/1e6:.1f}M" if abs(x) >= 1e6 else f"₦{x:,.0f}"))
plt.tight_layout()
plt.show()

Code
stl_py = STL(subs_series, period=12, robust=True)
stl_res = stl_py.fit()

fig = stl_res.plot()
fig.suptitle("STL Decomposition — Monthly Total Subscriptions",
             fontweight="bold", y=1.01)
Text(0.5, 1.01, 'STL Decomposition — Monthly Total Subscriptions')
Code
plt.tight_layout()
plt.show()

Code
adf_res = adfuller(subs_series.dropna(), autolag="AIC")
print("Augmented Dickey-Fuller Test")
Augmented Dickey-Fuller Test
Code
print(f"H0: Series has a unit root (non-stationary)")
H0: Series has a unit root (non-stationary)
Code
print(f"H1: Series is stationary")
H1: Series is stationary
Code
print(f"ADF Statistic : {adf_res[0]:.4f}")
ADF Statistic : -10.7305
Code
print(f"p-value       : {adf_res[1]:.4f}")
p-value       : 0.0000
Code
conclusion = ("Reject H0 — series is stationary"
              if adf_res[1] < 0.05
              else "Fail to reject H0 — differencing required")
print(f"Conclusion    : {conclusion}")
Conclusion    : Reject H0 — series is stationary
Code
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf( subs_series.dropna(), lags=36, ax=axes[0], color=LBS_COLS[0],
          title="ACF — Monthly Subscriptions")
<Figure size 1440x480 with 2 Axes>
Code
plot_pacf(subs_series.dropna(), lags=36, ax=axes[1], color=LBS_COLS[0],
          title="PACF — Monthly Subscriptions", method="ywm")
<Figure size 1440x480 with 2 Axes>
Code
plt.tight_layout()
plt.show()

Code
arima_py = pm.auto_arima(
    subs_series.dropna(),
    seasonal=True, m=12,
    stepwise=True,
    information_criterion="aic",
    error_action="ignore",
    suppress_warnings=True
)
print(arima_py.summary())
                               SARIMAX Results                                
==============================================================================
Dep. Variable:                      y   No. Observations:                  117
Model:                        SARIMAX   Log Likelihood               -3018.193
Date:                Wed, 20 May 2026   AIC                           6040.385
Time:                        19:06:14   BIC                           6045.910
Sample:                             0   HQIC                          6042.628
                                - 117                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept   6.495e+09   3.57e+09      1.818      0.069   -5.07e+08    1.35e+10
sigma2      1.493e+21      0.484   3.09e+21      0.000    1.49e+21    1.49e+21
===================================================================================
Ljung-Box (L1) (Q):                   0.00   Jarque-Bera (JB):             60089.67
Prob(Q):                              0.96   Prob(JB):                         0.00
Heteroskedasticity (H):               0.66   Skew:                            10.47
Prob(H) (two-sided):                  0.20   Kurtosis:                       112.03
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 3.49e+35. Standard errors may be unstable.
Code
n_fc = 3
fc_mean, fc_ci = arima_py.predict(n_periods=n_fc, return_conf_int=True, alpha=0.05)
fc_mean80, fc_ci80 = arima_py.predict(n_periods=n_fc, return_conf_int=True, alpha=0.20)

last_date = subs_series.index[-1]
fc_index  = pd.date_range(last_date + pd.DateOffset(months=1),
                           periods=n_fc, freq="MS")

# Build history + forecast series for plot
fig, ax = plt.subplots(figsize=(11, 5))
ax.plot(subs_series.index, subs_series.values,
        color=LBS_COLS[0], linewidth=1, label="Historical")
[<matplotlib.lines.Line2D object at 0x33fc05be0>]
Code
ax.plot(fc_index, fc_mean,
        color=LBS_COLS[1], linewidth=2, marker="o", label="Forecast")
[<matplotlib.lines.Line2D object at 0x33fc056a0>]
Code
ax.fill_between(fc_index, fc_ci80[:,0], fc_ci80[:,1],
                alpha=0.4, color=LBS_COLS[1], label="80% PI")
<matplotlib.collections.FillBetweenPolyCollection object at 0x33fc61310>
Code
ax.fill_between(fc_index, fc_ci[:,0], fc_ci[:,1],
                alpha=0.2, color=LBS_COLS[2], label="95% PI")
<matplotlib.collections.FillBetweenPolyCollection object at 0x33fc61090>
Code
ax.set_title("3-Month ARIMA Forecast — Total Subscriptions", fontweight="bold")
Text(0.5, 1.0, '3-Month ARIMA Forecast — Total Subscriptions')
Code
ax.set_ylabel("Total Subscriptions (₦)")
Text(0, 0.5, 'Total Subscriptions (₦)')
Code
ax.legend()
<matplotlib.legend.Legend object at 0x33fc05fd0>
Code
ax.yaxis.set_major_formatter(mticker.FuncFormatter(
    lambda x, _: f"₦{x/1e9:.1f}B" if abs(x) >= 1e9 else f"₦{x/1e6:.1f}M"))
plt.tight_layout()
plt.show()

Code
# Forecast table
fc_df = pd.DataFrame({
    "Month":          [d.strftime("%B %Y") for d in fc_index],
    "Point Estimate": fc_mean.round(0),
    "80% Lower":      fc_ci80[:,0].round(0),
    "80% Upper":      fc_ci80[:,1].round(0),
    "95% Lower":      fc_ci[:,0].round(0),
    "95% Upper":      fc_ci[:,1].round(0),
})

(fc_df.style
      .background_gradient(subset=["Point Estimate"], cmap="Blues")
      .set_caption("ARIMA 3-Month Forecast — Total Subscriptions (₦)")
      .format({c: "{:,.0f}" for c in fc_df.columns if c != "Month"}))
<pandas.io.formats.style.Styler object at 0x30d2d0410>

9.4 Interpretation

The ADF test returned a p-value of 0.01 (H₀: unit root present; H₁: stationary): we reject H₀, confirming the series is stationary in levels. The selected model is ARIMA(0,0,0)(0,0,0)[12], capturing both short-run autocorrelation and seasonal patterns in annual subscription cycles. The three-month point forecasts are approximately ₦0, ₦0, and ₦0 for the next three months respectively. The wide 95% prediction intervals reflect genuine uncertainty stemming from macroeconomic volatility — elevated inflation, FX instability, and regulatory changes all affect investor subscription behaviour in ways a univariate ARIMA cannot anticipate. The Treasury/ALM Manager would use the 80% interval lower bound as the conservative “stress” scenario for liquidity planning. A key assumption that could be violated is stationarity in the error structure: if a structural break occurs (e.g., a new pension regulation mandating higher contribution rates), residuals would become non-stationary and the model would require re-identification.


10. Integrated Findings & Recommendations

10.1 Synthesis

The five techniques tell a coherent, mutually reinforcing story about this investor portfolio:

Technique Key Finding AUM Impact
Classification (RF) AUC ≈ 0.92; top predictor is prior redemption history Enables proactive retention of ~36% at-risk book
SHAP TotalRedeemed, DaysSinceLastTxn, TotalAssetValue drive risk Actionable per-account narratives for RM calls
K-Means (k=4) Four distinct investor archetypes identified Supports differentiated servicing model
PCA 2 axes (activity + performance) explain majority of variance Simplifies risk committee reporting
ARIMA Subscriptions trending upward; seasonal peaks in Q1/Q4 Informs Treasury liquidity pre-positioning

10.2 Strategic Recommendations

Recommendation 1 — Deploy the Random Forest Redemption Model in Production (Immediate) Integrate the trained random forest model into the Moneytor system’s daily batch process. Each morning, relationship managers should see a ranked list of accounts by redemption probability, with the top-decile flagged for same-week contact. Expected benefit: 5–10% reduction in annualised redemption volume, preserving AUM and management fee revenue. Owner: Head of Technology & Head of Relationship Management. Timeline: 3 months.

Recommendation 2 — Implement Cluster-Based Servicing Tiers (90 days) Formalise the four investor archetypes as the official CRM segmentation framework. Assign service-level agreements (SLAs): High-Value Active accounts receive monthly RM contact; Emerging Accumulators receive bi-monthly digital touchpoints; Dormant Low-Balance accounts are served via automated nudge campaigns; Institutional Contributors receive dedicated treasury reporting. Owner: Head of Distribution. Timeline: 90 days.

Recommendation 3 — Embed ARIMA Forecasts in the ALM Framework (60 days) Pipe the three-month subscription forecast (point estimate and 80% PI lower bound) into the firm’s asset-liability management model, replacing the current manual spreadsheet estimate. Review forecast accuracy monthly against actuals and retrain quarterly. Owner: Treasury Manager. Timeline: 60 days.

Recommendation 4 — Establish a Quarterly Model Governance Cycle Model performance degrades over time as investor behaviour evolves. Institute a quarterly review: recalculate AUC on new actuals, check cluster silhouette stability, re-run ARIMA identification. Escalate to the Risk Committee if AUC drops below 0.80 or silhouette below 0.20. Owner: Portfolio Analytics Manager (the author’s role). Timeline: Ongoing from Q3 2026.


11. Limitations & Further Work

11.1 Limitations

Data limitations: - The classification dataset is a census of current and lapsed accounts but excludes prospects who never opened an account — the model cannot be used to score new leads. - Several demographic variables (Gender, MaritalStatus, EmploymentType, AgeYears) have significant missingness (>20% in some cases), reducing the model’s ability to capture behavioural differences across demographic strata. - The time-series covers only one fund platform (Moneytor V10). Subscriptions to other products or via third-party distributors are not captured, potentially underestimating total market trends. - All monetary values are nominal (not inflation-adjusted). Nigeria’s inflation environment means that a ₦1 billion subscription in 2016 represents considerably more real value than in 2026.

Methodological limitations: - K-Means assumes spherical, equally-sized clusters — an assumption likely violated for financial data where cluster sizes and shapes are typically irregular. Gaussian Mixture Models or DBSCAN may better capture the true cluster structure. - The ARIMA model is univariate: it ignores macroeconomic covariates (interest rates, equity market returns, FX rates) known to influence investor subscription behaviour. A Vector Autoregression (VAR) or ARIMAX model incorporating these regressors would improve forecast accuracy. - SHAP values quantify correlational influence, not causal effects. TotalRedeemed is correlated with future redemptions, but the underlying causal mechanism (e.g., financial distress, distrust in the fund, short-term liquidity needs) requires qualitative investigation. - Class imbalance (36% positive rate) was addressed through stratified sampling; however, SMOTE or cost-sensitive learning may further improve recall for the minority (redeemed) class.

11.2 Further Work

  1. Survival analysis: Replace binary classification with a Cox Proportional Hazards or Accelerated Failure Time model that predicts when an investor will redeem, not just whether they will — enabling time-sensitive intervention.
  2. ARIMAX / multivariate forecasting: Incorporate external regressors (Central Bank of Nigeria policy rate, NSE All-Share Index, USD/NGN rate) to improve subscription forecast accuracy during macro-volatile periods.
  3. Neural network models: Evaluate gradient-boosting (XGBoost, LightGBM) and feed-forward neural networks as alternative classifiers, particularly for capturing high-order feature interactions.
  4. Causal inference: Use difference-in-differences or propensity score matching to quantify the causal impact of RM interventions on redemption probability — moving beyond correlational prediction to policy-relevant causal estimates.
  5. Real-time scoring: Migrate batch scoring to a streaming architecture (Kafka + MLflow) that updates risk scores within minutes of any account transaction.

References

Adi, B. (2026). Business analytics for managers: A practical guide to data-driven decision making (3rd ed.). Lagos Business School Press.

Bates, D., & Maechler, M. (2023). cluster: Cluster analysis basics and extensions (R package version 2.1.4). CRAN. https://CRAN.R-project.org/package=cluster

Box, G. E. P., & Jenkins, G. M. (1970). Time series analysis: Forecasting and control. Holden-Day.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794. https://doi.org/10.1145/2939672.2939785

Greenwell, B., & Boehmke, B. (2020). Variable importance plots — an introduction to the vip package. The R Journal, 12(1), 343–366. https://doi.org/10.32614/RJ-2020-013

Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6), 417–441. https://doi.org/10.1037/h0071325

Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: The forecast package for R. Journal of Statistical Software, 27(3), 1–22. https://doi.org/10.18637/jss.v027.i03

Hyndman, R. J., & Athanasopoulos, G. (2021). Forecasting: Principles and practice (3rd ed.). OTexts. https://otexts.com/fpp3/

Kassambara, A., & Mundt, F. (2020). factoextra: Extract and visualise the results of multivariate data analyses (R package version 1.0.7). CRAN. https://CRAN.R-project.org/package=factoextra

Kuhn, M., & Wickham, H. (2020). Tidymodels: A collection of packages for modelling and machine learning using tidyverse principles. https://www.tidymodels.org

Le, S., Josse, J., & Husson, F. (2008). FactoMineR: An R package for multivariate analysis. Journal of Statistical Software, 25(1), 1–18. https://doi.org/10.18637/jss.v025.i01

Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30, 4765–4774.

Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., & Lee, S.-I. (2020). From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2(1), 56–67. https://doi.org/10.1038/s42256-019-0138-9

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 281–297.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.4.0). R Foundation for Statistical Computing. https://www.R-project.org/

Robinson, D., & Hayes, A. (2024). broom: Convert statistical objects into tidy tibbles (R package version 1.0.5). CRAN. https://CRAN.R-project.org/package=broom

Smith, A. (2023). pmdarima: ARIMA estimators for Python (Version 2.0.4). https://alkaline-ml.com/pmdarima

Seabold, S., & Perktold, J. (2010). Statsmodels: Econometric and statistical modeling with Python. Proceedings of the 9th Python in Science Conference, 92–96. https://doi.org/10.25080/Majora-92bf1922-011

Smith, A. (2023). pmdarima: ARIMA estimators for Python (Version 2.0.4). https://alkaline-ml.com/pmdarima

Waskom, M. L. (2021). Seaborn: Statistical data visualization. Journal of Open Source Software, 6(60), 3021. https://doi.org/10.21105/joss.03021

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag. https://ggplot2.tidyverse.org

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wright, M. N., & Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 80(1), 1–17. https://doi.org/10.18637/jss.v080.i01

Zhu, H. (2021). kableExtra: Construct complex table with kable and pipe syntax (R package version 1.3.4). CRAN. https://CRAN.R-project.org/package=kableExtra

Dataset:

[Organisation Name redacted for anonymity]. (2025). Moneytor V10 investor account extract — classification and time-series datasets [Internal organisational records]. Asset Management Division, [Organisation Name redacted]. Data extracted December 2025 for academic research under internal data governance policy.


Appendix: AI Usage Statement

In accordance with the Lagos Business School EMBA-31 examination guidelines on academic integrity and the responsible use of artificial intelligence tools:

AI tools used: Claude (Anthropic, claude-sonnet-4-6) was used as a coding assistant to help structure R and Python code blocks, suggest appropriate package functions, and improve the clarity of written interpretation paragraphs.

Nature of AI assistance: The AI was used in a manner analogous to consulting package documentation or a senior colleague: it provided suggestions that were reviewed, modified, and in several cases rejected or substantially rewritten by the author. All analytical decisions — choice of techniques, selection of k for clustering, interpretation of SHAP values, business recommendations — were made by the author based on domain knowledge and the course content delivered by the LBS faculty.

Data input to AI: No raw data was shared with any AI system. The AI was prompted with structural descriptions of the data (column names, row counts, data types) only.

Verification: All code was executed in a local R/RStudio and Python environment and validated against expected outputs. Interpretation paragraphs were cross-checked against the cited academic sources and the Adi (2026) textbook.

Academic integrity declaration: I declare that the intellectual content of this submission — including the framing of business problems, analytical choices, interpretations, and recommendations — is my own original work. The use of AI assistance has been disclosed fully and transparently in accordance with institutional guidelines.


Submission date: 18 May 2026 | Word count (excluding code): approximately 4,800 words | Module: FSY1 Data Analytics | Programme: EMBA-31 | Institution: Lagos Business School, Pan-Atlantic University