1. Executive Summary

This case study applies five machine learning and statistical techniques to investor data drawn from Moneytor V10, the portfolio management system of a Nigerian life insurance and asset management organisation. Using a cross-sectional dataset of 9,851 investor accounts and a monthly time-series spanning 2016–2026, the analysis addresses two core business problems: identifying which investors are at risk of redeeming (liquidating) their fund units, and uncovering natural investor segments to support targeted relationship management.

A Random Forest classifier achieved an out-of-sample AUC of approximately 0.92, substantially outperforming logistic regression (AUC ≈ 0.79). SHAP analysis confirmed that prior redemption behaviour (TotalRedeemed), account inactivity (DaysSinceLastTxn), and asset value are the dominant drivers of redemption risk. K-Means clustering (k = 4) revealed four economically meaningful investor archetypes — High-Value Active, Emerging Accumulators, Dormant Low-Balance, and Institutional Contributors — whose distinct needs should shape product design and servicing intensity. PCA validated that two latent dimensions (transaction activity and portfolio size) explain the majority of variation. Finally, ARIMA time-series forecasting projects a continued upward trend in monthly subscriptions over the next three months, with wide prediction intervals that reflect post-pandemic volatility. Together, these findings support a data-driven retention strategy and a segmented growth plan for the distribution team.

2. Professional Disclosure

Job Title: Head Information Technology, Asset Management Division

The author holds a head information technology role within an asset management firm regulated by the Securities and Exchange Commission (SEC) Nigeria. In this capacity, five decision contexts motivated the techniques employed in this study:

Classification (redemption prediction): The fund operations team tracks redemption risk daily. A predictive model allows the relationship management desk to flag accounts with a high probability of redemption in the coming quarter, enabling proactive retention calls and targeted reinvestment incentives before units are liquidated.

SHAP explainability: Nigeria’s investment advisory regulations require that advice be explainable and in the client’s best interest. SHAP values allow the compliance team to articulate, for any flagged account, exactly which factors drove the risk score — satisfying audit requirements and supporting adviser coaching.

Customer segmentation (K-Means): The distribution team allocates relationship managers across roughly ten thousand accounts. Segment labels derived from clustering provide a principled basis for tiering service intensity: institutional and high-value accounts receive dedicated RMs, while dormant low-balance accounts are served through digital channels and automated nudges.

Dimensionality reduction (PCA): The risk committee uses a condensed scorecard, not raw feature tables, when reviewing the portfolio book quarterly. PCA-derived principal component scores reduce 25+ features to two interpretable axes (activity and wealth), making board-level reporting concise and reproducible.

Time-series forecasting (ARIMA): Treasury operations must pre-position liquidity to meet net redemptions each month. A three-month subscription forecast feeds the firm’s asset-liability management (ALM) model, reducing the risk of forced asset sales to meet unexpected redemption waves.

3. Data Collection & Sampling

Source: IBS Solution - Moneytor V10 — a proprietary portfolio management and unit-registry system used by the organisation for recording investor transactions, valuations, and account lifecycle events.

Data extraction method: A structured SQL extract was executed against the production database replica (read-only), producing two flat files: a cross-sectional investor snapshot (classification_dataset.csv) and a monthly aggregated transaction summary (timeseries_dataset.csv). No real-time API calls were made; the extract represents a point-in-time snapshot.

Sampling frame and population: The classification dataset covers all active and lapsed accounts opened between the system inception date and December 2025 — 9,851 investor records after removing test/internal accounts. No sub-sampling was applied; this is effectively a census of the live book. The time-series dataset contains one row per calendar month from April 2016 to March 2026 (approximately 120 months), capturing the full operational history of the fund.

Time period: Cross-sectional snapshot as at 31 December 2025; time-series from April 2016 to March 2026.

Ethical statement: All personally identifiable information (PII) — including investor names, national identity numbers, bank account details, and contact information — was removed at source before the SQL extract was produced. CustID values are surrogate keys generated by the system and cannot be re-linked to natural persons without access to the production key-mapping table, which was not extracted. Gender and MaritalStatus fields, where present, are used solely as analytical features and are not used to make individual-level credit or investment decisions that could constitute unlawful discrimination. This study complies with Nigeria’s Data Protection Act 2023 and the organisation’s internal data governance policy. No data has been shared externally or stored outside encrypted institutional storage.

4. Data Description & Exploratory Analysis

4.1 Load and Inspect Data

Code

# ── Load datasets ─────────────────────────────────────────────────────────────
clf_raw <- read_csv("classification_dataset.csv",
                    na = c("", "NA", "NULL", "N/A", " "),
                    show_col_types = FALSE)

ts_raw  <- read_csv("timeseries_dataset.csv",
                    na = c("", "NA", "NULL", "N/A", " "),
                    show_col_types = FALSE)

# Rename to match spec names for clarity
names(clf_raw)[1:29] <- c(
  "CustID","Gender","MaritalStatus","EmploymentType","AnnualIncome",
  "AgeYears","ID_State","ID_Occupation","ID_RiskLevel","SensitivityLevel",
  "ID_PortfolioContributorType","ReInvestDividend","NoOfUnits","InitalAmount",
  "AccountAgeMonths","TxnCount","TotalSubscribed","TotalRedeemed",
  "AvgTxnAmount","DaysSinceLastTxn","TxnSpanMonths","DistinctTxnTypes",
  "TotalAssetValue","TotalCostOfAsset","TotalGainLoss","AvgMarketPrice",
  "DistinctAssetClasses","ReturnOnCost","HasRedeemed"
)

cat("Classification dataset:", nrow(clf_raw), "rows,", ncol(clf_raw), "columns\n")

Classification dataset: 9852 rows, 29 columns

Code

cat("Time-series dataset   :", nrow(ts_raw),  "rows,", ncol(ts_raw),  "columns\n")

Time-series dataset   : 117 rows, 8 columns

Code

# Dimensions table
tibble(
  Dataset = c("Classification", "Time-Series"),
  Rows    = c(nrow(clf_raw), nrow(ts_raw)),
  Columns = c(ncol(clf_raw), ncol(ts_raw))
) |>
  kbl(caption = "Dataset Dimensions", booktabs = TRUE) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)

Dataset Dimensions
Dataset	Rows	Columns
Classification	9852	29
Time-Series	117	8

Code

# ── Summary statistics for numeric columns ────────────────────────────────────
num_vars <- clf_raw |>
  select(where(is.numeric)) |>
  select(-HasRedeemed)

summary_tbl <- num_vars |>
  pivot_longer(everything(), names_to = "Variable") |>
  group_by(Variable) |>
  summarise(
    N       = sum(!is.na(value)),
    Missing = sum(is.na(value)),
    Mean    = mean(value, na.rm = TRUE),
    SD      = sd(value,   na.rm = TRUE),
    Min     = min(value,  na.rm = TRUE),
    Median  = median(value, na.rm = TRUE),
    Max     = max(value,  na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(across(where(is.double), \(x) round(x, 2)))

summary_tbl |>
  kbl(caption = "Summary Statistics — Numeric Features (Classification Dataset)",
      booktabs = TRUE, format.args = list(big.mark = ",")) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                font_size = 11) |>
  scroll_box(height = "380px")

Summary Statistics — Numeric Features (Classification Dataset)
Variable	N	Missing	Mean	SD	Min	Median	Max
AccountAgeMonths	9,852	0	23.09	2.927000e+01	0	13.00	2.560000e+02
AgeYears	198	9,654	45.74	1.317000e+01	6	44.00	7.400000e+01
AnnualIncome	6,289	3,563	0.00	0.000000e+00	0	0.00	0.000000e+00
AvgMarketPrice	9,852	0	0.40	5.950000e+00	0	0.00	1.870000e+02
AvgTxnAmount	9,852	0	546,093.29	5.615258e+06	-128,000,000	0.00	3.519926e+08
DaysSinceLastTxn	8,486	1,366	247.06	5.092400e+02	0	47.00	2.926000e+03
DistinctAssetClasses	9,852	0	0.01	1.200000e-01	0	0.00	6.000000e+00
DistinctTxnTypes	9,852	0	1.33	7.800000e-01	0	1.00	4.000000e+00
ID_Occupation	6,289	3,563	0.00	3.000000e-02	0	0.00	2.000000e+00
ID_PortfolioContributorType	754	9,098	0.51	5.200000e-01	0	0.00	2.000000e+00
ID_RiskLevel	5,626	4,226	0.00	0.000000e+00	0	0.00	0.000000e+00
InitalAmount	6,289	3,563	3,773,489.19	7.287028e+07	0	0.00	4.793517e+09
NoOfUnits	6,289	3,563	0.00	0.000000e+00	0	0.00	0.000000e+00
ReInvestDividend	6,289	3,563	0.64	4.800000e-01	0	1.00	1.000000e+00
ReturnOnCost	74	9,778	1.13	1.400000e+00	0	1.03	1.294000e+01
TotalAssetValue	9,852	0	2,704,149,613.04	1.118909e+11	0	0.00	9.840240e+12
TotalCostOfAsset	9,852	0	2,728,668,237.15	1.133704e+11	0	0.00	9.989140e+12
TotalGainLoss	9,852	0	-46,045,789.67	7.431441e+09	-722,445,000,000	0.00	5.410906e+10
TotalRedeemed	9,852	0	-69,205,913.85	1.237106e+09	-82,319,092,311	0.00	0.000000e+00
TotalSubscribed	9,852	0	77,072,418.06	1.520187e+09	0	207,579.93	1.035000e+11
TxnCount	9,852	0	13.61	2.487000e+01	0	6.00	9.570000e+02
TxnSpanMonths	8,486	1,366	12.87	1.564000e+01	0	8.00	9.800000e+01

Code

# ── Missing value analysis ────────────────────────────────────────────────────
missing_tbl <- clf_raw |>
  summarise(across(everything(),
                   list(n_miss = \(x) sum(is.na(x)),
                        pct    = \(x) round(100 * mean(is.na(x)), 1)))) |>
  pivot_longer(everything(),
               names_to  = c("Variable", ".value"),
               names_sep = "_(?=[^_]+$)") |>
  filter(miss > 0) |>
  arrange(desc(miss))

missing_tbl |>
  kbl(caption = "Missing Values by Feature", booktabs = TRUE,
      col.names = c("Variable", "N Missing", "% Missing")) |>
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = FALSE) |>
  row_spec(which(missing_tbl$pct > 20), color = "white", background = "#D55E00")

Missing Values by Feature
Variable	N Missing	% Missing
Gender_n	9852	NA
MaritalStatus_n	9852	NA
EmploymentType_n	9852	NA
ReturnOnCost_n	9778	NA
AgeYears_n	9654	NA
ID_State_n	9226	NA
ID_PortfolioContributorType_n	9098	NA
SensitivityLevel_n	8412	NA
ID_RiskLevel_n	4226	NA
AnnualIncome_n	3563	NA
ID_Occupation_n	3563	NA
ReInvestDividend_n	3563	NA
NoOfUnits_n	3563	NA
InitalAmount_n	3563	NA
DaysSinceLastTxn_n	1366	NA
TxnSpanMonths_n	1366	NA

Code

# ── Target variable distribution ──────────────────────────────────────────────
clf_raw |>
  count(HasRedeemed) |>
  mutate(
    Label = if_else(HasRedeemed == 1, "Has Redeemed", "Has Not Redeemed"),
    Pct   = round(100 * n / sum(n), 1)
  ) |>
  ggplot(aes(x = Label, y = n, fill = Label)) +
  geom_col(width = 0.6, colour = "white") +
  geom_text(aes(label = paste0(Pct, "%\n(n = ", scales::comma(n), ")")),
            vjust = -0.3, size = 3.8) +
  scale_fill_manual(values = LBS_COLS[1:2]) +
  scale_y_continuous(labels = scales::comma, expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Target Variable Distribution",
       subtitle = "HasRedeemed — 36% positive rate indicates moderate class imbalance",
       x = NULL, y = "Count", fill = NULL) +
  theme_lbs(legend.position = "none")

Code

# ── Key numeric distributions ─────────────────────────────────────────────────
plot_vars <- c("AnnualIncome","TotalAssetValue","TxnCount",
               "DaysSinceLastTxn","AccountAgeMonths","ReturnOnCost")

clf_raw |>
  select(all_of(plot_vars), HasRedeemed) |>
  mutate(HasRedeemed = factor(HasRedeemed,
                              labels = c("Not Redeemed","Redeemed"))) |>
  pivot_longer(-HasRedeemed, names_to = "Variable") |>
  mutate(value = log1p(abs(value))) |>                  # log-scale for skew
  ggplot(aes(x = value, fill = HasRedeemed)) +
  geom_density(alpha = 0.55, colour = NA) +
  facet_wrap(~Variable, scales = "free", ncol = 3) +
  scale_fill_manual(values = LBS_COLS[1:2]) +
  labs(title = "Feature Distributions by Redemption Status (log1p scale)",
       x = "log1p(value)", y = "Density", fill = NULL) +
  theme_lbs()

Code

# ── Categorical feature profiles ──────────────────────────────────────────────
cat_vars <- c("Gender","MaritalStatus","EmploymentType","SensitivityLevel",
              "ReInvestDividend")

cat_plots <- map(cat_vars, \(v) {
  clf_raw |>
    filter(!is.na(.data[[v]])) |>
    mutate(HasRedeemed = factor(HasRedeemed,
                                levels = c(0, 1),
                                labels = c("Not Redeemed","Redeemed"))) |>
    count(.data[[v]], HasRedeemed) |>
    group_by(.data[[v]]) |>
    mutate(pct = n / sum(n)) |>
    ungroup() |>
    ggplot(aes(x = .data[[v]], y = pct, fill = HasRedeemed)) +
    geom_col(position = "fill", colour = "white", width = 0.7) +
    scale_y_continuous(labels = percent_format()) +
    scale_fill_manual(values = LBS_COLS[1:2]) +
    labs(title = v, x = NULL, y = NULL, fill = NULL) +
    theme_lbs(axis.text.x = element_text(angle = 30, hjust = 1),
              legend.position = "none")
})

wrap_plots(cat_plots, ncol = 3) +
  plot_annotation(
    title = "Redemption Rate by Categorical Feature",
    theme = theme(plot.title = element_text(face = "bold", size = 14))
  )

Code

# ── Correlation heatmap ───────────────────────────────────────────────────────
corr_mat <- clf_raw |>
  select(where(is.numeric)) |>
  cor(use = "pairwise.complete.obs")

# Plot top correlated pairs with HasRedeemed
corr_df <- corr_mat |>
  as.data.frame() |>
  rownames_to_column("Var1") |>
  pivot_longer(-Var1, names_to = "Var2", values_to = "r") |>
  filter(Var1 != Var2)

corr_df |>
  filter(Var2 == "HasRedeemed") |>
  arrange(desc(abs(r))) |>
  slice_head(n = 15) |>
  ggplot(aes(x = reorder(Var1, abs(r)), y = r,
             fill = r > 0)) +
  geom_col(width = 0.7, colour = "white") +
  coord_flip() +
  scale_fill_manual(values = LBS_COLS[1:2],
                    labels = c("Negative","Positive")) +
  labs(title = "Top 15 Features Correlated with HasRedeemed",
       x = NULL, y = "Pearson r", fill = "Direction") +
  theme_lbs()

Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

LBS_COLS = ["#0072B2","#E69F00","#009E73","#CC79A7",
            "#56B4E9","#D55E00","#F0E442","#999999"]

clf_raw = pd.read_csv("classification_dataset.csv", na_values=["","NA","NULL","N/A"," "])
ts_raw  = pd.read_csv("timeseries_dataset.csv", na_values=["","NA","NULL","N/A"," "])

clf_raw.columns = [
    "CustID","Gender","MaritalStatus","EmploymentType","AnnualIncome",
    "AgeYears","ID_State","ID_Occupation","ID_RiskLevel","SensitivityLevel",
    "ID_PortfolioContributorType","ReInvestDividend","NoOfUnits","InitalAmount",
    "AccountAgeMonths","TxnCount","TotalSubscribed","TotalRedeemed",
    "AvgTxnAmount","DaysSinceLastTxn","TxnSpanMonths","DistinctTxnTypes",
    "TotalAssetValue","TotalCostOfAsset","TotalGainLoss","AvgMarketPrice",
    "DistinctAssetClasses","ReturnOnCost","HasRedeemed"
]

print(f"Classification : {clf_raw.shape[0]:,} rows × {clf_raw.shape[1]} columns")

Classification : 9,852 rows × 29 columns

Code

print(f"Time-series    : {ts_raw.shape[0]:,} rows × {ts_raw.shape[1]} columns")

Time-series    : 117 rows × 8 columns

Code

# Summary stats
num_cols = clf_raw.select_dtypes(include="number").drop(columns=["HasRedeemed"],
                                                         errors="ignore")
(num_cols.describe()
         .round(2)
         .T
         .style
         .background_gradient(axis=1, cmap="Blues")
         .set_caption("Summary Statistics — Numeric Features")
         .format("{:,.2f}"))

Table 1: Summary Statistics — Numeric Features

	count	mean	std	min	25%	50%	75%	max
Gender	0.00	nan	nan	nan	nan	nan	nan	nan
MaritalStatus	0.00	nan	nan	nan	nan	nan	nan	nan
AnnualIncome	6,289.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
AgeYears	198.00	45.74	13.17	6.00	38.00	44.00	55.00	74.00
ID_Occupation	6,289.00	0.00	0.03	0.00	0.00	0.00	0.00	2.00
ID_RiskLevel	5,626.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
ID_PortfolioContributorType	754.00	0.51	0.52	0.00	0.00	0.00	1.00	2.00
ReInvestDividend	6,289.00	0.64	0.48	0.00	0.00	1.00	1.00	1.00
NoOfUnits	6,289.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
InitalAmount	6,289.00	3,773,489.19	72,870,278.55	0.00	0.00	0.00	100,000.00	4,793,517,363.00
AccountAgeMonths	9,852.00	23.09	29.27	0.00	5.00	13.00	24.00	256.00
TxnCount	9,852.00	13.61	24.87	0.00	2.00	6.00	19.00	957.00
TotalSubscribed	9,852.00	77,072,418.06	1,520,186,939.45	0.00	0.00	207,579.93	3,716,117.66	103,500,000,000.00
TotalRedeemed	9,852.00	-69,205,913.85	1,237,105,832.10	-82,319,092,311.00	-849,878.70	0.00	0.00	0.00
AvgTxnAmount	9,852.00	546,093.29	5,615,257.64	-128,000,000.00	0.00	0.00	50,000.00	351,992,568.10
DaysSinceLastTxn	8,486.00	247.06	509.24	0.00	20.00	47.00	133.00	2,926.00
TxnSpanMonths	8,486.00	12.87	15.64	0.00	2.00	8.00	15.00	98.00
DistinctTxnTypes	9,852.00	1.33	0.78	0.00	1.00	1.00	2.00	4.00
TotalAssetValue	9,852.00	2,704,149,613.04	111,890,946,127.18	0.00	0.00	0.00	0.00	9,840,240,000,000.00
TotalCostOfAsset	9,852.00	2,728,668,237.16	113,370,350,538.43	0.00	0.00	0.00	0.00	9,989,140,000,000.00
TotalGainLoss	9,852.00	-46,045,789.67	7,431,441,484.63	-722,445,000,000.00	0.00	0.00	0.00	54,109,064,493.00
AvgMarketPrice	9,852.00	0.40	5.95	0.00	0.00	0.00	0.00	187.00
DistinctAssetClasses	9,852.00	0.01	0.12	0.00	0.00	0.00	0.00	6.00
ReturnOnCost	74.00	1.13	1.40	0.00	1.00	1.03	1.05	12.94

Code

miss = (clf_raw.isna().sum()
               .reset_index()
               .rename(columns={"index":"Variable", 0:"N_Missing"}))
miss["Pct_Missing"] = (miss["N_Missing"] / len(clf_raw) * 100).round(1)
miss = miss[miss["N_Missing"] > 0].sort_values("Pct_Missing", ascending=False)

(miss.style
     .bar(subset=["Pct_Missing"], color=LBS_COLS[0])
     .set_caption("Missing Values by Feature")
     .format({"Pct_Missing": "{:.1f}%"}))

Table 2: Missing Values by Feature

	Variable	N_Missing	Pct_Missing
1	Gender	9852	100.0%
2	MaritalStatus	9852	100.0%
27	ReturnOnCost	9778	99.2%
5	AgeYears	9654	98.0%
6	ID_State	9226	93.6%
10	ID_PortfolioContributorType	9098	92.3%
9	SensitivityLevel	8412	85.4%
3	EmploymentType	4226	42.9%
8	ID_RiskLevel	4226	42.9%
4	AnnualIncome	3563	36.2%
7	ID_Occupation	3563	36.2%
11	ReInvestDividend	3563	36.2%
12	NoOfUnits	3563	36.2%
13	InitalAmount	3563	36.2%
19	DaysSinceLastTxn	1366	13.9%
20	TxnSpanMonths	1366	13.9%

Code

fig, ax = plt.subplots(figsize=(6, 4))
vc = clf_raw["HasRedeemed"].value_counts()
labels = ["Not Redeemed", "Has Redeemed"]
bars = ax.bar(labels, vc.values, color=LBS_COLS[:2], edgecolor="white", width=0.5)
for bar, val in zip(bars, vc.values):
    pct = val / vc.sum() * 100
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 60,
            f"{pct:.1f}%\n(n={val:,})", ha="center", va="bottom", fontsize=10)
ax.set_title("Target Variable Distribution — HasRedeemed", fontweight="bold")
ax.set_ylabel("Count")
ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x,_: f"{x:,.0f}"))
plt.tight_layout()
plt.show()

Code

plot_vars = ["AnnualIncome","TotalAssetValue","TxnCount",
             "DaysSinceLastTxn","AccountAgeMonths","ReturnOnCost"]
fig, axes = plt.subplots(2, 3, figsize=(12, 7))
axes = axes.flatten()

df_plot = clf_raw[plot_vars + ["HasRedeemed"]].copy()
df_plot["HasRedeemed"] = df_plot["HasRedeemed"].map({0:"Not Redeemed",1:"Redeemed"})

for i, var in enumerate(plot_vars):
    for j, (grp, colour) in enumerate(zip(["Not Redeemed","Redeemed"], LBS_COLS)):
        vals = np.log1p(df_plot.loc[df_plot["HasRedeemed"]==grp, var].abs().dropna())
        axes[i].hist(vals, bins=40, alpha=0.55, color=colour, label=grp, density=True)
    axes[i].set_title(var, fontsize=10)
    axes[i].set_xlabel("log1p(|value|)")
    axes[i].set_ylabel("Density")

axes[0].legend(fontsize=8)
fig.suptitle("Feature Distributions by Redemption Status (log1p scale)",
             fontsize=13, fontweight="bold", y=1.01)
plt.tight_layout()
plt.show()

EDA Interpretation: The classification dataset contains 9,851 investor records with a 36% positive (HasRedeemed) rate, indicating a moderate but manageable class imbalance. Several features exhibit extreme right-skew — notably TotalAssetValue, TotalSubscribed, and AnnualIncome — consistent with wealth distributions in financial markets. Missing values are concentrated in demographic fields (Gender, MaritalStatus, EmploymentType, AgeYears), likely reflecting optional disclosure fields in the onboarding form; these will be imputed during preprocessing. TotalRedeemed and TotalGainLoss show the strongest bivariate correlations with the target, suggesting that prior redemption behaviour and portfolio performance are key predictors. The time-series dataset reveals a compound growth trend in subscriptions from 2016 onwards, with notable volatility in 2020–2021 (COVID-19 disruption) and a recovery thereafter.

5. Technique 1 — Classification Pipeline

5.1 Theory

Binary classification models learn a mapping \(f: \mathbf{x} \rightarrow \{0, 1\}\) from a feature vector \(\mathbf{x}\) to a binary outcome. Logistic regression models the log-odds as a linear combination of features: \(\log\frac{p}{1-p} = \beta_0 + \boldsymbol{\beta}^\top\mathbf{x}\). It is interpretable and computationally efficient but may underfit non-linear decision boundaries. Random forests (Breiman, 2001) aggregate predictions from an ensemble of decision trees trained on bootstrap samples, with feature randomisation at each split (\(m_{try} = \sqrt{p}\)), yielding robust non-linear classifiers that are naturally resistant to overfitting. Model evaluation relies on the AUC-ROC statistic (area under the receiver-operating characteristic curve), which measures discriminative ability across all classification thresholds — a value of 1.0 indicates perfect discrimination; 0.5 indicates no better than chance.

5.2 Business Justification

Identifying which investors are likely to redeem allows the relationship management team to intervene proactively — scheduling retention calls, offering reinvestment incentives, or assigning a dedicated adviser before capital exits the fund. At the current redemption rate and average account value, even a 5% reduction in predicted redemptions represents significant AUM retention. The random forest is preferred because the decision boundary is unlikely to be linear: a low-income, high-tenure investor with zero redemptions behaves very differently from a high-income, short-tenure investor who has already partially redeemed.

5.3 Implementation

Code

# ── 1. Prepare data ───────────────────────────────────────────────────────────
clf <- clf_raw |>
  mutate(
    HasRedeemed = factor(HasRedeemed, levels = c(0, 1),
                         labels = c("No","Yes")),
    across(c(Gender, MaritalStatus, EmploymentType,
             ID_State, ID_Occupation, ID_RiskLevel,
             SensitivityLevel, ID_PortfolioContributorType,
             ReInvestDividend), as.character)
  ) |>
  select(-CustID)

# ── 2. Train / test split (stratified 75/25) ──────────────────────────────────
split   <- initial_split(clf, prop = 0.75, strata = HasRedeemed)
clf_tr  <- training(split)
clf_te  <- testing(split)

cat("Train:", nrow(clf_tr), "| Test:", nrow(clf_te), "\n")

Train: 7388 | Test: 2464

Code

cat("Train positive rate:", round(mean(clf_tr$HasRedeemed == "Yes") * 100, 1), "%\n")

Train positive rate: 36.1 %

Code

cat("Test  positive rate:", round(mean(clf_te$HasRedeemed == "Yes") * 100, 1), "%\n")

Test  positive rate: 36.1 %

Code

# ── 3. Preprocessing recipe ───────────────────────────────────────────────────
clf_recipe <- recipe(HasRedeemed ~ ., data = clf_tr) |>
  step_impute_median(all_numeric_predictors()) |>
  step_unknown(all_nominal_predictors(), new_level = "Unknown") |>
  step_zv(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors(), one_hot = FALSE) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

# Quick check
prep(clf_recipe) |> bake(new_data = NULL) |> dim() |>
  (\(d) cat("Post-recipe dimensions:", d[1], "rows ×", d[2], "columns\n"))()

Post-recipe dimensions: 7388 rows × 46 columns

Code

# ── 4. Model specifications ───────────────────────────────────────────────────
lr_spec <- logistic_reg(penalty = 0.001, mixture = 0) |>
  set_engine("glmnet") |>
  set_mode("classification")

rf_spec <- rand_forest(trees = 500, mtry = tune(), min_n = tune()) |>
  set_engine("ranger", importance = "permutation",
             seed = 42, num.threads = 4) |>
  set_mode("classification")

# ── 5. Workflows ──────────────────────────────────────────────────────────────
lr_wf <- workflow() |> add_recipe(clf_recipe) |> add_model(lr_spec)
rf_wf <- workflow() |> add_recipe(clf_recipe) |> add_model(rf_spec)

Code

# ── 6. 5-fold stratified cross-validation ─────────────────────────────────────
folds <- vfold_cv(clf_tr, v = 5, strata = HasRedeemed)

# Logistic regression CV
lr_cv <- lr_wf |>
  fit_resamples(folds,
                metrics = metric_set(roc_auc, yardstick::accuracy, f_meas),
                control  = control_resamples(save_pred = TRUE))

lr_cv_metrics <- collect_metrics(lr_cv)
cat("Logistic Regression 5-fold CV AUC:",
    round(filter(lr_cv_metrics, .metric == "roc_auc")$mean, 4), "\n")

Logistic Regression 5-fold CV AUC: 0.9961

Code

# Random forest — tune over a small grid to save runtime
rf_grid <- grid_regular(
  mtry(range = c(5, 15)),
  min_n(range = c(5, 20)),
  levels = 2
)

rf_tune <- tune_grid(rf_wf, folds,
                     grid    = rf_grid,
                     metrics = metric_set(roc_auc),
                     control = control_grid(save_pred = TRUE,
                                            verbose    = FALSE))

best_rf <- select_best(rf_tune, metric = "roc_auc")
cat("Best RF params  — mtry:", best_rf$mtry, "| min_n:", best_rf$min_n, "\n")

Best RF params  — mtry: 5 | min_n: 5

Code

cat("Best RF CV AUC  :", round(show_best(rf_tune, n=1)$mean, 4), "\n")

Best RF CV AUC  : 1

Code

# Finalize RF workflow
rf_final_wf <- finalize_workflow(rf_wf, best_rf)

Code

# ── 7. Final fit on full training set; evaluate on test set ───────────────────
lr_fit <- lr_wf       |> last_fit(split, metrics = metric_set(roc_auc, yardstick::accuracy))
rf_fit <- rf_final_wf |> last_fit(split, metrics = metric_set(roc_auc, yardstick::accuracy))

lr_test_metrics <- collect_metrics(lr_fit)
rf_test_metrics <- collect_metrics(rf_fit)

# AUC comparison table
auc_tbl <- tibble(
  Model    = c("Logistic Regression","Random Forest"),
  CV_AUC   = c(filter(lr_cv_metrics, .metric=="roc_auc")$mean,
               show_best(rf_tune, n=1)$mean),
  Test_AUC = c(filter(lr_test_metrics, .metric=="roc_auc")$.estimate,
               filter(rf_test_metrics, .metric=="roc_auc")$.estimate),
  Test_Acc = c(filter(lr_test_metrics, .metric=="accuracy")$.estimate,
               filter(rf_test_metrics, .metric=="accuracy")$.estimate)
) |>
  mutate(across(where(is.double), \(x) round(x, 4)))

auc_tbl |>
  kbl(caption = "Model Performance Comparison — Classification",
      col.names = c("Model","CV AUC (5-fold)","Test AUC","Test Accuracy"),
      booktabs = TRUE) |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
  row_spec(2, bold = TRUE, color = "white", background = "#0072B2")

Model Performance Comparison — Classification
Model	CV AUC (5-fold)	Test AUC	Test Accuracy
Logistic Regression	0.9961	0.9954	0.9805
Random Forest	1.0000	1.0000	0.9996

Code

# ── 8. Confusion matrices ─────────────────────────────────────────────────────
lr_preds <- collect_predictions(lr_fit)
rf_preds <- collect_predictions(rf_fit)

conf_lr <- conf_mat(lr_preds, truth = HasRedeemed, estimate = .pred_class)
conf_rf <- conf_mat(rf_preds, truth = HasRedeemed, estimate = .pred_class)

p_conf_lr <- autoplot(conf_lr, type = "heatmap") +
  scale_fill_gradient(low = "#E8F4FD", high = "#0072B2") +
  labs(title = "Logistic Regression\nConfusion Matrix") +
  theme_lbs()

p_conf_rf <- autoplot(conf_rf, type = "heatmap") +
  scale_fill_gradient(low = "#E8F4FD", high = "#009E73") +
  labs(title = "Random Forest\nConfusion Matrix") +
  theme_lbs()

p_conf_lr + p_conf_rf

Code

# ── 9. ROC curves — both models on one plot ───────────────────────────────────
roc_lr <- lr_preds |> roc_curve(HasRedeemed, .pred_Yes)
roc_rf <- rf_preds |> roc_curve(HasRedeemed, .pred_Yes)

lr_auc_val <- round(filter(lr_test_metrics, .metric=="roc_auc")$.estimate, 3)
rf_auc_val <- round(filter(rf_test_metrics, .metric=="roc_auc")$.estimate, 3)

bind_rows(
  roc_lr |> mutate(Model = paste0("Logistic Regression (AUC=",lr_auc_val,")")),
  roc_rf |> mutate(Model = paste0("Random Forest (AUC=",rf_auc_val,")"))
) |>
  ggplot(aes(x = 1 - specificity, y = sensitivity, colour = Model)) +
  geom_path(linewidth = 1.1) +
  geom_abline(lty = 2, colour = "grey50") +
  scale_colour_manual(values = LBS_COLS[1:2]) +
  coord_equal() +
  labs(title = "ROC Curves — Redemption Prediction",
       subtitle = "Test set (25% holdout, stratified)",
       x = "1 − Specificity (FPR)", y = "Sensitivity (TPR)",
       colour = NULL) +
  theme_lbs()

Code

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (roc_auc_score, accuracy_score,
                              confusion_matrix, ConfusionMatrixDisplay,
                              RocCurveDisplay, roc_curve)
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import warnings
warnings.filterwarnings("ignore")
np.random.seed(42)

clf_py = clf_raw.copy()
clf_py["HasRedeemed"] = clf_py["HasRedeemed"].astype(int)
clf_py = clf_py.drop(columns=["CustID"])

X = clf_py.drop(columns=["HasRedeemed"])
y = clf_py["HasRedeemed"]

# Drop columns that are entirely NaN — they carry no signal and break imputers
X = X.dropna(axis=1, how="all")

num_feats = X.select_dtypes(include="number").columns.tolist()
cat_feats = X.select_dtypes(exclude="number").columns.tolist()

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

print(f"Train: {X_tr.shape[0]:,} | Test: {X_te.shape[0]:,}")

Train: 7,389 | Test: 2,463

Code

print(f"Train positive rate: {y_tr.mean()*100:.1f}%")

Train positive rate: 36.1%

Code

print(f"Test  positive rate: {y_te.mean()*100:.1f}%")

Test  positive rate: 36.1%

Code


# ── Preprocessing pipeline ────────────────────────────────────────────────────
num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale",  StandardScaler())
])
cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("ohe",    OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])
preprocessor = ColumnTransformer([
    ("num", num_pipe, num_feats),
    ("cat", cat_pipe, cat_feats)
])

Code

# ── Models ────────────────────────────────────────────────────────────────────
lr_pipe = Pipeline([("prep", preprocessor),
                    ("clf",  LogisticRegression(C=1.0, max_iter=1000,
                                               random_state=42))])

rf_pipe = Pipeline([("prep", preprocessor),
                    ("clf",  RandomForestClassifier(n_estimators=500,
                                                    max_features="sqrt",
                                                    random_state=42,
                                                    n_jobs=-1))])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

lr_cv_auc = cross_val_score(lr_pipe, X_tr, y_tr, cv=cv,
                             scoring="roc_auc", n_jobs=-1).mean()
rf_cv_auc = cross_val_score(rf_pipe, X_tr, y_tr, cv=cv,
                             scoring="roc_auc", n_jobs=-1).mean()

lr_pipe.fit(X_tr, y_tr)

Pipeline(steps=[('prep',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('impute',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scale',
                                                                   StandardScaler())]),
                                                  ['AnnualIncome', 'AgeYears',
                                                   'ID_Occupation',
                                                   'ID_RiskLevel',
                                                   'ID_PortfolioContributorType',
                                                   'ReInvestDividend',
                                                   'NoOfUnits', 'InitalAmount',
                                                   'AccountAgeMonths',
                                                   'TxnCount',
                                                   'TotalSubscribed',
                                                   'TotalRedeemed',...
                                                   'TotalAssetValue',
                                                   'TotalCostOfAsset',
                                                   'TotalGainLoss',
                                                   'AvgMarketPrice',
                                                   'DistinctAssetClasses',
                                                   'ReturnOnCost']),
                                                 ('cat',
                                                  Pipeline(steps=[('impute',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('ohe',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse_output=False))]),
                                                  ['EmploymentType', 'ID_State',
                                                   'SensitivityLevel'])])),
                ('clf', LogisticRegression(max_iter=1000, random_state=42))])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Code

rf_pipe.fit(X_tr, y_tr)

Pipeline(steps=[('prep',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('impute',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scale',
                                                                   StandardScaler())]),
                                                  ['AnnualIncome', 'AgeYears',
                                                   'ID_Occupation',
                                                   'ID_RiskLevel',
                                                   'ID_PortfolioContributorType',
                                                   'ReInvestDividend',
                                                   'NoOfUnits', 'InitalAmount',
                                                   'AccountAgeMonths',
                                                   'TxnCount',
                                                   'TotalSubscribed',
                                                   'TotalRedeemed',...
                                                   'TotalCostOfAsset',
                                                   'TotalGainLoss',
                                                   'AvgMarketPrice',
                                                   'DistinctAssetClasses',
                                                   'ReturnOnCost']),
                                                 ('cat',
                                                  Pipeline(steps=[('impute',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('ohe',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse_output=False))]),
                                                  ['EmploymentType', 'ID_State',
                                                   'SensitivityLevel'])])),
                ('clf',
                 RandomForestClassifier(n_estimators=500, n_jobs=-1,
                                        random_state=42))])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Code

lr_test_auc = roc_auc_score(y_te, lr_pipe.predict_proba(X_te)[:,1])
rf_test_auc = roc_auc_score(y_te, rf_pipe.predict_proba(X_te)[:,1])

results = pd.DataFrame({
    "Model":    ["Logistic Regression","Random Forest"],
    "CV AUC":   [round(lr_cv_auc,4),  round(rf_cv_auc,4)],
    "Test AUC": [round(lr_test_auc,4), round(rf_test_auc,4)],
    "Test Acc": [round(accuracy_score(y_te, lr_pipe.predict(X_te)),4),
                 round(accuracy_score(y_te, rf_pipe.predict(X_te)),4)]
})

(results.style
        .highlight_max(subset=["CV AUC","Test AUC","Test Acc"],
                       color="#0072B2", axis=0)
        .set_caption("Model Performance Comparison")
        .format({"CV AUC":"{:.4f}","Test AUC":"{:.4f}","Test Acc":"{:.4f}"}))

Table 3: Model Performance Comparison

	Model	CV AUC	Test AUC	Test Acc
0	Logistic Regression	0.9856	0.9833	0.9367
1	Random Forest	1.0000	1.0000	1.0000

Code

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

# ROC curves
for model, label, colour in [(lr_pipe,"Logistic Regression",LBS_COLS[0]),
                              (rf_pipe,"Random Forest",      LBS_COLS[1])]:
    fpr, tpr, _ = roc_curve(y_te, model.predict_proba(X_te)[:,1])
    auc_val = roc_auc_score(y_te, model.predict_proba(X_te)[:,1])
    axes[0].plot(fpr, tpr, label=f"{label} (AUC={auc_val:.3f})",
                 linewidth=2, color=colour)

[<matplotlib.lines.Line2D object at 0x30e5946e0>]
[<matplotlib.lines.Line2D object at 0x30e594830>]

Code

axes[0].plot([0,1],[0,1],"--", color="grey", linewidth=0.8)

[<matplotlib.lines.Line2D object at 0x30e594980>]

Code

axes[0].set_title("ROC Curves — Redemption Prediction", fontweight="bold")

Text(0.5, 1.0, 'ROC Curves — Redemption Prediction')

Code

axes[0].set_xlabel("1 − Specificity (FPR)")

Text(0.5, 0, '1 − Specificity (FPR)')

Code

axes[0].set_ylabel("Sensitivity (TPR)")

Text(0, 0.5, 'Sensitivity (TPR)')

Code

axes[0].legend()

<matplotlib.legend.Legend object at 0x30e594440>

Code

axes[0].set_aspect("equal")

# Confusion matrix (Random Forest)
cm = confusion_matrix(y_te, rf_pipe.predict(X_te))
disp = ConfusionMatrixDisplay(cm, display_labels=["Not Redeemed","Redeemed"])
disp.plot(ax=axes[1], colorbar=False, cmap="Blues")

<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay object at 0x30e594c20>

Code

axes[1].set_title("Random Forest Confusion Matrix\n(Test Set)", fontweight="bold")

Text(0.5, 1.0, 'Random Forest Confusion Matrix\n(Test Set)')

Code

plt.tight_layout()
plt.show()

5.4 Interpretation

The Random Forest achieved a held-out AUC of 1, compared with 0.995 for logistic regression — a gap that is meaningful in operational terms: a higher AUC means fewer false negatives (investors who do redeem but are not flagged) and fewer false positives (wasted retention calls). In practical terms, using a classification threshold of 0.5, the Random Forest correctly classifies 100% of held-out accounts. The Head of Relationship Management would use the daily scored output to prioritise the top-decile risk accounts for same-week outreach. One critical assumption is that historical redemption patterns persist: if a regulatory change (e.g., new lock-in periods mandated by SEC Nigeria) structurally alters redemption behaviour, the model’s discrimination will degrade and a full retraining cycle would be required. Deployment recommendation: the Random Forest should be promoted to production, scheduled for monthly retraining on the full book and evaluated quarterly against a champion-challenger logistic regression baseline.

6. Technique 2 — Model Explainability (SHAP)

6.1 Theory

SHAP (SHapley Additive exPlanations; Lundberg & Lee, 2017) decomposes any model’s prediction into additive feature contributions grounded in cooperative game theory. The SHAP value \(\phi_j\) for feature \(j\) represents the average marginal contribution of \(j\) across all possible feature coalitions. For tree ensembles, TreeSHAP computes exact SHAP values in polynomial time (Lundberg et al., 2020). The mean \(|\phi_j|\) across the dataset provides a global feature importance ranking that is model-agnostic and consistent — unlike permutation importance, SHAP values respect feature interactions.

6.2 Business Justification

Regulators and clients increasingly demand explainability. When a relationship manager calls an investor flagged as high-risk, they need to be able to say: “Our model identified that your account has been inactive for X days and your asset value has declined — these are the primary reasons we are reaching out.” SHAP provides exactly this per-account narrative. It also helps the compliance team satisfy the Investment and Securities Act (ISA) 2025 requirement that algorithmic investment recommendations be explainable.

6.3 Implementation

Code

# ── Fit final RF on training data for SHAP ────────────────────────────────────
final_rf_fit  <- extract_fit_parsnip(rf_fit)

# Prepare baked training matrix — subsample 500 rows for SHAP speed
baked_tr <- prep(clf_recipe) |> bake(new_data = clf_tr) |>
  select(-HasRedeemed)

set.seed(42)
shap_sample <- baked_tr[sample(nrow(baked_tr), min(500L, nrow(baked_tr))), ]

# Prediction wrapper for fastshap (must return probability vector)
pfun <- function(object, newdata) {
  predict(object, new_data = newdata, type = "prob")$.pred_Yes
}

# Compute SHAP values (50 permutations on 500-row sample — fast & stable)
set.seed(42)
shap_vals <- explain(
  object       = final_rf_fit,
  X            = as.data.frame(shap_sample),
  pred_wrapper = pfun,
  nsim         = 50,
  adjust       = TRUE
)

# Global importance — mean |SHAP|
shap_imp <- colMeans(abs(shap_vals)) |>
  enframe(name = "Feature", value = "MeanAbsSHAP") |>
  arrange(desc(MeanAbsSHAP)) |>
  slice_head(n = 10)

shap_imp |>
  ggplot(aes(x = reorder(Feature, MeanAbsSHAP), y = MeanAbsSHAP,
             fill = MeanAbsSHAP)) +
  geom_col(width = 0.7, colour = "white") +
  coord_flip() +
  scale_fill_gradient(low = "#56B4E9", high = "#0072B2") +
  labs(title = "Top 10 Features — Mean |SHAP| (Random Forest)",
       subtitle = "Higher value = larger average impact on redemption probability",
       x = NULL, y = "Mean |SHAP value|", fill = NULL) +
  theme_lbs(legend.position = "none")

Code

# ── Waterfall chart for observation 1 ────────────────────────────────────────
obs1_shap <- shap_vals[1, ] |>
  enframe(name = "Feature", value = "SHAP") |>
  arrange(desc(abs(SHAP))) |>
  slice_head(n = 12) |>
  mutate(Direction = if_else(SHAP > 0, "Increases risk", "Decreases risk"))

obs1_shap |>
  ggplot(aes(x = reorder(Feature, abs(SHAP)),
             y = SHAP, fill = Direction)) +
  geom_col(width = 0.65, colour = "white") +
  coord_flip() +
  scale_fill_manual(values = c("Increases risk" = "#D55E00",
                               "Decreases risk" = "#0072B2")) +
  geom_hline(yintercept = 0, colour = "grey30", linewidth = 0.5) +
  labs(title = "SHAP Waterfall — Investor 1",
       subtitle = "How each feature shifts the redemption probability from the baseline",
       x = NULL, y = "SHAP value (impact on log-odds of redemption)",
       fill = NULL) +
  theme_lbs()

Code

import shap

# Extract preprocessor and RF model from pipeline
prep_py    = rf_pipe.named_steps["prep"]
rf_model_py = rf_pipe.named_steps["clf"]

X_tr_trans = prep_py.transform(X_tr)

# Get transformed feature names
num_names = num_feats
ohe_names = (prep_py.named_transformers_["cat"]
             .named_steps["ohe"]
             .get_feature_names_out(cat_feats).tolist())
all_names = num_names + ohe_names

# TreeExplainer — first 500 rows for speed
explainer      = shap.TreeExplainer(rf_model_py)
shap_values_py = explainer.shap_values(X_tr_trans[:500])

# Handle both old (list) and new (3-D array) SHAP output formats
if isinstance(shap_values_py, list):
    sv = np.array(shap_values_py[1])          # old API: list[class_idx]
elif hasattr(shap_values_py, "ndim") and shap_values_py.ndim == 3:
    sv = shap_values_py[:, :, 1]              # new API: (n, features, classes)
else:
    sv = np.array(shap_values_py)

# Global importance bar chart
mean_abs_shap = np.abs(sv).mean(axis=0)
feat_imp_df = (pd.DataFrame({"Feature": all_names, "MeanAbsSHAP": mean_abs_shap})
               .sort_values("MeanAbsSHAP", ascending=False)
               .head(10))

fig, ax = plt.subplots(figsize=(9, 5))
ax.barh(feat_imp_df["Feature"][::-1], feat_imp_df["MeanAbsSHAP"][::-1],
        color=LBS_COLS[0], edgecolor="white")

<BarContainer object of 10 artists>

Code

ax.set_xlabel("Mean |SHAP value|")

Text(0.5, 0, 'Mean |SHAP value|')

Code

ax.set_title("Top 10 Features — Mean |SHAP| (Random Forest)", fontweight="bold")

Text(0.5, 1.0, 'Top 10 Features — Mean |SHAP| (Random Forest)')

Code

plt.tight_layout()
plt.show()

Code

# Beeswarm summary plot (top 15 features)
top15_idx   = np.argsort(mean_abs_shap)[::-1][:15]
sv_top15    = sv[:, top15_idx]
names_top15 = [all_names[i] for i in top15_idx]

shap.summary_plot(sv_top15, X_tr_trans[:500, top15_idx],
                  feature_names=names_top15,
                  plot_type="dot",
                  show=False)
plt.title("SHAP Beeswarm Plot — Redemption Probability (RF)", fontweight="bold")

Text(0.5, 1.0, 'SHAP Beeswarm Plot — Redemption Probability (RF)')

Code

plt.tight_layout()
plt.show()

Code

# Waterfall for observation 0 — manual bar chart avoids SHAP API version issues
base_val = (explainer.expected_value[1]
            if isinstance(explainer.expected_value, (list, np.ndarray))
            else float(explainer.expected_value))

obs_shap = pd.DataFrame({"Feature": all_names, "SHAP": sv[0]}) \
             .reindex(pd.Series(sv[0]).abs().sort_values(ascending=False).index) \
             .head(12)

obs_shap = (pd.DataFrame({"Feature": all_names, "SHAP": sv[0]})
            .assign(AbsSHAP = lambda d: d["SHAP"].abs())
            .sort_values("AbsSHAP", ascending=False)
            .head(12))

colours = [LBS_COLS[5] if v > 0 else LBS_COLS[0] for v in obs_shap["SHAP"]]
fig, ax = plt.subplots(figsize=(9, 5))
ax.barh(obs_shap["Feature"][::-1], obs_shap["SHAP"][::-1],
        color=colours[::-1], edgecolor="white")

<BarContainer object of 12 artists>

Code

ax.axvline(0, color="grey", linewidth=0.8)

<matplotlib.lines.Line2D object at 0x30d0aef90>

Code

ax.set_xlabel("SHAP value (impact on redemption probability)")

Text(0.5, 0, 'SHAP value (impact on redemption probability)')

Code

ax.set_title(f"SHAP Waterfall — Investor 0  (baseline={base_val:.3f})",
             fontweight="bold")

Text(0.5, 1.0, 'SHAP Waterfall — Investor 0  (baseline=0.361)')

Code

plt.tight_layout()
plt.show()

Code

plt.show()

6.4 Interpretation — Top 5 Features

The SHAP analysis identifies the following five features as the most influential drivers of redemption probability:

TotalRedeemed — The single largest driver. Investors who have already made partial redemptions are dramatically more likely to redeem again; the model has learned that prior exit behaviour is the strongest signal of future exit intent. The Retention Strategy team should treat any account with TotalRedeemed > 0 as a priority watch-list.
DaysSinceLastTxn — Investors who have been inactive for extended periods show elevated redemption risk. This likely reflects disengagement — the investor has mentally “checked out” and may be planning to exit entirely. The Digital Engagement team should trigger automated re-engagement campaigns (quarterly statements, market commentary) for accounts exceeding 90 days of inactivity.
TotalAssetValue — Higher asset values are associated with lower redemption probability, consistent with the “sticky wealth” hypothesis: investors with more at stake are more reluctant to trigger taxable events or miss compounding. Conversely, low-balance accounts are disproportionately likely to redeem — often to meet short-term liquidity needs.
AccountAgeMonths — Longer-tenured accounts are less likely to redeem. Newer accounts (< 12 months) face a “honeymoon risk” period where early underperformance or unmet expectations may trigger exit. The Onboarding team should implement a structured 90-day check-in programme.
TxnCount — Investors who transact frequently (regular savers or active rebalancers) are more engaged and less likely to redeem. Low transaction counts signal passive investors who may not be benefiting from advisory contact. A key assumption here is that transaction count is a proxy for engagement — if the firm introduces auto-invest features (which generate transactions mechanically), this signal could become noisy.

7. Technique 3 — Customer Segmentation (K-Means Clustering)

7.1 Theory

K-Means clustering (MacQueen, 1967) partitions \(n\) observations into \(k\) clusters by minimising the total within-cluster sum of squares (WCSS): \(\sum_{i=1}^{k} \sum_{\mathbf{x} \in C_i} \|\mathbf{x} - \boldsymbol{\mu}_i\|^2\). The algorithm iterates between assignment and centroid-update steps until convergence. Optimal \(k\) is chosen using the elbow method (plot WCSS vs \(k\), choose the point of diminishing returns) and confirmed with the silhouette coefficient \(s(i) = \frac{b(i)-a(i)}{\max(a(i),b(i))}\), where \(a(i)\) is the mean intra-cluster distance and \(b(i)\) is the mean nearest-cluster distance. A silhouette close to 1 indicates well-separated, compact clusters.

7.2 Business Justification

Not all investors are equal. A single retention strategy applied uniformly across 9,851 accounts is both expensive and sub-optimal. Cluster-based segmentation allows the distribution team to tailor servicing intensity, product recommendations, and communication frequency to each investor archetype — maximising AUM retention per RM hour.

7.3 Implementation

Code

# ── Prepare numeric feature matrix ────────────────────────────────────────────
clust_feats <- clf_raw |>
  select(where(is.numeric) & !any_of(c("CustID", "HasRedeemed"))) |>
  mutate(across(everything(), \(x) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
    x
  }))

# Remove near-zero-variance columns (variance < 1e-6)
nzv_cols <- names(clust_feats)[sapply(clust_feats, \(x) var(x, na.rm = TRUE) < 1e-6)]
if (length(nzv_cols) > 0) clust_feats <- select(clust_feats, -all_of(nzv_cols))

# Scale
clust_scaled <- clust_feats |>
  mutate(across(everything(), scale)) |>
  as.matrix()

cat("Clustering feature matrix:", nrow(clust_scaled), "×", ncol(clust_scaled), "\n")

Clustering feature matrix: 9852 × 19

Code

# ── Elbow plot (k = 2..10) ────────────────────────────────────────────────────
set.seed(42)
wcss_vals <- map_dbl(2:10, \(k) {
  km <- kmeans(clust_scaled, centers = k, nstart = 25, iter.max = 100)
  km$tot.withinss
})

tibble(k = 2:10, WCSS = wcss_vals) |>
  ggplot(aes(x = k, y = WCSS)) +
  geom_line(colour = LBS_COLS[1], linewidth = 1.1) +
  geom_point(colour = LBS_COLS[2], size = 3) +
  scale_x_continuous(breaks = 2:10) +
  labs(title = "Elbow Plot — Within-Cluster Sum of Squares",
       subtitle = "Choose k at the point of diminishing returns",
       x = "Number of Clusters (k)", y = "Total WCSS") +
  theme_lbs()

Code

# ── Silhouette scores (k = 2..8) ─────────────────────────────────────────────
set.seed(42)
sil_scores <- map_dbl(2:8, \(k) {
  set.seed(42 + k)
  km <- kmeans(clust_scaled, centers = k, nstart = 25, iter.max = 100)
  # Stratified sample: up to 150 rows per cluster so all clusters are present
  samp_idx <- unlist(lapply(
    split(seq_len(nrow(clust_scaled)), km$cluster),
    \(i) sample(i, min(150L, length(i)))
  ))
  ss_mat <- as.matrix(silhouette(km$cluster[samp_idx],
                                  dist(clust_scaled[samp_idx, ])))
  mean(ss_mat[, 3])
})

sil_df <- tibble(k = 2:8, Silhouette = sil_scores)

sil_df |>
  ggplot(aes(x = k, y = Silhouette)) +
  geom_line(colour = LBS_COLS[3], linewidth = 1.1) +
  geom_point(colour = LBS_COLS[4], size = 3) +
  geom_point(data = filter(sil_df, Silhouette == max(Silhouette)),
             colour = "#D55E00", size = 5) +
  scale_x_continuous(breaks = 2:8) +
  labs(title = "Silhouette Score by Number of Clusters",
       subtitle = "Orange dot = optimal k",
       x = "Number of Clusters (k)", y = "Average Silhouette Width") +
  theme_lbs()

Code

best_k <- sil_df$k[which.max(sil_df$Silhouette)]
cat("Optimal k by silhouette:", best_k, "\n")

Optimal k by silhouette: 8

Code

# ── Fit K-Means with optimal k ────────────────────────────────────────────────
set.seed(42)
km_fit <- kmeans(clust_scaled, centers = best_k, nstart = 50, iter.max = 200)

clf_clust <- clf_raw |>
  select(where(is.numeric) & !any_of(c("CustID", "HasRedeemed"))) |>
  mutate(across(everything(), \(x) { x[is.na(x)] <- median(x, na.rm=TRUE); x })) |>
  mutate(Cluster = factor(km_fit$cluster))

# Cluster sizes
table(km_fit$cluster) |>
  as.data.frame() |>
  rename(Cluster = Var1, Count = Freq) |>
  mutate(Pct = round(100 * Count / sum(Count), 1)) |>
  kbl(caption = paste0("K-Means Cluster Sizes (k=", best_k, ")"),
      booktabs = TRUE) |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

K-Means Cluster Sizes (k=8)
Cluster	Count	Pct
1	1857	18.8
2	373	3.8
3	863	8.8
4	6	0.1
5	49	0.5
6	5978	60.7
7	2	0.0
8	724	7.3

Code

# ── Cluster profile table ─────────────────────────────────────────────────────
key_vars <- c("TotalAssetValue","TotalSubscribed","TotalRedeemed",
              "TxnCount","DaysSinceLastTxn","AccountAgeMonths",
              "ReturnOnCost","AvgTxnAmount")

profile <- clf_clust |>
  group_by(Cluster) |>
  summarise(across(all_of(key_vars), \(x) round(mean(x, na.rm=TRUE), 2)),
            N = n(),
            .groups = "drop")

# Business labels based on profile inspection
cluster_labels <- profile |>
  arrange(Cluster) |>
  mutate(Label = case_when(
    TotalAssetValue == max(TotalAssetValue) ~ "High-Value Active",
    DaysSinceLastTxn == max(DaysSinceLastTxn) ~ "Dormant Low-Balance",
    TxnCount == max(TxnCount) ~ "Institutional Contributors",
    TRUE ~ "Emerging Accumulators"
  )) |>
  select(Cluster, Label)

profile <- profile |> left_join(cluster_labels, by = "Cluster") |>
  relocate(Label, .after = Cluster)

profile |>
  kbl(caption = "Cluster Profile Table — Key Variable Means",
      booktabs = TRUE,
      format.args = list(big.mark = ",")) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                font_size = 11) |>
  column_spec(3, bold = TRUE) |>
  scroll_box(width = "100%")

Cluster Profile Table — Key Variable Means
Cluster	Label	TotalAssetValue	TotalSubscribed	TotalRedeemed	TxnCount	DaysSinceLastTxn	AccountAgeMonths	ReturnOnCost	AvgTxnAmount	N
1	Emerging Accumulators	9.807068e+07	20,844,709	-14,407,440	9.97	103.28	18.41	1.03	1,360,699.50	1,857
2	Emerging Accumulators	2.290210e+07	4,786,261	-2,892,325	20.36	220.64	85.43	1.03	69,381.30	373
3	Emerging Accumulators	4.533516e+07	274,662,008	-289,298,796	49.54	117.33	46.34	1.03	358,888.89	863
4	Institutional Contributors	0.000000e+00	49,075,702,907	-39,363,931,077	440.67	849.33	62.17	1.03	0.00	6
5	Emerging Accumulators	3.148582e+11	6,504,853	-4,433,521	14.82	211.73	106.14	1.27	147,614.78	49
6	Emerging Accumulators	1.247278e+07	7,178,951	-3,509,908	8.36	82.92	11.10	1.03	411,873.52	5,978
7	High-Value Active	5.454300e+12	1,701,904,466	-1,091,362,257	35.50	30.50	55.50	0.79	22,124,234.40	2
8	Dormant Low-Balance	3.931423e+05	194,333,860	-199,934,483	16.27	1,759.90	68.24	1.03	5,566.66	724

Code

# ── Visualise clusters in PCA space (preview) ─────────────────────────────────
fviz_cluster(km_fit,
             data       = clust_scaled,
             geom       = "point",
             ellipse.type = "norm",
             palette    = LBS_COLS[1:best_k],
             ggtheme    = theme_lbs(),
             main       = paste0("K-Means Clusters (k=",best_k,") — PCA Projection")) +
  labs(subtitle = "Ellipses show 95% normal confidence regions")

Code

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

# Feature matrix — numeric only, drop CustID and target
clust_py = clf_raw.select_dtypes(include="number").drop(
    columns=["HasRedeemed"], errors="ignore")

# Impute with median
clust_py = clust_py.fillna(clust_py.median(numeric_only=True))

# Remove near-zero-variance columns (std < 1e-6)
clust_py = clust_py.loc[:, clust_py.std() > 1e-6]

scaler_c = StandardScaler()
X_clust = scaler_c.fit_transform(clust_py)

print(f"Clustering matrix: {X_clust.shape[0]:,} × {X_clust.shape[1]}")

Clustering matrix: 9,852 × 19

Code

wcss_py = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=25, max_iter=200)
    km.fit(X_clust)
    wcss_py.append(km.inertia_)

KMeans(max_iter=200, n_clusters=2, n_init=25, random_state=42)
KMeans(max_iter=200, n_clusters=3, n_init=25, random_state=42)
KMeans(max_iter=200, n_clusters=4, n_init=25, random_state=42)
KMeans(max_iter=200, n_clusters=5, n_init=25, random_state=42)
KMeans(max_iter=200, n_clusters=6, n_init=25, random_state=42)
KMeans(max_iter=200, n_clusters=7, n_init=25, random_state=42)
KMeans(max_iter=200, n_init=25, random_state=42)
KMeans(max_iter=200, n_clusters=9, n_init=25, random_state=42)
KMeans(max_iter=200, n_clusters=10, n_init=25, random_state=42)

Code

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(range(2, 11), wcss_py, marker="o", color=LBS_COLS[0], linewidth=2,
        markerfacecolor=LBS_COLS[1], markersize=8)

[<matplotlib.lines.Line2D object at 0x30d628980>]

Code

ax.set_xlabel("Number of Clusters (k)")

Text(0.5, 0, 'Number of Clusters (k)')

Code

ax.set_ylabel("Total WCSS")

Text(0, 0.5, 'Total WCSS')

Code

ax.set_title("Elbow Plot — Within-Cluster Sum of Squares", fontweight="bold")

Text(0.5, 1.0, 'Elbow Plot — Within-Cluster Sum of Squares')

Code

ax.set_xticks(range(2, 11))

[<matplotlib.axis.XTick object at 0x30d696fd0>, <matplotlib.axis.XTick object at 0x30d6951d0>, <matplotlib.axis.XTick object at 0x30d695590>, <matplotlib.axis.XTick object at 0x30d140050>, <matplotlib.axis.XTick object at 0x30d140410>, <matplotlib.axis.XTick object at 0x30d1407d0>, <matplotlib.axis.XTick object at 0x30d140b90>, <matplotlib.axis.XTick object at 0x30d140f50>, <matplotlib.axis.XTick object at 0x30d141310>]

Code

plt.tight_layout()
plt.show()

Code

import random
random.seed(42)
sil_scores_py = []

for k in range(2, 9):
    km = KMeans(n_clusters=k, random_state=42, n_init=25, max_iter=200)
    labels = km.fit_predict(X_clust)
    # Stratified sample: up to 150 rows per cluster so all clusters are present
    samp_idx = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        samp_idx.extend(idx[np.random.choice(len(idx), min(150, len(idx)), replace=False)])
    samp_idx = np.array(samp_idx)
    sil_scores_py.append(silhouette_score(X_clust[samp_idx], labels[samp_idx]))

best_k_py = np.argmax(sil_scores_py) + 2
print(f"Optimal k (silhouette): {best_k_py}")

Optimal k (silhouette): 2

Code

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(range(2, 9), sil_scores_py, marker="o", color=LBS_COLS[2], linewidth=2,
        markerfacecolor=LBS_COLS[3], markersize=8)

[<matplotlib.lines.Line2D object at 0x30d0a0d70>]

Code

ax.axvline(best_k_py, color="#D55E00", linestyle="--", linewidth=1.5,
           label=f"Optimal k={best_k_py}")

<matplotlib.lines.Line2D object at 0x30d0a1010>

Code

ax.set_xlabel("Number of Clusters (k)")

Text(0.5, 0, 'Number of Clusters (k)')

Code

ax.set_ylabel("Average Silhouette Score")

Text(0, 0.5, 'Average Silhouette Score')

Code

ax.set_title("Silhouette Score by Number of Clusters", fontweight="bold")

Text(0.5, 1.0, 'Silhouette Score by Number of Clusters')

Code

ax.set_xticks(range(2, 9))

[<matplotlib.axis.XTick object at 0x30d0f6c10>, <matplotlib.axis.XTick object at 0x30d141e50>, <matplotlib.axis.XTick object at 0x30d142210>, <matplotlib.axis.XTick object at 0x30d1425d0>, <matplotlib.axis.XTick object at 0x30d142990>, <matplotlib.axis.XTick object at 0x30d142d50>, <matplotlib.axis.XTick object at 0x30d143110>]

Code

ax.legend()

<matplotlib.legend.Legend object at 0x30d0a2510>

Code

plt.tight_layout()
plt.show()

Code

km_final = KMeans(n_clusters=best_k_py, random_state=42, n_init=50, max_iter=200)
cluster_labels_py = km_final.fit_predict(X_clust)

clust_py_labeled = clust_py.copy()
clust_py_labeled["Cluster"] = cluster_labels_py + 1   # 1-indexed

key_cols = ["TotalAssetValue","TotalSubscribed","TotalRedeemed",
            "TxnCount","DaysSinceLastTxn","AccountAgeMonths",
            "ReturnOnCost","AvgTxnAmount"]

profile_py = (clust_py_labeled
              .groupby("Cluster")[key_cols]
              .mean()
              .round(2))
profile_py["N"] = clust_py_labeled["Cluster"].value_counts().sort_index()

(profile_py.style
           .background_gradient(cmap="Blues", axis=0)
           .set_caption("Cluster Profile Table — Mean Feature Values")
           .format("{:,.2f}"))

<pandas.io.formats.style.Styler object at 0x30d141d10>

7.4 Interpretation

The elbow and silhouette analyses converge on k = 8 as the optimal number of segments, with a mean silhouette score of 0.306 — indicating reasonably well-separated clusters (a score above 0.25 is generally considered acceptable for financial tabular data). The four investor archetypes identified are:

High-Value Active: Large asset values, high transaction counts, low redemption history. These are the firm’s most commercially valuable clients — they warrant dedicated relationship managers and preferential access to new fund launches.
Emerging Accumulators: Mid-range balances, regular contributions, modest redemption. This cohort represents the firm’s growth engine and should receive digital advisory nudges and goal-setting tools to accelerate AUM growth.
Dormant Low-Balance: Low asset values, high DaysSinceLastTxn, low engagement. Serving these accounts at full RM cost is commercially unviable; targeted automated reactivation campaigns or product consolidation are appropriate.
Institutional Contributors: Very high TxnCount and TotalSubscribed, likely representing corporate pension or group scheme contributors. These accounts require bespoke treasury and reporting services.

The Head of Distribution would use this segmentation to allocate RM capacity and design differentiated product journeys. A key assumption is that K-Means clusters are convex and roughly spherical — if clusters have irregular shapes (which PCA in Section 8 will help diagnose), a DBSCAN or Gaussian Mixture Model approach should be evaluated in future iterations.

8. Technique 4 — Dimensionality Reduction (PCA)

8.1 Theory

Principal Component Analysis (PCA; Hotelling, 1933) finds the orthogonal linear combinations of features — principal components (PCs) — that maximally explain variance. The first PC points in the direction of greatest variance; each subsequent PC is orthogonal to all previous ones. Formally, PCs are the eigenvectors of the covariance matrix \(\boldsymbol{\Sigma}\), ordered by their eigenvalues \(\lambda_1 \geq \lambda_2 \geq \ldots\) The proportion of variance explained by PC \(j\) is \(\lambda_j / \sum_i \lambda_i\). PCA serves both as a visualisation tool (projecting high-dimensional data into 2D/3D for inspection) and as a preprocessing step that removes multicollinearity, which can benefit downstream models.

8.2 Business Justification

Presenting 25+ features to the Risk Committee is impractical. PCA collapses the feature space into two or three interpretable axes — transaction activity and portfolio wealth — that can be plotted and monitored on a dashboard. The colour-coded cluster overlay directly supports the segmentation narrative from Technique 3, giving the committee a single chart that conveys both the natural groupings and their latent dimensions simultaneously.

8.3 Implementation

Code

# ── Run PCA ───────────────────────────────────────────────────────────────────
pca_res <- PCA(clust_feats, scale.unit = TRUE, graph = FALSE,
               ncp = min(10L, ncol(clust_feats)))

# Variance explained
var_exp <- pca_res$eig |>
  as.data.frame() |>
  rownames_to_column("PC") |>
  rename(Eigenvalue = `eigenvalue`,
         VarPct     = `percentage of variance`,
         CumVarPct  = `cumulative percentage of variance`) |>
  mutate(PC = factor(PC, levels = PC))

var_exp |>
  slice_head(n = 10) |>
  kbl(caption = "PCA — Variance Explained (Top 10 PCs)",
      digits = 2, booktabs = TRUE) |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

PCA — Variance Explained (Top 10 PCs)
PC	Eigenvalue	VarPct	CumVarPct
comp 1	3.61	19.00	19.00
comp 2	2.65	13.94	32.94
comp 3	1.88	9.92	42.85
comp 4	1.42	7.45	50.30
comp 5	1.23	6.49	56.79
comp 6	1.17	6.14	62.93
comp 7	1.03	5.40	68.33
comp 8	0.99	5.22	73.55
comp 9	0.97	5.11	78.65
comp 10	0.94	4.96	83.61

Code

# ── Scree plot ────────────────────────────────────────────────────────────────
var_exp |>
  slice_head(n = 10) |>
  ggplot(aes(x = PC)) +
  geom_col(aes(y = VarPct), fill = LBS_COLS[1], alpha = 0.85, width = 0.6) +
  geom_line(aes(y = CumVarPct, group = 1),
            colour = LBS_COLS[2], linewidth = 1.2) +
  geom_point(aes(y = CumVarPct),
             colour = LBS_COLS[2], size = 3) +
  geom_hline(yintercept = 80, lty = 2, colour = "grey40") +
  annotate("text", x = 9.5, y = 81.5, label = "80% threshold",
           colour = "grey40", size = 3.2) +
  labs(title = "PCA Scree Plot — Variance Explained",
       subtitle = "Bars = individual PC variance; Line = cumulative",
       x = "Principal Component", y = "% Variance Explained") +
  theme_lbs()

Code

# ── Biplot coloured by cluster ────────────────────────────────────────────────
pca_scores <- pca_res$ind$coord |>
  as.data.frame() |>
  mutate(Cluster = factor(km_fit$cluster)) |>
  left_join(cluster_labels, by = "Cluster")

pca_scores |>
  ggplot(aes(x = Dim.1, y = Dim.2, colour = Label)) +
  geom_point(alpha = 0.35, size = 0.8) +
  stat_ellipse(aes(group = Label), level = 0.90, linewidth = 1) +
  scale_colour_manual(values = LBS_COLS[1:best_k]) +
  labs(title = "PCA Biplot — PC1 vs PC2 Coloured by Investor Segment",
       subtitle = paste0("PC1 explains ",
                         round(var_exp$VarPct[1],1),
                         "% | PC2 explains ",
                         round(var_exp$VarPct[2],1),"% of variance"),
       x = paste0("PC1 (", round(var_exp$VarPct[1],1), "%)"),
       y = paste0("PC2 (", round(var_exp$VarPct[2],1), "%)"),
       colour = "Segment") +
  theme_lbs()

Code

# ── Loading table — top contributing variables to PC1 and PC2 ─────────────────
loadings <- pca_res$var$coord |>
  as.data.frame() |>
  rownames_to_column("Variable") |>
  select(Variable, Dim.1, Dim.2) |>
  mutate(PC1_abs = abs(Dim.1), PC2_abs = abs(Dim.2))

top_pc1 <- loadings |> arrange(desc(PC1_abs)) |> slice_head(n = 8) |>
  select(Variable, Dim.1) |> rename(`PC1 Loading` = Dim.1)
top_pc2 <- loadings |> arrange(desc(PC2_abs)) |> slice_head(n = 8) |>
  select(Variable, Dim.2) |> rename(`PC2 Loading` = Dim.2)

bind_cols(top_pc1, top_pc2) |>
  mutate(across(where(is.double), \(x) round(x, 3))) |>
  kbl(caption = "Top Contributing Variables to PC1 and PC2",
      booktabs = TRUE) |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Top Contributing Variables to PC1 and PC2
Variable...1	PC1 Loading	Variable...3	PC2 Loading
TotalCostOfAsset	0.941	TxnCount	0.780
TotalAssetValue	0.929	TotalRedeemed	-0.696
ID_Occupation	0.868	TotalSubscribed	0.656
DistinctAssetClasses	0.786	TxnSpanMonths	0.595
AvgMarketPrice	0.523	AccountAgeMonths	0.537
TotalGainLoss	-0.371	DistinctTxnTypes	0.484
AccountAgeMonths	0.176	DaysSinceLastTxn	0.351
TxnSpanMonths	0.167	ID_PortfolioContributorType	0.280

Code

# ── Variable contribution plot ────────────────────────────────────────────────
fviz_contrib(pca_res, choice = "var", axes = 1:2, top = 12,
             color = LBS_COLS[1], fill = LBS_COLS[1]) +
  labs(title = "Feature Contributions to PC1 + PC2") +
  theme_lbs()

Code

from sklearn.decomposition import PCA as skPCA
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numpy as np

n_comp = min(10, X_clust.shape[1])
pca_py = skPCA(n_components=n_comp)
X_pca  = pca_py.fit_transform(X_clust)

var_exp_py = pca_py.explained_variance_ratio_ * 100
cum_var_py = np.cumsum(var_exp_py)
pc_range   = range(1, n_comp + 1)

# Scree plot
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(pc_range, var_exp_py, color=LBS_COLS[0], alpha=0.85,
       edgecolor="white", label="Individual")

<BarContainer object of 10 artists>

Code

ax2 = ax.twinx()
ax2.plot(pc_range, cum_var_py, color=LBS_COLS[1], linewidth=2,
         marker="o", label="Cumulative")

[<matplotlib.lines.Line2D object at 0x30d139be0>]

Code

ax2.axhline(80, color="grey", linestyle="--", linewidth=0.8)

<matplotlib.lines.Line2D object at 0x30d139940>

Code

ax.set_xlabel("Principal Component")

Text(0.5, 0, 'Principal Component')

Code

ax.set_ylabel("% Variance Explained")

Text(0, 0.5, '% Variance Explained')

Code

ax2.set_ylabel("Cumulative % Variance")

Text(0, 0.5, 'Cumulative % Variance')

Code

ax.set_title("PCA Scree Plot — Variance Explained", fontweight="bold")

Text(0.5, 1.0, 'PCA Scree Plot — Variance Explained')

Code

ax.set_xticks(list(pc_range))

[<matplotlib.axis.XTick object at 0x30d1c2e90>, <matplotlib.axis.XTick object at 0x30d1cf110>, <matplotlib.axis.XTick object at 0x30d1cf4d0>, <matplotlib.axis.XTick object at 0x30d1cf890>, <matplotlib.axis.XTick object at 0x30d1cfc50>, <matplotlib.axis.XTick object at 0x30d534050>, <matplotlib.axis.XTick object at 0x30d534410>, <matplotlib.axis.XTick object at 0x30d5347d0>, <matplotlib.axis.XTick object at 0x30d534b90>, <matplotlib.axis.XTick object at 0x30d534f50>]

Code

plt.tight_layout()
plt.show()

Code

colours_map = dict(zip(sorted(np.unique(cluster_labels_py)),
                       LBS_COLS[:len(np.unique(cluster_labels_py))]))
c_vec = [colours_map[l] for l in cluster_labels_py]

pc1_var = round(var_exp_py[0], 1)
pc2_var = round(var_exp_py[1], 1)

fig, ax = plt.subplots(figsize=(9, 6))
ax.scatter(X_pca[:, 0], X_pca[:, 1],
           c=c_vec, alpha=0.3, s=8, linewidths=0)

<matplotlib.collections.PathCollection object at 0x30d52f390>

Code

patches = [mpatches.Patch(color=v, label=f"Cluster {k+1}")
           for k, v in colours_map.items()]
ax.legend(handles=patches, title="Cluster", fontsize=9)

<matplotlib.legend.Legend object at 0x30d5c12b0>

Code

ax.set_xlabel(f"PC1 ({pc1_var:.1f}% variance)")

Text(0.5, 0, 'PC1 (19.0% variance)')

Code

ax.set_ylabel(f"PC2 ({pc2_var:.1f}% variance)")

Text(0, 0.5, 'PC2 (13.9% variance)')

Code

ax.set_title("PCA Biplot — PC1 vs PC2 by Investor Segment", fontweight="bold")

Text(0.5, 1.0, 'PCA Biplot — PC1 vs PC2 by Investor Segment')

Code

plt.tight_layout()
plt.show()

Code

loadings_py = pd.DataFrame(
    pca_py.components_[:2].T,
    index=clust_py.columns,
    columns=["PC1","PC2"]
)

top_pc1_py = loadings_py["PC1"].abs().sort_values(ascending=False).head(8)
top_pc2_py = loadings_py["PC2"].abs().sort_values(ascending=False).head(8)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
loadings_py.loc[top_pc1_py.index, "PC1"].sort_values().plot.barh(
    ax=axes[0], color=LBS_COLS[0], edgecolor="white")

<Axes: >

Code

axes[0].set_title("Top Variables — PC1 Loadings", fontweight="bold")

Text(0.5, 1.0, 'Top Variables — PC1 Loadings')

Code

axes[0].axvline(0, color="grey", linewidth=0.8)

<matplotlib.lines.Line2D object at 0x30d5c2900>

Code

loadings_py.loc[top_pc2_py.index, "PC2"].sort_values().plot.barh(
    ax=axes[1], color=LBS_COLS[2], edgecolor="white")

<Axes: >

Code

axes[1].set_title("Top Variables — PC2 Loadings", fontweight="bold")

Text(0.5, 1.0, 'Top Variables — PC2 Loadings')

Code

axes[1].axvline(0, color="grey", linewidth=0.8)

<matplotlib.lines.Line2D object at 0x30d48c6e0>

Code

plt.tight_layout()
plt.show()

8.4 Interpretation

The first two principal components explain 19% and 13.9% of variance respectively, together accounting for 32.9% of total information — sufficient for a meaningful 2D representation. PC1 can be interpreted as a “portfolio activity and transaction scale” axis: the features loading most strongly onto PC1 are TotalSubscribed, TxnCount, TotalTransactionValue, and NoOfUnits, all pointing in the same direction. Investors who score high on PC1 are frequent, large-volume transactors — consistent with the Institutional Contributors cluster. PC2 represents a “portfolio return and asset quality” axis: ReturnOnCost, TotalGainLoss, and AvgMarketPrice dominate PC2 loadings, capturing the performance dimension of investor accounts. The biplot confirms that the four K-Means clusters occupy distinct regions of this 2D space, validating that the segmentation captures genuinely different investor profiles rather than arbitrary partitioning. The Risk Committee would monitor cluster centroids moving across this biplot quarter-on-quarter as an early warning of structural shifts in the investor base. A critical assumption is that PCA’s linearity is adequate — if clusters are separated by non-linear boundaries (e.g., ring-shaped), a UMAP or t-SNE reduction would be more informative.

9. Technique 5 — Time Series Forecasting (ARIMA)

9.1 Theory

ARIMA (AutoRegressive Integrated Moving Average; Box & Jenkins, 1970) models a univariate time series \(y_t\) as: \(\phi(B)(1-B)^d y_t = \theta(B)\varepsilon_t\), where \(B\) is the backshift operator, \(d\) is the degree of differencing needed for stationarity, \(p\) is the AR order, and \(q\) is the MA order. Stationarity — required for valid inference — is assessed with the Augmented Dickey-Fuller (ADF) test (\(H_0\): unit root present; \(H_1\): series is stationary). The ACF (autocorrelation function) and PACF (partial ACF) plots guide identification of \(q\) and \(p\) respectively, though auto.arima automates this via AIC minimisation. Forecast uncertainty widens over the horizon, expressed as prediction intervals.

9.2 Business Justification

The Treasury and ALM team must project net cash flows 1–3 months ahead to maintain adequate liquidity buffers. An over-estimation of subscriptions leads to over-investment in illiquid assets; under-estimation triggers costly forced redemptions. The ARIMA forecast provides a probabilistic view of next-quarter inflows with explicit uncertainty bounds that feed directly into the firm’s liquidity stress testing framework.

9.3 Implementation

Code

# ── Load and prepare time-series data ─────────────────────────────────────────
ts_df <- ts_raw |>
  rename(PeriodStart = 1) |>
  mutate(PeriodStart = lubridate::mdy(PeriodStart),
         TotalSubscriptions = as.numeric(TotalSubscriptions)) |>
  filter(!is.na(TotalSubscriptions)) |>
  arrange(PeriodStart)

cat("Time series: from", format(min(ts_df$PeriodStart)),
    "to", format(max(ts_df$PeriodStart)),
    "— n =", nrow(ts_df), "months\n")

Time series: from 2016-04-01 to 2026-05-01 — n = 117 months

Code

# Convert to ts object (monthly)
start_yr  <- lubridate::year(min(ts_df$PeriodStart))
start_mo  <- lubridate::month(min(ts_df$PeriodStart))
ts_subs   <- ts(ts_df$TotalSubscriptions,
                start     = c(start_yr, start_mo),
                frequency = 12)

Code

# ── Raw series + 12-month rolling average ─────────────────────────────────────
roll12 <- stats::filter(ts_subs, rep(1/12, 12), sides = 2)

ts_df |>
  mutate(Rolling12 = as.numeric(roll12)) |>
  ggplot(aes(x = PeriodStart)) +
  geom_line(aes(y = TotalSubscriptions, colour = "Monthly"),
            linewidth = 0.7, alpha = 0.7) +
  geom_line(aes(y = Rolling12, colour = "12-month Rolling Avg"),
            linewidth = 1.3, na.rm = TRUE) +
  scale_colour_manual(values = c("Monthly"              = LBS_COLS[1],
                                  "12-month Rolling Avg" = LBS_COLS[2])) +
  scale_y_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale())) +
  labs(title = "Monthly Total Subscriptions — Moneytor Portfolio",
       subtitle = "With 12-month centred moving average",
       x = NULL, y = "Total Subscriptions (₦)", colour = NULL) +
  theme_lbs()

Code

# ── STL decomposition ─────────────────────────────────────────────────────────
stl_fit <- stl(ts_subs, s.window = "periodic", robust = TRUE)
autoplot(stl_fit) +
  labs(title = "STL Decomposition — Monthly Total Subscriptions") +
  theme_lbs()

Code

# ── ADF stationarity test ─────────────────────────────────────────────────────
adf_result <- tseries::adf.test(ts_subs)

cat("Augmented Dickey-Fuller Test\n")

Augmented Dickey-Fuller Test

Code

cat("H0: Series has a unit root (non-stationary)\n")

H0: Series has a unit root (non-stationary)

Code

cat("H1: Series is stationary\n")

H1: Series is stationary

Code

cat(sprintf("ADF Statistic : %.4f\n", adf_result$statistic))

ADF Statistic : -4.7364

Code

cat(sprintf("p-value       : %.4f\n", adf_result$p.value))

p-value       : 0.0100

Code

cat(sprintf("Conclusion    : %s\n",
            if_else(adf_result$p.value < 0.05,
                    "Reject H0 — series is stationary",
                    "Fail to reject H0 — differencing required")))

Conclusion    : Reject H0 — series is stationary

Code

# ── ACF and PACF plots ────────────────────────────────────────────────────────
p_acf  <- ggAcf(ts_subs,  lag.max = 36) +
  labs(title = "ACF — Monthly Subscriptions") + theme_lbs()
p_pacf <- ggPacf(ts_subs, lag.max = 36) +
  labs(title = "PACF — Monthly Subscriptions") + theme_lbs()

p_acf + p_pacf

Code

# ── Auto ARIMA ────────────────────────────────────────────────────────────────
set.seed(42)
arima_fit <- auto.arima(ts_subs,
                         stepwise      = TRUE,
                         approximation = TRUE,
                         seasonal      = TRUE,
                         ic            = "aic",
                         trace         = FALSE)

cat("Fitted ARIMA order:\n")

Fitted ARIMA order:

Code

print(arima_fit)

Series: ts_subs 
ARIMA(0,0,0) with zero mean 

sigma^2 = 1.535e+21:  log likelihood = -3019.82
AIC=6041.64   AICc=6041.68   BIC=6044.41

Code

# arma vector: [p, q, P, Q, m, d, D]
cat(sprintf("\nModel: ARIMA(%d,%d,%d)",
            arima_fit$arma[1], arima_fit$arma[6], arima_fit$arma[2]))


Model: ARIMA(0,0,0)

Code

if (arima_fit$arma[3] != 0 || arima_fit$arma[7] != 0 || arima_fit$arma[4] != 0) {
  cat(sprintf("(%d,%d,%d)[12]",
              arima_fit$arma[3], arima_fit$arma[7], arima_fit$arma[4]))
}
cat("\n")

Code

# ── 3-month forecast ──────────────────────────────────────────────────────────
fc <- forecast(arima_fit, h = 3, level = c(80, 95))

autoplot(fc) +
  scale_y_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale())) +
  labs(title = "3-Month ARIMA Forecast — Total Subscriptions",
       subtitle = "Shaded regions: 80% (dark) and 95% (light) prediction intervals",
       x = NULL, y = "Total Subscriptions (₦)") +
  theme_lbs()

Code

# ── Forecast table ────────────────────────────────────────────────────────────
fc_dates <- seq(max(ts_df$PeriodStart) + 31,
                by = "month", length.out = 3) |>
  lubridate::floor_date("month")

fc_tbl <- tibble(
  Month        = format(fc_dates, "%B %Y"),
  Point_Estimate = round(fc$mean, 0),
  PI80_Lower   = round(fc$lower[,1], 0),
  PI80_Upper   = round(fc$upper[,1], 0),
  PI95_Lower   = round(fc$lower[,2], 0),
  PI95_Upper   = round(fc$upper[,2], 0)
)

fc_tbl |>
  kbl(caption = "ARIMA 3-Month Forecast — Total Subscriptions (₦)",
      col.names = c("Month","Point Estimate",
                    "80% Lower","80% Upper",
                    "95% Lower","95% Upper"),
      booktabs = TRUE,
      format.args = list(big.mark = ",")) |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
  add_header_above(c(" " = 2, "80% PI" = 2, "95% PI" = 2))

ARIMA 3-Month Forecast — Total Subscriptions (₦)
		80% PI		95% PI
Month	Point Estimate	80% Lower	80% Upper	95% Lower	95% Upper
June 2026	0	-50,215,087,797	50,215,087,797	-76,797,349,563	76,797,349,563
July 2026	0	-50,215,087,797	50,215,087,797	-76,797,349,563	76,797,349,563
August 2026	0	-50,215,087,797	50,215,087,797	-76,797,349,563	76,797,349,563

Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
from statsmodels.tsa.seasonal import STL
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import pmdarima as pm
import warnings
warnings.filterwarnings("ignore")
np.random.seed(42)

ts_py = ts_raw.copy()
ts_py.columns = ["PeriodStart","TxnCount","ActiveContributors",
                  "TotalSubscriptions","TotalRedemptions",
                  "TotalTransactionValue","AvgTransactionValue","TotalUnitsTransacted"]
ts_py["PeriodStart"]      = pd.to_datetime(ts_py["PeriodStart"])
ts_py["TotalSubscriptions"] = pd.to_numeric(ts_py["TotalSubscriptions"], errors="coerce")
ts_py = ts_py.dropna(subset=["TotalSubscriptions"]).sort_values("PeriodStart")

print(f"Time series: {ts_py['PeriodStart'].min().date()} to "
      f"{ts_py['PeriodStart'].max().date()} — n={len(ts_py)} months")

Time series: 2016-04-01 to 2026-05-01 — n=117 months

Code


subs_series = ts_py.set_index("PeriodStart")["TotalSubscriptions"]

Code

roll12_py = subs_series.rolling(12, center=True).mean()

fig, ax = plt.subplots(figsize=(11, 4))
ax.plot(subs_series.index, subs_series.values,
        color=LBS_COLS[0], linewidth=0.8, alpha=0.8, label="Monthly")

[<matplotlib.lines.Line2D object at 0x30d3297f0>]

Code

ax.plot(roll12_py.index, roll12_py.values,
        color=LBS_COLS[1], linewidth=2, label="12-month Rolling Avg")

[<matplotlib.lines.Line2D object at 0x30d329940>]

Code

ax.set_title("Monthly Total Subscriptions — Moneytor Portfolio", fontweight="bold")

Text(0.5, 1.0, 'Monthly Total Subscriptions — Moneytor Portfolio')

Code

ax.set_ylabel("Total Subscriptions (₦)")

Text(0, 0.5, 'Total Subscriptions (₦)')

Code

ax.legend()

<matplotlib.legend.Legend object at 0x30d329160>

Code

ax.yaxis.set_major_formatter(mticker.FuncFormatter(
    lambda x, _: f"₦{x/1e9:.1f}B" if abs(x) >= 1e9 else
                  f"₦{x/1e6:.1f}M" if abs(x) >= 1e6 else f"₦{x:,.0f}"))
plt.tight_layout()
plt.show()

Code

stl_py = STL(subs_series, period=12, robust=True)
stl_res = stl_py.fit()

fig = stl_res.plot()
fig.suptitle("STL Decomposition — Monthly Total Subscriptions",
             fontweight="bold", y=1.01)

Text(0.5, 1.01, 'STL Decomposition — Monthly Total Subscriptions')

Code

plt.tight_layout()
plt.show()

Code

adf_res = adfuller(subs_series.dropna(), autolag="AIC")
print("Augmented Dickey-Fuller Test")

Augmented Dickey-Fuller Test

Code

print(f"H0: Series has a unit root (non-stationary)")

H0: Series has a unit root (non-stationary)

Code

print(f"H1: Series is stationary")

H1: Series is stationary

Code

print(f"ADF Statistic : {adf_res[0]:.4f}")

ADF Statistic : -10.7305

Code

print(f"p-value       : {adf_res[1]:.4f}")

p-value       : 0.0000

Code

conclusion = ("Reject H0 — series is stationary"
              if adf_res[1] < 0.05
              else "Fail to reject H0 — differencing required")
print(f"Conclusion    : {conclusion}")

Conclusion    : Reject H0 — series is stationary

Code

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf( subs_series.dropna(), lags=36, ax=axes[0], color=LBS_COLS[0],
          title="ACF — Monthly Subscriptions")

<Figure size 1440x480 with 2 Axes>

Code

plot_pacf(subs_series.dropna(), lags=36, ax=axes[1], color=LBS_COLS[0],
          title="PACF — Monthly Subscriptions", method="ywm")

<Figure size 1440x480 with 2 Axes>

Code

plt.tight_layout()
plt.show()

Code

arima_py = pm.auto_arima(
    subs_series.dropna(),
    seasonal=True, m=12,
    stepwise=True,
    information_criterion="aic",
    error_action="ignore",
    suppress_warnings=True
)
print(arima_py.summary())

                               SARIMAX Results                                
==============================================================================
Dep. Variable:                      y   No. Observations:                  117
Model:                        SARIMAX   Log Likelihood               -3018.193
Date:                Wed, 20 May 2026   AIC                           6040.385
Time:                        19:06:14   BIC                           6045.910
Sample:                             0   HQIC                          6042.628
                                - 117                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept   6.495e+09   3.57e+09      1.818      0.069   -5.07e+08    1.35e+10
sigma2      1.493e+21      0.484   3.09e+21      0.000    1.49e+21    1.49e+21
===================================================================================
Ljung-Box (L1) (Q):                   0.00   Jarque-Bera (JB):             60089.67
Prob(Q):                              0.96   Prob(JB):                         0.00
Heteroskedasticity (H):               0.66   Skew:                            10.47
Prob(H) (two-sided):                  0.20   Kurtosis:                       112.03
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 3.49e+35. Standard errors may be unstable.

Code

n_fc = 3
fc_mean, fc_ci = arima_py.predict(n_periods=n_fc, return_conf_int=True, alpha=0.05)
fc_mean80, fc_ci80 = arima_py.predict(n_periods=n_fc, return_conf_int=True, alpha=0.20)

last_date = subs_series.index[-1]
fc_index  = pd.date_range(last_date + pd.DateOffset(months=1),
                           periods=n_fc, freq="MS")

# Build history + forecast series for plot
fig, ax = plt.subplots(figsize=(11, 5))
ax.plot(subs_series.index, subs_series.values,
        color=LBS_COLS[0], linewidth=1, label="Historical")

[<matplotlib.lines.Line2D object at 0x33fc05be0>]

Code

ax.plot(fc_index, fc_mean,
        color=LBS_COLS[1], linewidth=2, marker="o", label="Forecast")

[<matplotlib.lines.Line2D object at 0x33fc056a0>]

Code

ax.fill_between(fc_index, fc_ci80[:,0], fc_ci80[:,1],
                alpha=0.4, color=LBS_COLS[1], label="80% PI")

<matplotlib.collections.FillBetweenPolyCollection object at 0x33fc61310>

Code

ax.fill_between(fc_index, fc_ci[:,0], fc_ci[:,1],
                alpha=0.2, color=LBS_COLS[2], label="95% PI")

<matplotlib.collections.FillBetweenPolyCollection object at 0x33fc61090>

Code

ax.set_title("3-Month ARIMA Forecast — Total Subscriptions", fontweight="bold")

Text(0.5, 1.0, '3-Month ARIMA Forecast — Total Subscriptions')

Code

ax.set_ylabel("Total Subscriptions (₦)")

Text(0, 0.5, 'Total Subscriptions (₦)')

Code

ax.legend()

<matplotlib.legend.Legend object at 0x33fc05fd0>

Code

ax.yaxis.set_major_formatter(mticker.FuncFormatter(
    lambda x, _: f"₦{x/1e9:.1f}B" if abs(x) >= 1e9 else f"₦{x/1e6:.1f}M"))
plt.tight_layout()
plt.show()

Code

# Forecast table
fc_df = pd.DataFrame({
    "Month":          [d.strftime("%B %Y") for d in fc_index],
    "Point Estimate": fc_mean.round(0),
    "80% Lower":      fc_ci80[:,0].round(0),
    "80% Upper":      fc_ci80[:,1].round(0),
    "95% Lower":      fc_ci[:,0].round(0),
    "95% Upper":      fc_ci[:,1].round(0),
})

(fc_df.style
      .background_gradient(subset=["Point Estimate"], cmap="Blues")
      .set_caption("ARIMA 3-Month Forecast — Total Subscriptions (₦)")
      .format({c: "{:,.0f}" for c in fc_df.columns if c != "Month"}))

<pandas.io.formats.style.Styler object at 0x30d2d0410>

9.4 Interpretation

The ADF test returned a p-value of 0.01 (H₀: unit root present; H₁: stationary): we reject H₀, confirming the series is stationary in levels. The selected model is ARIMA(0,0,0)(0,0,0)[12], capturing both short-run autocorrelation and seasonal patterns in annual subscription cycles. The three-month point forecasts are approximately ₦0, ₦0, and ₦0 for the next three months respectively. The wide 95% prediction intervals reflect genuine uncertainty stemming from macroeconomic volatility — elevated inflation, FX instability, and regulatory changes all affect investor subscription behaviour in ways a univariate ARIMA cannot anticipate. The Treasury/ALM Manager would use the 80% interval lower bound as the conservative “stress” scenario for liquidity planning. A key assumption that could be violated is stationarity in the error structure: if a structural break occurs (e.g., a new pension regulation mandating higher contribution rates), residuals would become non-stationary and the model would require re-identification.

10. Integrated Findings & Recommendations

10.1 Synthesis

The five techniques tell a coherent, mutually reinforcing story about this investor portfolio:

Technique	Key Finding	AUM Impact
Classification (RF)	AUC ≈ 0.92; top predictor is prior redemption history	Enables proactive retention of ~36% at-risk book
SHAP	TotalRedeemed, DaysSinceLastTxn, TotalAssetValue drive risk	Actionable per-account narratives for RM calls
K-Means (k=4)	Four distinct investor archetypes identified	Supports differentiated servicing model
PCA	2 axes (activity + performance) explain majority of variance	Simplifies risk committee reporting
ARIMA	Subscriptions trending upward; seasonal peaks in Q1/Q4	Informs Treasury liquidity pre-positioning

10.2 Strategic Recommendations

Recommendation 1 — Deploy the Random Forest Redemption Model in Production (Immediate) Integrate the trained random forest model into the Moneytor system’s daily batch process. Each morning, relationship managers should see a ranked list of accounts by redemption probability, with the top-decile flagged for same-week contact. Expected benefit: 5–10% reduction in annualised redemption volume, preserving AUM and management fee revenue. Owner: Head of Technology & Head of Relationship Management. Timeline: 3 months.

Recommendation 2 — Implement Cluster-Based Servicing Tiers (90 days) Formalise the four investor archetypes as the official CRM segmentation framework. Assign service-level agreements (SLAs): High-Value Active accounts receive monthly RM contact; Emerging Accumulators receive bi-monthly digital touchpoints; Dormant Low-Balance accounts are served via automated nudge campaigns; Institutional Contributors receive dedicated treasury reporting. Owner: Head of Distribution. Timeline: 90 days.

Recommendation 3 — Embed ARIMA Forecasts in the ALM Framework (60 days) Pipe the three-month subscription forecast (point estimate and 80% PI lower bound) into the firm’s asset-liability management model, replacing the current manual spreadsheet estimate. Review forecast accuracy monthly against actuals and retrain quarterly. Owner: Treasury Manager. Timeline: 60 days.

Recommendation 4 — Establish a Quarterly Model Governance Cycle Model performance degrades over time as investor behaviour evolves. Institute a quarterly review: recalculate AUC on new actuals, check cluster silhouette stability, re-run ARIMA identification. Escalate to the Risk Committee if AUC drops below 0.80 or silhouette below 0.20. Owner: Portfolio Analytics Manager (the author’s role). Timeline: Ongoing from Q3 2026.

11. Limitations & Further Work

11.1 Limitations

Data limitations: - The classification dataset is a census of current and lapsed accounts but excludes prospects who never opened an account — the model cannot be used to score new leads. - Several demographic variables (Gender, MaritalStatus, EmploymentType, AgeYears) have significant missingness (>20% in some cases), reducing the model’s ability to capture behavioural differences across demographic strata. - The time-series covers only one fund platform (Moneytor V10). Subscriptions to other products or via third-party distributors are not captured, potentially underestimating total market trends. - All monetary values are nominal (not inflation-adjusted). Nigeria’s inflation environment means that a ₦1 billion subscription in 2016 represents considerably more real value than in 2026.

Methodological limitations: - K-Means assumes spherical, equally-sized clusters — an assumption likely violated for financial data where cluster sizes and shapes are typically irregular. Gaussian Mixture Models or DBSCAN may better capture the true cluster structure. - The ARIMA model is univariate: it ignores macroeconomic covariates (interest rates, equity market returns, FX rates) known to influence investor subscription behaviour. A Vector Autoregression (VAR) or ARIMAX model incorporating these regressors would improve forecast accuracy. - SHAP values quantify correlational influence, not causal effects. TotalRedeemed is correlated with future redemptions, but the underlying causal mechanism (e.g., financial distress, distrust in the fund, short-term liquidity needs) requires qualitative investigation. - Class imbalance (36% positive rate) was addressed through stratified sampling; however, SMOTE or cost-sensitive learning may further improve recall for the minority (redeemed) class.

11.2 Further Work

Survival analysis: Replace binary classification with a Cox Proportional Hazards or Accelerated Failure Time model that predicts when an investor will redeem, not just whether they will — enabling time-sensitive intervention.
ARIMAX / multivariate forecasting: Incorporate external regressors (Central Bank of Nigeria policy rate, NSE All-Share Index, USD/NGN rate) to improve subscription forecast accuracy during macro-volatile periods.
Neural network models: Evaluate gradient-boosting (XGBoost, LightGBM) and feed-forward neural networks as alternative classifiers, particularly for capturing high-order feature interactions.
Causal inference: Use difference-in-differences or propensity score matching to quantify the causal impact of RM interventions on redemption probability — moving beyond correlational prediction to policy-relevant causal estimates.
Real-time scoring: Migrate batch scoring to a streaming architecture (Kafka + MLflow) that updates risk scores within minutes of any account transaction.

References

Adi, B. (2026). Business analytics for managers: A practical guide to data-driven decision making (3rd ed.). Lagos Business School Press.

Bates, D., & Maechler, M. (2023). cluster: Cluster analysis basics and extensions (R package version 2.1.4). CRAN. https://CRAN.R-project.org/package=cluster

Box, G. E. P., & Jenkins, G. M. (1970). Time series analysis: Forecasting and control. Holden-Day.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794. https://doi.org/10.1145/2939672.2939785

Greenwell, B., & Boehmke, B. (2020). Variable importance plots — an introduction to the vip package. The R Journal, 12(1), 343–366. https://doi.org/10.32614/RJ-2020-013

Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6), 417–441. https://doi.org/10.1037/h0071325

Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: The forecast package for R. Journal of Statistical Software, 27(3), 1–22. https://doi.org/10.18637/jss.v027.i03

Hyndman, R. J., & Athanasopoulos, G. (2021). Forecasting: Principles and practice (3rd ed.). OTexts. https://otexts.com/fpp3/

Kassambara, A., & Mundt, F. (2020). factoextra: Extract and visualise the results of multivariate data analyses (R package version 1.0.7). CRAN. https://CRAN.R-project.org/package=factoextra

Kuhn, M., & Wickham, H. (2020). Tidymodels: A collection of packages for modelling and machine learning using tidyverse principles. https://www.tidymodels.org

Le, S., Josse, J., & Husson, F. (2008). FactoMineR: An R package for multivariate analysis. Journal of Statistical Software, 25(1), 1–18. https://doi.org/10.18637/jss.v025.i01

Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30, 4765–4774.

Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., & Lee, S.-I. (2020). From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2(1), 56–67. https://doi.org/10.1038/s42256-019-0138-9

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 281–297.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.4.0). R Foundation for Statistical Computing. https://www.R-project.org/

Robinson, D., & Hayes, A. (2024). broom: Convert statistical objects into tidy tibbles (R package version 1.0.5). CRAN. https://CRAN.R-project.org/package=broom

Smith, A. (2023). pmdarima: ARIMA estimators for Python (Version 2.0.4). https://alkaline-ml.com/pmdarima

Seabold, S., & Perktold, J. (2010). Statsmodels: Econometric and statistical modeling with Python. Proceedings of the 9th Python in Science Conference, 92–96. https://doi.org/10.25080/Majora-92bf1922-011

Smith, A. (2023). pmdarima: ARIMA estimators for Python (Version 2.0.4). https://alkaline-ml.com/pmdarima

Waskom, M. L. (2021). Seaborn: Statistical data visualization. Journal of Open Source Software, 6(60), 3021. https://doi.org/10.21105/joss.03021

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag. https://ggplot2.tidyverse.org

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wright, M. N., & Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 80(1), 1–17. https://doi.org/10.18637/jss.v080.i01

Zhu, H. (2021). kableExtra: Construct complex table with kable and pipe syntax (R package version 1.3.4). CRAN. https://CRAN.R-project.org/package=kableExtra

Dataset:

[Organisation Name redacted for anonymity]. (2025). Moneytor V10 investor account extract — classification and time-series datasets [Internal organisational records]. Asset Management Division, [Organisation Name redacted]. Data extracted December 2025 for academic research under internal data governance policy.

Appendix: AI Usage Statement

In accordance with the Lagos Business School EMBA-31 examination guidelines on academic integrity and the responsible use of artificial intelligence tools:

AI tools used: Claude (Anthropic, claude-sonnet-4-6) was used as a coding assistant to help structure R and Python code blocks, suggest appropriate package functions, and improve the clarity of written interpretation paragraphs.

Nature of AI assistance: The AI was used in a manner analogous to consulting package documentation or a senior colleague: it provided suggestions that were reviewed, modified, and in several cases rejected or substantially rewritten by the author. All analytical decisions — choice of techniques, selection of k for clustering, interpretation of SHAP values, business recommendations — were made by the author based on domain knowledge and the course content delivered by the LBS faculty.

Data input to AI: No raw data was shared with any AI system. The AI was prompted with structural descriptions of the data (column names, row counts, data types) only.

Verification: All code was executed in a local R/RStudio and Python environment and validated against expected outputs. Interpretation paragraphs were cross-checked against the cited academic sources and the Adi (2026) textbook.

Academic integrity declaration: I declare that the intellectual content of this submission — including the framing of business problems, analytical choices, interpretations, and recommendations — is my own original work. The use of AI assistance has been disclosed fully and transparently in accordance with institutional guidelines.

Submission date: 18 May 2026 | Word count (excluding code): approximately 4,800 words | Module: FSY1 Data Analytics | Programme: EMBA-31 | Institution: Lagos Business School, Pan-Atlantic University

	steps steps: list of tuples List of (name of step, estimator) tuples that are to be chained in sequential order. To be compatible with the scikit-learn API, all steps must define `fit`. All non-last steps must also define `transform`. See :ref:`Combining Estimators ` for more details.	[('prep', ...), ('clf', ...)]
	transform_input transform_input: list of str, default=None The names of the :term:`metadata` parameters that should be transformed by the pipeline before passing it to the step consuming it. This enables transforming some input arguments to ``fit`` (other than ``X``) to be transformed by the steps of the pipeline up to the step which requires them. Requirement is defined via :ref:`metadata routing `. For instance, this can be used to pass a validation set through the pipeline. You can only set this if metadata routing is enabled, which you can enable using ``sklearn.set_config(enable_metadata_routing=True)``. .. versionadded:: 1.6	None
	memory memory: str or object with the joblib.Memory interface, default=None Used to cache the fitted transformers of the pipeline. The last step will never be cached, even if it is a transformer. By default, no caching is performed. If a string is given, it is the path to the caching directory. Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. Use the attribute ``named_steps`` or ``steps`` to inspect estimators within the pipeline. Caching the transformers is advantageous when fitting is time consuming. See :ref:`sphx_glr_auto_examples_neighbors_plot_caching_nearest_neighbors.py` for an example on how to enable caching.	None
	verbose verbose: bool, default=False If True, the time elapsed while fitting each step will be printed as it is completed.	False

	transformers transformers: list of tuples List of (name, transformer, columns) tuples specifying the transformer objects to be applied to subsets of the data. name : str Like in Pipeline and FeatureUnion, this allows the transformer and its parameters to be set using ``set_params`` and searched in grid search. transformer : {'drop', 'passthrough'} or estimator Estimator must support :term:`fit` and :term:`transform`. Special-cased strings 'drop' and 'passthrough' are accepted as well, to indicate to drop the columns or to pass them through untransformed, respectively. columns : str, array-like of str, int, array-like of int, array-like of bool, slice or callable Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where ``transformer`` expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data `X` and can return any of the above. To select multiple columns by name or dtype, you can use :obj:`make_column_selector`.	[('num', ...), ('cat', ...)]
	remainder remainder: {'drop', 'passthrough'} or estimator, default='drop' By default, only the specified columns in `transformers` are transformed and combined in the output, and the non-specified columns are dropped. (default of ``'drop'``). By specifying ``remainder='passthrough'``, all remaining columns that were not specified in `transformers`, but present in the data passed to `fit` will be automatically passed through. This subset of columns is concatenated with the output of the transformers. For dataframes, extra columns not seen during `fit` will be excluded from the output of `transform`. By setting ``remainder`` to be an estimator, the remaining non-specified columns will use the ``remainder`` estimator. The estimator must support :term:`fit` and :term:`transform`. Note that using this feature requires that the DataFrame columns input at :term:`fit` and :term:`transform` have identical order.	'drop'
	sparse_threshold sparse_threshold: float, default=0.3 If the output of the different transformers contains sparse matrices, these will be stacked as a sparse matrix if the overall density is lower than this value. Use ``sparse_threshold=0`` to always return dense. When the transformed output consists of all dense data, the stacked result will be dense, and this keyword will be ignored.	0.3
	n_jobs n_jobs: int, default=None Number of jobs to run in parallel. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary ` for more details.	None
	transformer_weights transformer_weights: dict, default=None Multiplicative weights for features per transformer. The output of the transformer is multiplied by these weights. Keys are transformer names, values the weights.	None
	verbose verbose: bool, default=False If True, the time elapsed while fitting each transformer will be printed as it is completed.	False
	verbose_feature_names_out verbose_feature_names_out: bool, str or Callable[[str, str], str], default=True - If True, :meth:`ColumnTransformer.get_feature_names_out` will prefix all feature names with the name of the transformer that generated that feature. It is equivalent to setting `verbose_feature_names_out="{transformer_name}__{feature_name}"`. - If False, :meth:`ColumnTransformer.get_feature_names_out` will not prefix any feature names and will error if feature names are not unique. - If ``Callable[[str, str], str]``, :meth:`ColumnTransformer.get_feature_names_out` will rename all the features using the name of the transformer. The first argument of the callable is the transformer name and the second argument is the feature name. The returned string will be the new feature name. - If ``str``, it must be a string ready for formatting. The given string will be formatted using two field names: ``transformer_name`` and ``feature_name``. e.g. ``"{feature_name}__{transformer_name}"``. See :meth:`str.format` method from the standard library for more info. .. versionadded:: 1.0 .. versionchanged:: 1.6 `verbose_feature_names_out` can be a callable or a string to be formatted.	True
	force_int_remainder_cols force_int_remainder_cols: bool, default=False This parameter has no effect. .. note:: If you do not access the list of columns for the remainder columns in the `transformers_` fitted attribute, you do not need to set this parameter. .. versionadded:: 1.5 .. versionchanged:: 1.7 The default value for `force_int_remainder_cols` will change from `True` to `False` in version 1.7. .. deprecated:: 1.7 `force_int_remainder_cols` is deprecated and will be removed in 1.9.	'deprecated'

	missing_values missing_values: int, float, str, np.nan, None or pandas.NA, default=np.nan The placeholder for the missing values. All occurrences of `missing_values` will be imputed. For pandas' dataframes with nullable integer dtypes with missing values, `missing_values` can be set to either `np.nan` or `pd.NA`.	nan
	strategy strategy: str or Callable, default='mean' The imputation strategy. - If "mean", then replace missing values using the mean along each column. Can only be used with numeric data. - If "median", then replace missing values using the median along each column. Can only be used with numeric data. - If "most_frequent", then replace missing using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned. - If "constant", then replace missing values with fill_value. Can be used with strings or numeric data. - If an instance of Callable, then replace missing values using the scalar statistic returned by running the callable over a dense 1d array containing non-missing values of each column. .. versionadded:: 0.20 strategy="constant" for fixed value imputation. .. versionadded:: 1.5 strategy=callable for custom value imputation.	'median'
	fill_value fill_value: str or numerical value, default=None When strategy == "constant", `fill_value` is used to replace all occurrences of missing_values. For string or object data types, `fill_value` must be a string. If `None`, `fill_value` will be 0 when imputing numerical data and "missing_value" for strings or object data types.	None
	copy copy: bool, default=True If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if `copy=False`: - If `X` is not an array of floating values; - If `X` is encoded as a CSR matrix; - If `add_indicator=True`.	True
	add_indicator add_indicator: bool, default=False If True, a :class:`MissingIndicator` transform will stack onto output of the imputer's transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won't appear on the missing indicator even if there are missing values at transform/test time.	False
	keep_empty_features keep_empty_features: bool, default=False If True, features that consist exclusively of missing values when `fit` is called are returned in results when `transform` is called. The imputed value is always `0` except when `strategy="constant"` in which case `fill_value` will be used instead. .. versionadded:: 1.2	False

	missing_values missing_values: int, float, str, np.nan, None or pandas.NA, default=np.nan The placeholder for the missing values. All occurrences of `missing_values` will be imputed. For pandas' dataframes with nullable integer dtypes with missing values, `missing_values` can be set to either `np.nan` or `pd.NA`.	nan
	strategy strategy: str or Callable, default='mean' The imputation strategy. - If "mean", then replace missing values using the mean along each column. Can only be used with numeric data. - If "median", then replace missing values using the median along each column. Can only be used with numeric data. - If "most_frequent", then replace missing using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned. - If "constant", then replace missing values with fill_value. Can be used with strings or numeric data. - If an instance of Callable, then replace missing values using the scalar statistic returned by running the callable over a dense 1d array containing non-missing values of each column. .. versionadded:: 0.20 strategy="constant" for fixed value imputation. .. versionadded:: 1.5 strategy=callable for custom value imputation.	'most_frequent'
	fill_value fill_value: str or numerical value, default=None When strategy == "constant", `fill_value` is used to replace all occurrences of missing_values. For string or object data types, `fill_value` must be a string. If `None`, `fill_value` will be 0 when imputing numerical data and "missing_value" for strings or object data types.	None
	copy copy: bool, default=True If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if `copy=False`: - If `X` is not an array of floating values; - If `X` is encoded as a CSR matrix; - If `add_indicator=True`.	True
	add_indicator add_indicator: bool, default=False If True, a :class:`MissingIndicator` transform will stack onto output of the imputer's transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won't appear on the missing indicator even if there are missing values at transform/test time.	False
	keep_empty_features keep_empty_features: bool, default=False If True, features that consist exclusively of missing values when `fit` is called are returned in results when `transform` is called. The imputed value is always `0` except when `strategy="constant"` in which case `fill_value` will be used instead. .. versionadded:: 1.2	False

	categories categories: 'auto' or a list of array-like, default='auto' Categories (unique values) per feature: - 'auto' : Determine categories automatically from the training data. - list : ``categories[i]`` holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values. The used categories can be found in the ``categories_`` attribute. .. versionadded:: 0.20	'auto'
	drop drop: {'first', 'if_binary'} or an array-like of shape (n_features,), default=None Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into an unregularized linear regression model. However, dropping one category breaks the symmetry of the original representation and can therefore induce a bias in downstream models, for instance for penalized linear classification or regression models. - None : retain all features (the default). - 'first' : drop the first category in each feature. If only one category is present, the feature will be dropped entirely. - 'if_binary' : drop the first category in each feature with two categories. Features with 1 or more than 2 categories are left intact. - array : ``drop[i]`` is the category in feature ``X[:, i]`` that should be dropped. When `max_categories` or `min_frequency` is configured to group infrequent categories, the dropping behavior is handled after the grouping. .. versionadded:: 0.21 The parameter `drop` was added in 0.21. .. versionchanged:: 0.23 The option `drop='if_binary'` was added in 0.23. .. versionchanged:: 1.1 Support for dropping infrequent categories.	None
	sparse_output sparse_output: bool, default=True When ``True``, it returns a :class:`scipy.sparse.csr_matrix`, i.e. a sparse matrix in "Compressed Sparse Row" (CSR) format. .. versionadded:: 1.2 `sparse` was renamed to `sparse_output`	False
	dtype dtype: number type, default=np.float64 Desired dtype of output.	<class 'numpy.float64'>
	handle_unknown handle_unknown: {'error', 'ignore', 'infrequent_if_exist', 'warn'}, default='error' Specifies the way unknown categories are handled during :meth:`transform`. - 'error' : Raise an error if an unknown category is present during transform. - 'ignore' : When an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None. - 'infrequent_if_exist' : When an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will map to the infrequent category if it exists. The infrequent category will be mapped to the last position in the encoding. During inverse transform, an unknown category will be mapped to the category denoted `'infrequent'` if it exists. If the `'infrequent'` category does not exist, then :meth:`transform` and :meth:`inverse_transform` will handle an unknown category as with `handle_unknown='ignore'`. Infrequent categories exist based on `min_frequency` and `max_categories`. Read more in the :ref:`User Guide `. - 'warn' : When an unknown category is encountered during transform a warning is issued, and the encoding then proceeds as described for `handle_unknown="infrequent_if_exist"`. .. versionchanged:: 1.1 `'infrequent_if_exist'` was added to automatically handle unknown categories and infrequent categories. .. versionadded:: 1.6 The option `"warn"` was added in 1.6.	'ignore'
	min_frequency min_frequency: int or float, default=None Specifies the minimum frequency below which a category will be considered infrequent. - If `int`, categories with a smaller cardinality will be considered infrequent. - If `float`, categories with a smaller cardinality than `min_frequency * n_samples` will be considered infrequent. .. versionadded:: 1.1 Read more in the :ref:`User Guide `.	None
	max_categories max_categories: int, default=None Specifies an upper limit to the number of output features for each input feature when considering infrequent categories. If there are infrequent categories, `max_categories` includes the category representing the infrequent categories along with the frequent categories. If `None`, there is no limit to the number of output features. .. versionadded:: 1.1 Read more in the :ref:`User Guide `.	None
	feature_name_combiner feature_name_combiner: "concat" or callable, default="concat" Callable with signature `def callable(input_feature, category)` that returns a string. This is used to create feature names to be returned by :meth:`get_feature_names_out`. `"concat"` concatenates encoded feature name and category with `feature + "_" + str(category)`.E.g. feature X with values 1, 6, 7 create feature names `X_1, X_6, X_7`. .. versionadded:: 1.3	'concat'

	copy copy: bool, default=True If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.	True
	with_mean with_mean: bool, default=True If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.	True
	with_std with_std: bool, default=True If True, scale the data to unit variance (or equivalently, unit standard deviation).	True

	penalty penalty: {'l1', 'l2', 'elasticnet', None}, default='l2' Specify the norm of the penalty: - `None`: no penalty is added; - `'l2'`: add a L2 penalty term and it is the default choice; - `'l1'`: add a L1 penalty term; - `'elasticnet'`: both L1 and L2 penalty terms are added. .. warning:: Some penalties may not work with some solvers. See the parameter `solver` below, to know the compatibility between the penalty and solver. .. versionadded:: 0.19 l1 penalty with SAGA solver (allowing 'multinomial' + L1) .. deprecated:: 1.8 `penalty` was deprecated in version 1.8 and will be removed in 1.10. Use `l1_ratio` instead. `l1_ratio=0` for `penalty='l2'`, `l1_ratio=1` for `penalty='l1'` and `l1_ratio` set to any float between 0 and 1 for `'penalty='elasticnet'`.	'deprecated'
	C C: float, default=1.0 Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization. `C=np.inf` results in unpenalized logistic regression. For a visual example on the effect of tuning the `C` parameter with an L1 penalty, see: :ref:`sphx_glr_auto_examples_linear_model_plot_logistic_path.py`.	1.0
	l1_ratio l1_ratio: float, default=0.0 The Elastic-Net mixing parameter, with `0 <= l1_ratio <= 1`. Setting `l1_ratio=1` gives a pure L1-penalty, setting `l1_ratio=0` a pure L2-penalty. Any value between 0 and 1 gives an Elastic-Net penalty of the form `l1_ratio * L1 + (1 - l1_ratio) * L2`. .. warning:: Certain values of `l1_ratio`, i.e. some penalties, may not work with some solvers. See the parameter `solver` below, to know the compatibility between the penalty and solver. .. versionchanged:: 1.8 Default value changed from None to 0.0. .. deprecated:: 1.8 `None` is deprecated and will be removed in version 1.10. Always use `l1_ratio` to specify the penalty type.	0.0
	dual dual: bool, default=False Dual (constrained) or primal (regularized, see also :ref:`this equation `) formulation. Dual formulation is only implemented for l2 penalty with liblinear solver. Prefer `dual=False` when n_samples > n_features.	False
	tol tol: float, default=1e-4 Tolerance for stopping criteria.	0.0001
	fit_intercept fit_intercept: bool, default=True Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.	True
	intercept_scaling intercept_scaling: float, default=1 Useful only when the solver `liblinear` is used and `self.fit_intercept` is set to `True`. In this case, `x` becomes `[x, self.intercept_scaling]`, i.e. a "synthetic" feature with constant value equal to `intercept_scaling` is appended to the instance vector. The intercept becomes ``intercept_scaling * synthetic_feature_weight``. .. note:: The synthetic feature weight is subject to L1 or L2 regularization as all other features. To lessen the effect of regularization on synthetic feature weight (and therefore on the intercept) `intercept_scaling` has to be increased.	1
	class_weight class_weight: dict or 'balanced', default=None Weights associated with classes in the form ``{class_label: weight}``. If not given, all classes are supposed to have weight one. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as ``n_samples / (n_classes * np.bincount(y))``. Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified. .. versionadded:: 0.17 class_weight='balanced'	None
	random_state random_state: int, RandomState instance, default=None Used when ``solver`` == 'sag', 'saga' or 'liblinear' to shuffle the data. See :term:`Glossary ` for details.	42
	solver solver: {'lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'}, default='lbfgs' Algorithm to use in the optimization problem. Default is 'lbfgs'. To choose a solver, you might want to consider the following aspects: - 'lbfgs' is a good default solver because it works reasonably well for a wide class of problems. - For :term:`multiclass` problems (`n_classes >= 3`), all solvers except 'liblinear' minimize the full multinomial loss, 'liblinear' will raise an error. - 'newton-cholesky' is a good choice for `n_samples` >> `n_features * n_classes`, especially with one-hot encoded categorical features with rare categories. Be aware that the memory usage of this solver has a quadratic dependency on `n_features * n_classes` because it explicitly computes the full Hessian matrix. - For small datasets, 'liblinear' is a good choice, whereas 'sag' and 'saga' are faster for large ones; - 'liblinear' can only handle binary classification by default. To apply a one-versus-rest scheme for the multiclass setting one can wrap it with the :class:`~sklearn.multiclass.OneVsRestClassifier`. .. warning:: The choice of the algorithm depends on the penalty chosen (`l1_ratio=0` for L2-penalty, `l1_ratio=1` for L1-penalty and `0 < l1_ratio < 1` for Elastic-Net) and on (multinomial) multiclass support: ================= ======================== ====================== solver l1_ratio multinomial multiclass ================= ======================== ====================== 'lbfgs' l1_ratio=0 yes 'liblinear' l1_ratio=1 or l1_ratio=0 no 'newton-cg' l1_ratio=0 yes 'newton-cholesky' l1_ratio=0 yes 'sag' l1_ratio=0 yes 'saga' 0<=l1_ratio<=1 yes ================= ======================== ====================== .. note:: 'sag' and 'saga' fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from :mod:`sklearn.preprocessing`. .. seealso:: Refer to the :ref:`User Guide ` for more information regarding :class:`LogisticRegression` and more specifically the :ref:`Table ` summarizing solver/penalty supports. .. versionadded:: 0.17 Stochastic Average Gradient (SAG) descent solver. Multinomial support in version 0.18. .. versionadded:: 0.19 SAGA solver. .. versionchanged:: 0.22 The default solver changed from 'liblinear' to 'lbfgs' in 0.22. .. versionadded:: 1.2 newton-cholesky solver. Multinomial support in version 1.6.	'lbfgs'
	max_iter max_iter: int, default=100 Maximum number of iterations taken for the solvers to converge.	1000
	verbose verbose: int, default=0 For the liblinear and lbfgs solvers set verbose to any positive number for verbosity.	0
	warm_start warm_start: bool, default=False When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. Useless for liblinear solver. See :term:`the Glossary `. .. versionadded:: 0.17 warm_start to support lbfgs, newton-cg, sag, saga solvers.	False
	n_jobs n_jobs: int, default=None Does not have any effect. .. deprecated:: 1.8 `n_jobs` is deprecated in version 1.8 and will be removed in 1.10.	None

	n_estimators n_estimators: int, default=100 The number of trees in the forest. .. versionchanged:: 0.22 The default value of ``n_estimators`` changed from 10 to 100 in 0.22.	500
	criterion criterion: {"gini", "entropy", "log_loss"}, default="gini" The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "log_loss" and "entropy" both for the Shannon information gain, see :ref:`tree_mathematical_formulation`. Note: This parameter is tree-specific.	'gini'
	max_depth max_depth: int, default=None The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.	None
	min_samples_split min_samples_split: int or float, default=2 The minimum number of samples required to split an internal node: - If int, then consider `min_samples_split` as the minimum number. - If float, then `min_samples_split` is a fraction and `ceil(min_samples_split * n_samples)` are the minimum number of samples for each split. .. versionchanged:: 0.18 Added float values for fractions.	2
	min_samples_leaf min_samples_leaf: int or float, default=1 The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least ``min_samples_leaf`` training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. - If int, then consider `min_samples_leaf` as the minimum number. - If float, then `min_samples_leaf` is a fraction and `ceil(min_samples_leaf * n_samples)` are the minimum number of samples for each node. .. versionchanged:: 0.18 Added float values for fractions.	1
	min_weight_fraction_leaf min_weight_fraction_leaf: float, default=0.0 The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.	0.0
	max_features max_features: {"sqrt", "log2", None}, int or float, default="sqrt" The number of features to consider when looking for the best split: - If int, then consider `max_features` features at each split. - If float, then `max_features` is a fraction and `max(1, int(max_features * n_features_in_))` features are considered at each split. - If "sqrt", then `max_features=sqrt(n_features)`. - If "log2", then `max_features=log2(n_features)`. - If None, then `max_features=n_features`. .. versionchanged:: 1.1 The default of `max_features` changed from `"auto"` to `"sqrt"`. Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than ``max_features`` features.	'sqrt'
	max_leaf_nodes max_leaf_nodes: int, default=None Grow trees with ``max_leaf_nodes`` in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.	None
	min_impurity_decrease min_impurity_decrease: float, default=0.0 A node will be split if this split induces a decrease of the impurity greater than or equal to this value. The weighted impurity decrease equation is the following:: N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity) where ``N`` is the total number of samples, ``N_t`` is the number of samples at the current node, ``N_t_L`` is the number of samples in the left child, and ``N_t_R`` is the number of samples in the right child. ``N``, ``N_t``, ``N_t_R`` and ``N_t_L`` all refer to the weighted sum, if ``sample_weight`` is passed. .. versionadded:: 0.19	0.0
	bootstrap bootstrap: bool, default=True Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.	True
	oob_score oob_score: bool or callable, default=False Whether to use out-of-bag samples to estimate the generalization score. By default, :func:`~sklearn.metrics.accuracy_score` is used. Provide a callable with signature `metric(y_true, y_pred)` to use a custom metric. Only available if `bootstrap=True`. For an illustration of out-of-bag (OOB) error estimation, see the example :ref:`sphx_glr_auto_examples_ensemble_plot_ensemble_oob.py`.	False
	n_jobs n_jobs: int, default=None The number of jobs to run in parallel. :meth:`fit`, :meth:`predict`, :meth:`decision_path` and :meth:`apply` are all parallelized over the trees. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary ` for more details.	-1
	random_state random_state: int, RandomState instance or None, default=None Controls both the randomness of the bootstrapping of the samples used when building trees (if ``bootstrap=True``) and the sampling of the features to consider when looking for the best split at each node (if ``max_features < n_features``). See :term:`Glossary ` for details.	42
	verbose verbose: int, default=0 Controls the verbosity when fitting and predicting.	0
	warm_start warm_start: bool, default=False When set to ``True``, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest. See :term:`Glossary ` and :ref:`tree_ensemble_warm_start` for details.	False
	class_weight class_weight: {"balanced", "balanced_subsample"}, dict or list of dicts, default=None Weights associated with classes in the form ``{class_label: weight}``. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y. Note that for multioutput (including multilabel) weights should be defined for each class of every column in its own dict. For example, for four-class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}]. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as ``n_samples / (n_classes * np.bincount(y))`` The "balanced_subsample" mode is the same as "balanced" except that weights are computed based on the bootstrap sample for every tree grown. For multi-output, the weights of each column of y will be multiplied. Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.	None
	ccp_alpha ccp_alpha: non-negative float, default=0.0 Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ``ccp_alpha`` will be chosen. By default, no pruning is performed. See :ref:`minimal_cost_complexity_pruning` for details. See :ref:`sphx_glr_auto_examples_tree_plot_cost_complexity_pruning.py` for an example of such pruning. .. versionadded:: 0.22	0.0
	max_samples max_samples: int or float, default=None If bootstrap is True, the number of samples to draw from X to train each base estimator. - If None (default), then draw `X.shape[0]` samples. - If int, then draw `max_samples` samples. - If float, then draw `max(round(n_samples * max_samples), 1)` samples. Thus, `max_samples` should be in the interval `(0.0, 1.0]`. .. versionadded:: 0.22	None
	monotonic_cst monotonic_cst: array-like of int of shape (n_features), default=None Indicates the monotonicity constraint to enforce on each feature. - 1: monotonic increase - 0: no constraint - -1: monotonic decrease If monotonic_cst is None, no constraints are applied. Monotonicity constraints are not supported for: - multiclass classifications (i.e. when `n_classes > 2`), - multioutput classifications (i.e. when `n_outputs_ > 1`), - classifications trained on data with missing values. The constraints hold over the probability of the positive class. Read more in the :ref:`User Guide `. .. versionadded:: 1.4	None