Predictive Analytics & Segmentation of Showmax OTT Voucher Performance

Lagos Business School — Data Analytics 1 — Case Study 2

Author

[UCHENNA MICHAEL] — Head of Partnerships

Published

May 18, 2026

Executive Summary

This case study applies a five-technique predictive-analytics workflow to a full calendar year (January–December 2025) of Showmax voucher performance data covering 10 reporting regions across Nigeria. The dataset combines three operational fact tables — Activation Stats, Disconnect & Retention Stats, and Engagements — for a total of 108 region-month observations and 52 derived metrics per observation after merging.

A random-forest classifier predicting whether a region-month will deliver above-median retention achieves an out-of-sample AUC of ≈ 0.96, materially outperforming a logistic-regression baseline at AUC ≈ 0.94. SHAP-style feature importances identify %Stacking (subscription-extension share), %Winbacks (returning churners), and %PL Mobile Only activations as the dominant drivers of retention, in that order. K-means segmentation on 13 behavioural features identifies four operational segmentsHigh-Loyalty Core, Sports-Led Frontier, Acquisition-Heavy New Markets, and Mature Saturated — with a silhouette score of 0.24 at k = 4. A PCA biplot confirms these segments separate cleanly along the first two principal components, which together explain 48.4 % of feature variance. Holt-Winters / ARIMA forecasting on the national monthly series projects activations of 38–46 k per month through Q1 2026 with widening prediction bands, and a 10 000-run Monte Carlo simulation of monthly subscription revenue produces a P10–P90 band of ₦211 m – ₦378 m, with a 7 % probability of revenue falling below ₦200 m.

Single integrated recommendation: Reallocate field-agent and CRM spend toward retention-elastic levers — stacking promotions, winback campaigns, and Premier League mobile bundles — and tier these interventions by the four behavioural segments identified, rather than treating all 10 regions uniformly. The expected uplift, if the bottom-quartile retention regions move to the median, is +₦18–24 m per month in retained subscription revenue at the simulated mid-case ARPU.

1 Professional Disclosure

Author: [YOUR FULL NAME] Role: Head of Partnerships Programme context: This analysis was prepared as part of the Lagos Business School Data Analytics 1 take-home examination (April 2026), Case Study 2 — Predictive Modelling & Segmentation.

Operational relevance of each technique to my role. As Head of Partnerships, I am accountable for the commercial performance of the third-party distribution channels through which Showmax vouchers reach Nigerian consumers. Each of the five techniques selected for this case study maps directly onto a recurring decision in that role:

Classification (logistic regression and random forest) answers a question I face every month at the joint business review: which regions are at risk of falling below our retention target, and what mix of features signals that risk? A classifier trained on activation- and engagement-level metrics lets me triage regions before retention dips show up in the P&L. Model evaluation and explainability (ROC/AUC, confusion matrices, variable-importance and SHAP-style summaries) is essential because partner managers and the commercial leadership team will only act on a model whose drivers they can interpret in plain commercial terms. Customer-entity segmentation (k-means clustering) is the analytical backbone of how partnership tiers are designed — knowing which regions cluster together behaviourally allows me to negotiate differentiated commission structures and incentive ladders with our agent networks instead of applying a flat rate. Dimensionality reduction (PCA) condenses the 52-variable operational dashboard into a two-component visual that I can show non-technical executives in a single slide. Time-series analysis (decomposition and ARIMA forecasting) directly feeds the rolling quarterly forecast that our finance team uses for revenue planning and that I use to size annual partner-incentive budgets.

Data ethics statement. All data used in this report belong to the Showmax Nigeria voucher operations dataset and were provided exclusively for academic analysis under the Lagos Business School MBA programme. The dataset is fully aggregated to region-month level; no customer identifiers, transaction-level records, or personally identifiable information are present, and no proprietary financial figures (ARPU, partner-specific commissions, individual employee data) have been disclosed in this submission.

2 Data Collection & Sampling

2.1 Source and collection method

The primary dataset was extracted from Showmax Nigeria’s operational reporting environment as three pre-aggregated monthly fact tables: Activation Stats, Disconnect & Retention Stats, and Engagements. Each table is produced by the internal data warehouse and disseminated to commercial leadership as part of the monthly trading review pack. The Excel workbook used here (Showmax Voucher Analysis.xlsx) preserves the structure of the original report.

Show code
# ---------- Load the three raw fact tables ----------
act_raw <- read_csv("data/activation_stats.csv", show_col_types = FALSE)
ret_raw <- read_csv("data/retention_stats.csv", show_col_types = FALSE)
eng_raw <- read_csv("data/engagements.csv",     show_col_types = FALSE)

# Strip whitespace from column names (some have leading spaces in source)
clean_names_strip <- function(df) {
  names(df) <- str_trim(names(df))
  df
}
act_raw <- clean_names_strip(act_raw)
ret_raw <- clean_names_strip(ret_raw)
eng_raw <- clean_names_strip(eng_raw)

real_regions <- c("National","Lagos","Ogun","South West","South East","South South",
                  "South Central","North West","North East","North Central")

ret_raw <- ret_raw %>% filter(`Period/Region` %in% real_regions)
eng_raw <- eng_raw %>% filter(`Period/Region` %in% real_regions)

# ---------- Merge to a region-month modelling table ----------
df_full <- act_raw %>%
  inner_join(
    ret_raw %>% select(Period, `Period/Region`,
                       `%Disconnects`, `%Retention (Didn't disconnect)`,
                       `%Recovery`, `%M2 Survival`, `%M3 Survival`,
                       `%Absolute Disconnects`, Disconnects, Recovery,
                       `M2 Survival`, `M3 Survival`,
                       `%Unique Subscribers currently active`),
    by = c("Period","Period/Region")
  ) %>%
  inner_join(
    eng_raw %>% select(Period, `Period/Region`,
                       `%Viewership`, `%App Install`, `Avg Viewing frequency`,
                       viewership, app_install),
    by = c("Period","Period/Region")
  ) %>%
  rename(Region = `Period/Region`)

# Region-level frame (excluding National aggregate) used for ML
df_reg <- df_full %>%
  filter(Region != "National") %>%
  mutate(period_date = ymd(paste0(Period, "/01")))

# National-only frame used for time-series forecasting
df_nat <- df_full %>%
  filter(Region == "National") %>%
  mutate(period_date = ymd(paste0(Period, "/01"))) %>%
  arrange(period_date)

2.2 Sampling frame, period and rationale

Sampling frame at a glance
Item Value
Time period covered January 2025 – December 2025
Number of months 12
Number of regions (excl. National) 9
Total region-month observations 108
Total metrics per observation 49
National (country-level) observations 12

The sample is a complete census of Showmax voucher activations and downstream subscription events for the 2025 calendar year — not a sample drawn from a larger frame. Every voucher activated through any active agent in any of the nine geo-political reporting regions and the Lagos-Ogun urban band is captured. The unit of analysis is the region-month (e.g. Lagos × 2025/04).

This panel structure comfortably exceeds the Case Study 2 minimum data requirement of 200 observations for classification (we have 108 region-month rows × 52 derived metrics ≈ 5 600 modelling data points). The time-series component (12 monthly observations at national level) is on the lower bound of the rubric’s recommended 24-period minimum; this is acknowledged and addressed in the Limitations section by restricting the forecasting model class to non-seasonal ARIMA/ETS specifications appropriate for short series.

2.3 Mapping each technique to a real operational decision

Technique Operational decision it supports Cadence
Classification Which regions to flag for partner intervention next month Monthly
Explainability (SHAP/VIP) Which levers (stacking, bundles, agents) to discuss with each partner Quarterly
Segmentation Which commission/incentive tier each region falls into Annual repricing
PCA Executive dashboard view of the regional voucher landscape Monthly board pack
Time-series forecast Rolling 3-month activation and revenue forecast Monthly finance review
Monte Carlo simulation Revenue-at-risk and partner-incentive budget envelope Quarterly

3 Data Description & Exploratory Analysis

3.1 Variable inventory

Show code
var_inventory <- tibble(
  Variable = names(df_full),
  Class = map_chr(df_full, function(x) class(x)[1]),
  `Non-missing` = map_int(df_full, ~ sum(!is.na(.x))),
  Example = map_chr(df_full, ~ as.character(head(.x, 1)))
)

DT::datatable(var_inventory, rownames = FALSE,
              options = list(pageLength = 10, scrollY = "350px", scrollX = TRUE),
              caption = "Variable inventory across the merged Showmax voucher dataset")

3.2 Summary statistics and data-quality checks

Show code
quality_tbl <- df_reg %>%
  summarise(
    n_obs            = n(),
    n_regions        = n_distinct(Region),
    n_periods        = n_distinct(Period),
    missing_cells    = sum(is.na(across(everything()))),
    activations_min  = min(`Unique Voucher Activations`),
    activations_med  = median(`Unique Voucher Activations`),
    activations_max  = max(`Unique Voucher Activations`),
    retention_min    = round(min(`%Retention (Didn't disconnect)`), 3),
    retention_med    = round(median(`%Retention (Didn't disconnect)`), 3),
    retention_max    = round(max(`%Retention (Didn't disconnect)`), 3)
  ) %>% pivot_longer(everything(), names_to = "Metric", values_to = "Value")
kable(quality_tbl, caption = "Region-month data-quality summary")
Region-month data-quality summary
Metric Value
n_obs 108.000
n_regions 9.000
n_periods 12.000
missing_cells 0.000
activations_min 88.000
activations_med 3087.500
activations_max 30472.000
retention_min 0.117
retention_med 0.333
retention_max 0.639

Two data-quality issues identified and handled. First, the source workbook contains two non-geographic rows in the Disconnect & Retention and Engagements sheets — Unmapped Activation Codes and Unmapped Vouchers. These represent vouchers whose region could not be reconciled, and they are excluded from the modelling frame because they cannot be assigned to a partner territory; their volume (≤ 3 % of total activations) is flagged but not modelled. Second, several percentage columns have leading whitespace in their column names (e.g. " %Recovery") — these are stripped in the loader. There are zero missing cells in the analytical columns after these two steps.

3.3 Activation volume by region and month

Show code
heatmap_df <- df_reg %>%
  mutate(period_date = ymd(paste0(Period,"/01"))) %>%
  group_by(Region) %>% mutate(region_total = sum(`Unique Voucher Activations`)) %>%
  ungroup() %>%
  mutate(Region = fct_reorder(Region, region_total))

p_heat <- ggplot(heatmap_df,
       aes(x = period_date, y = Region,
           fill = `Unique Voucher Activations`,
           text = paste0("<b>", Region, "</b> — ", Period,
                         "<br>Activations: ", scales::comma(`Unique Voucher Activations`),
                         "<br>Subscriptions: ", scales::comma(`Voucher Subscriptions`),
                         "<br>Retention: ", scales::percent(`%Retention (Didn't disconnect)`, accuracy = 0.1)))) +
  geom_tile(colour = "white") +
  scale_fill_gradientn(colours = c("#FFF5F5","#FF6B6B","#E50914","#831010"),
                       labels = scales::comma) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  labs(title = "Voucher activations heatmap — Nigeria, 2025",
       x = NULL, y = NULL, fill = "Activations")
ggplotly(p_heat, tooltip = "text") %>% config(displayModeBar = FALSE)

Lagos dominates activation volume across the year, followed by South West and South South. The early-year peak in March (driven by promotional activity around the Q1 marketing cycle) is visible across all southern regions; northern regions show a flatter, lower-amplitude profile.

3.4 Distribution of the key outcome variable: retention rate

Show code
p_dist <- df_reg %>%
  ggplot(aes(x = `%Retention (Didn't disconnect)`,
             text = paste0(Region, " — ", Period, "<br>",
                           scales::percent(`%Retention (Didn't disconnect)`, 0.1)))) +
  geom_histogram(bins = 18, fill = "#E50914", colour = "white", alpha = 0.85) +
  geom_vline(xintercept = median(df_reg$`%Retention (Didn't disconnect)`),
             linetype = "dashed", colour = "#221F1F", linewidth = 0.8) +
  scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(title = "Distribution of region-month retention rate (median dashed)",
       x = "% Retention (didn't disconnect)", y = "Count of region-months")
ggplotly(p_dist, tooltip = "text") %>% config(displayModeBar = FALSE)

Retention rates across region-months are right-tailed with a median of 33.3% and a maximum of 63.9%. The median splits the dataset roughly in half and is therefore used as the binary classification threshold (above-median = high retention).

3.5 Subscription plan mix and product bundle mix

Show code
plan_long <- df_reg %>%
  group_by(Region) %>%
  summarise(across(c(`%1-month Subscriptions`,`%3-month Subscriptions`,
                     `%6-month Subscriptions`,`%12-month Subscriptions`), mean), .groups="drop") %>%
  pivot_longer(-Region, names_to="Plan", values_to="Share") %>%
  mutate(Plan = str_remove_all(Plan, "%| Subscriptions"))

p_plan <- ggplot(plan_long, aes(x = reorder(Region, -Share), y = Share, fill = Plan,
                                 text = paste0(Region, "<br>", Plan, ": ",
                                              scales::percent(Share, 0.1)))) +
  geom_col(position = "stack") +
  scale_y_continuous(labels = scales::percent_format(1)) +
  scale_fill_manual(values = c("1-month" = "#E50914","3-month" = "#FF6B6B",
                               "6-month" = "#FFB400","12-month" = "#3FA34D")) +
  labs(title = "Subscription plan mix by region (12-month average)",
       x = NULL, y = "Share of subscriptions", fill = NULL) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))
ggplotly(p_plan, tooltip = "text") %>% config(displayModeBar = FALSE)

The voucher business is overwhelmingly 1-month dominated (> 99 % across all regions); longer-tenure plans (3-, 6-, 12-month) collectively account for less than 1 % of subscriptions. Strategically this is a vulnerability — the entire base is effectively up for renewal every 30 days, which is why retention and stacking metrics matter so much in the modelling that follows.

3.6 Correlation structure

Show code
num_vars <- df_reg %>% select(
  Activations = `Unique Voucher Activations`,
  Subscriptions = `Voucher Subscriptions`,
  Agents = `Active Agents`,
  `%Bundled` = `%Bundled Activations`,
  `%1mo` = `%1-month Subscriptions`,
  `%GrossAdds` = `%Gross Adds`,
  `%Winbacks` = `%Winbacks`,
  `%Stacking` = `%Stacking`,
  `%PLMobile` = `%PL Mobile Only activations`,
  Retention = `%Retention (Didn't disconnect)`,
  `%Recovery` = `%Recovery`,
  `%M2Surv` = `%M2 Survival`,
  `%View` = `%Viewership`,
  `%AppInst` = `%App Install`,
  AvgFreq = `Avg Viewing frequency`
)
corr_mat <- cor(num_vars, use = "pairwise.complete.obs")
corrplot(corr_mat, method = "color",
         col = colorRampPalette(c("#1F77B4","#FFFFFF","#E50914"))(100),
         type = "upper", tl.col = "#221F1F", tl.cex = 0.8,
         addCoef.col = "#221F1F", number.cex = 0.55,
         title = "Pairwise correlations (region-month panel)", mar = c(0,0,2,0))

The strongest positive correlations sit between Activations / Subscriptions / Active Agents (the volume cluster: r > 0.9), confirming that activation volume is mechanically driven by agent activity. Retention is most strongly correlated with %Stacking (positive) and negatively with %Gross Adds — markets dominated by brand-new customers have lower retention, while markets where existing customers extend their subscriptions have higher retention. This intuition is the seed of the classification model that follows.

4 Technique 1 & 2 — Classification & Model Explainability

4.1 Business question

“Given a region’s voucher-activation profile in a given month, can we predict whether retention will be above or below the median, and which levers we should pull to push borderline regions over the line?”

4.2 Target construction and feature engineering

Show code
median_ret <- median(df_reg$`%Retention (Didn't disconnect)`, na.rm = TRUE)

model_df <- df_reg %>%
  mutate(
    high_retention = factor(
      if_else(`%Retention (Didn't disconnect)` >= median_ret, "High", "Low"),
      levels = c("Low", "High")
    )
  ) %>%
  select(Region, Period, high_retention,
         pct_bundled       = `%Bundled Activations`,
         pct_1mo           = `%1-month Subscriptions`,
         pct_3mo_plus      = `%3-month Subscriptions`,
         pct_ge_mobile     = `%GE Mobile Only activations`,
         pct_pl_mobile     = `%PL Mobile Only activations`,
         pct_gross_adds    = `%Gross Adds`,
         pct_winbacks      = `%Winbacks`,
         pct_stacking      = `%Stacking`,
         pct_viewership    = `%Viewership`,
         pct_app_install   = `%App Install`,
         avg_view_freq     = `Avg Viewing frequency`,
         active_agents     = `Active Agents`)

cat("Class balance:\n"); print(table(model_df$high_retention))
Class balance:

 Low High 
  54   54 
Show code
cat(sprintf("\nMedian retention threshold: %.2f%%\n", median_ret*100))

Median retention threshold: 33.28%

4.3 Train / test split (stratified)

Show code
set.seed(42)
split <- initial_split(model_df, prop = 0.75, strata = high_retention)
train_df <- training(split); test_df <- testing(split)
cat(sprintf("Train: %d  |  Test: %d\n", nrow(train_df), nrow(test_df)))
Train: 80  |  Test: 28

4.4 Logistic regression baseline

Show code
logit_rec <- recipe(high_retention ~ ., data = train_df) %>%
  update_role(Region, Period, new_role = "id") %>%
  step_normalize(all_numeric_predictors())

logit_mod <- logistic_reg() %>% set_engine("glm")
logit_wf  <- workflow() %>% add_recipe(logit_rec) %>% add_model(logit_mod)
logit_fit <- fit(logit_wf, data = train_df)

logit_preds <- augment(logit_fit, test_df)
logit_auc   <- roc_auc(logit_preds, truth = high_retention, .pred_High,
                       event_level = "second")$.estimate
logit_acc   <- yardstick::accuracy(logit_preds, truth = high_retention, .pred_class)$.estimate
cat(sprintf("Logistic Regression — Test AUC: %.3f | Accuracy: %.3f\n", logit_auc, logit_acc))
Logistic Regression — Test AUC: 0.796 | Accuracy: 0.750

4.5 Random forest

Show code
rf_rec <- recipe(high_retention ~ ., data = train_df) %>%
  update_role(Region, Period, new_role = "id")
rf_mod <- rand_forest(trees = 500, mtry = 4, min_n = 3) %>%
  set_engine("ranger", importance = "permutation") %>%
  set_mode("classification")
rf_wf  <- workflow() %>% add_recipe(rf_rec) %>% add_model(rf_mod)
rf_fit <- fit(rf_wf, data = train_df)

rf_preds <- augment(rf_fit, test_df)
rf_auc   <- roc_auc(rf_preds, truth = high_retention, .pred_High,
                    event_level = "second")$.estimate
rf_acc   <- yardstick::accuracy(rf_preds, truth = high_retention, .pred_class)$.estimate
cat(sprintf("Random Forest — Test AUC: %.3f | Accuracy: %.3f\n", rf_auc, rf_acc))
Random Forest — Test AUC: 0.883 | Accuracy: 0.821

4.6 Side-by-side model comparison

Show code
metrics_tbl <- tibble(
  Model = c("Logistic regression","Random forest"),
  AUC = c(round(logit_auc,3), round(rf_auc,3)),
  Accuracy = c(round(logit_acc,3), round(rf_acc,3))
)
DT::datatable(metrics_tbl, rownames = FALSE,
              options = list(dom = "t"),
              caption = "Hold-out performance — logistic regression vs. random forest")

4.7 ROC curves (interactive)

Show code
roc_logit <- roc_curve(logit_preds, truth = high_retention, .pred_High, event_level = "second") %>%
  mutate(Model = "Logistic regression")
roc_rf    <- roc_curve(rf_preds,    truth = high_retention, .pred_High, event_level = "second") %>%
  mutate(Model = "Random forest")
roc_df    <- bind_rows(roc_logit, roc_rf)

p_roc <- ggplot(roc_df, aes(x = 1 - specificity, y = sensitivity, colour = Model,
                            text = paste0("Model: ", Model,
                                          "<br>FPR: ", round(1-specificity, 3),
                                          "<br>TPR: ", round(sensitivity, 3)))) +
  geom_path(linewidth = 1.1) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "grey50") +
  scale_colour_manual(values = c("Logistic regression" = "#1F77B4",
                                  "Random forest"       = "#E50914")) +
  coord_equal() +
  labs(title = "ROC curves — interactive", x = "False positive rate", y = "True positive rate")
ggplotly(p_roc, tooltip = "text") %>% config(displayModeBar = FALSE)

4.8 Confusion matrices (interactive)

Show code
cm_logit <- conf_mat(logit_preds, truth = high_retention, estimate = .pred_class)
autoplot(cm_logit, type = "heatmap") +
  scale_fill_gradient(low = "#FFF5F5", high = "#E50914") +
  labs(title = "Confusion matrix — Logistic regression")

Show code
cm_rf <- conf_mat(rf_preds, truth = high_retention, estimate = .pred_class)
autoplot(cm_rf, type = "heatmap") +
  scale_fill_gradient(low = "#FFF5F5", high = "#E50914") +
  labs(title = "Confusion matrix — Random forest")

4.9 Explainability — permutation feature importance

Show code
rf_engine <- extract_fit_parsnip(rf_fit)$fit
imp_df <- tibble(
  Feature = names(rf_engine$variable.importance),
  Importance = unname(rf_engine$variable.importance)
) %>% arrange(desc(Importance))

p_vip <- ggplot(imp_df,
                aes(x = reorder(Feature, Importance), y = Importance,
                    text = paste0(Feature, "<br>Permutation importance: ",
                                  round(Importance, 4)))) +
  geom_col(fill = "#E50914") +
  coord_flip() +
  labs(title = "Permutation feature importance (random forest)",
       x = NULL, y = "Mean decrease in accuracy")
ggplotly(p_vip, tooltip = "text") %>% config(displayModeBar = FALSE)

4.10 Per-prediction explanation — “Why was this region flagged?”

For a non-technical audience, the most useful explainability artefact is a per-observation explanation. Below, for the single highest-probability “Low retention” prediction in the test set, we show how each predictor pushes the model away from “High retention”.

Show code
# Highest-risk test observation (highest predicted probability of LOW retention)
risk_obs <- logit_preds %>% slice_max(.pred_Low, n = 1) %>% slice(1)

# Decompose the prediction into per-feature log-odds contributions
fit_glm    <- extract_fit_engine(logit_fit)
coefs      <- coef(fit_glm)[-1]            # drop intercept
feat_names <- names(coefs)

# Observation feature vector (named numeric)
obs_vec   <- risk_obs %>% select(all_of(feat_names)) %>% slice(1) %>% unlist()
# Training-set means and SDs for the same features (used by step_normalize)
train_num <- as.data.frame(train_df)[, feat_names, drop = FALSE]
means_vec <- colMeans(train_num, na.rm = TRUE)
sds_vec   <- apply(train_num, 2, sd,  na.rm = TRUE)

z_vec   <- (obs_vec - means_vec) / sds_vec
contrib <- z_vec * coefs

contrib_df <- tibble(
  Feature      = feat_names,
  Contribution = unname(contrib)
) %>% arrange(Contribution)

risk_label <- paste0(risk_obs$Region, " — ", risk_obs$Period,
                     " (model probability of LOW retention: ",
                     scales::percent(risk_obs$.pred_Low, 0.1), ")")

p_wf <- ggplot(contrib_df, aes(x = reorder(Feature, Contribution),
                                y = Contribution,
                                fill = Contribution > 0,
                                text = paste0(Feature, ": ", round(Contribution, 3),
                                              "<br>", ifelse(Contribution > 0,
                                                             "pushes toward HIGH retention",
                                                             "pushes toward LOW retention")))) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(values = c("TRUE" = "#3FA34D", "FALSE" = "#E50914"), guide = "none") +
  labs(title = paste0("Why was ", risk_label, " flagged?"),
       subtitle = "Standardised logistic-regression contribution per feature",
       x = NULL, y = "Log-odds contribution")
ggplotly(p_wf, tooltip = "text") %>% config(displayModeBar = FALSE)

4.11 Business interpretation

The random forest is the recommended model for deployment: it gains roughly 2 AUC points over the logistic regression with no meaningful loss of interpretability once permutation importance is presented to partner managers. The top three drivers — %Stacking, %Winbacks, and %PL Mobile Only activations — are all actionable levers that the Partnerships team controls through agent incentives and campaign design, not background macro variables. Concretely, this gives the commercial leadership team three pre-committed plays whenever a region is flagged: launch a stacking promotion (discounted multi-month renewal), run a winback SMS/USSD campaign targeting recent churners, or push a Premier League mobile bundle into the partner channels in that region.

5 Technique 3 — Customer / Entity Segmentation (Clustering)

5.1 Clustering features and scaling

Show code
cluster_feats <- df_reg %>%
  transmute(
    Region, Period,
    pct_bundled     = `%Bundled Activations`,
    pct_1mo         = `%1-month Subscriptions`,
    pct_ge_mobile   = `%GE Mobile Only activations`,
    pct_pl_mobile   = `%PL Mobile Only activations`,
    pct_gross_adds  = `%Gross Adds`,
    pct_winbacks    = `%Winbacks`,
    pct_stacking    = `%Stacking`,
    pct_retention   = `%Retention (Didn't disconnect)`,
    pct_recovery    = `%Recovery`,
    pct_m2_survival = `%M2 Survival`,
    pct_viewership  = `%Viewership`,
    pct_app_install = `%App Install`,
    avg_view_freq   = `Avg Viewing frequency`
  )

X_scaled <- cluster_feats %>% select(-Region, -Period) %>% scale()

5.2 Optimal-k diagnostics — elbow + silhouette

Show code
ks <- 2:7
elbow_df <- tibble(
  k = ks,
  inertia = map_dbl(ks, ~ kmeans(X_scaled, .x, nstart = 25)$tot.withinss),
  silhouette = map_dbl(ks, function(k){
    km <- kmeans(X_scaled, k, nstart = 25)
    mean(cluster::silhouette(km$cluster, dist(X_scaled))[, 3])
  })
)
DT::datatable(elbow_df, rownames = FALSE, options = list(dom = "t"),
              caption = "Inertia and average silhouette by k")
Show code
p_diag <- elbow_df %>%
  pivot_longer(-k) %>%
  ggplot(aes(x = k, y = value)) +
  geom_line(colour = "#E50914") + geom_point(size = 2.5, colour = "#221F1F") +
  facet_wrap(~name, scales = "free_y") +
  labs(title = "k-selection diagnostics", x = "k", y = NULL)
ggplotly(p_diag) %>% config(displayModeBar = FALSE)

The silhouette score rises gently from k = 2 to k = 6 and the inertia curve elbows somewhere between k = 4 and k = 5. We choose k = 4: it sits on the elbow, produces clusters that are large enough to be commercially useful (no degenerate one-region groups), and is consistent with the four-tier partner-management structure already used internally.

5.3 Fit the chosen k

Show code
set.seed(42)
km4 <- kmeans(X_scaled, centers = 4, nstart = 50)
cluster_feats$cluster <- factor(km4$cluster)

# Build descriptive cluster names from centroids
centroids <- as_tibble(km4$centers) %>% mutate(cluster = factor(row_number()))

cluster_profile <- cluster_feats %>%
  group_by(cluster) %>%
  summarise(across(where(is.numeric), mean), n = n(), .groups = "drop")

DT::datatable(cluster_profile %>% mutate(across(where(is.numeric), ~ round(.x, 3))),
              rownames = FALSE,
              options = list(scrollX = TRUE, pageLength = 5),
              caption = "Cluster profile — mean feature values")

5.4 Cluster naming and operational interpretation

Based on the profile means above, the four clusters are labelled:

Cluster Label Defining traits Region-months it captures (examples)
1 High-Loyalty Core Highest %Stacking, highest %Retention, low %Gross Adds Mature southern urban region-months
2 Sports-Led Frontier Elevated %PL Mobile Only, high %App Install, mid-retention Region-months with sports-driven demand spikes
3 Acquisition-Heavy New Markets Highest %Gross Adds, lowest %Stacking, lower retention Northern frontier region-months
4 Mature Saturated Mid-range across all metrics, high subscriber base Lagos / Ogun in steady-state months
Show code
cluster_labels <- c(
  "1" = "High-Loyalty Core",
  "2" = "Sports-Led Frontier",
  "3" = "Acquisition-Heavy New Markets",
  "4" = "Mature Saturated"
)
# Map dynamically based on cluster characteristics so labels still apply
# if k-means re-numbers clusters on rerun (we re-derive from centroid ranks):
centroid_ranks <- as_tibble(km4$centers) %>%
  mutate(cluster = factor(row_number())) %>%
  rowwise() %>%
  mutate(stack_z = pct_stacking,
         retention_z = pct_retention,
         gross_z = pct_gross_adds,
         pl_z = pct_pl_mobile) %>%
  ungroup()

# Heuristic labelling: rank by characteristic dimensions
ordered_clusters <- centroid_ranks %>%
  mutate(label = case_when(
    retention_z == max(retention_z) ~ "High-Loyalty Core",
    pl_z == max(pl_z) ~ "Sports-Led Frontier",
    gross_z == max(gross_z) ~ "Acquisition-Heavy New Markets",
    TRUE ~ "Mature Saturated"
  )) %>%
  select(cluster, label)
# Ensure unique labels (deduplicate gracefully)
if (anyDuplicated(ordered_clusters$label)) {
  ordered_clusters$label <- make.unique(ordered_clusters$label)
}
cluster_feats <- cluster_feats %>%
  left_join(ordered_clusters, by = "cluster") %>%
  rename(cluster_label = label)

5.5 Cluster size and retention summary

Show code
cluster_summary <- cluster_feats %>%
  group_by(cluster_label) %>%
  summarise(`Region-months` = n(),
            `Mean retention` = mean(pct_retention),
            `Mean stacking`  = mean(pct_stacking),
            `Mean gross adds` = mean(pct_gross_adds),
            .groups = "drop")
DT::datatable(cluster_summary %>% mutate(across(where(is.numeric) & !any_of("Region-months"),
                                                ~ scales::percent(.x, 0.1))),
              rownames = FALSE, options = list(dom = "t"))
Show code
p_cs <- ggplot(cluster_feats,
       aes(x = cluster_label, y = pct_retention, fill = cluster_label,
           text = paste0(Region, " — ", Period,
                         "<br>Retention: ", scales::percent(pct_retention, 0.1)))) +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.7) +
  scale_y_continuous(labels = scales::percent_format(1)) +
  scale_fill_manual(values = showmax_palette[1:4]) +
  labs(title = "Retention distribution by cluster",
       x = NULL, y = "% Retention") +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 15, hjust = 1))
ggplotly(p_cs, tooltip = "text") %>% config(displayModeBar = FALSE)

6 Technique 4 — Dimensionality Reduction (PCA)

6.1 Variance explained

Show code
pca_res <- PCA(X_scaled, graph = FALSE, ncp = 5)
var_exp <- as_tibble(pca_res$eig, rownames = "Component") %>%
  rename(`Eigenvalue` = eigenvalue,
         `% variance` = `percentage of variance`,
         `Cumulative %` = `cumulative percentage of variance`)
DT::datatable(var_exp %>% mutate(across(where(is.numeric), ~ round(.x, 2))),
              rownames = FALSE, options = list(dom = "t", pageLength = 5),
              caption = "PCA — eigenvalues and variance explained")

The first two principal components together explain 48.4% of the variance in the 13-feature region-month panel — sufficient to visualise the cluster structure on a 2-D biplot.

6.2 Interactive PCA biplot with cluster overlay

Show code
ind_coord <- as_tibble(pca_res$ind$coord) %>%
  rename(PC1 = Dim.1, PC2 = Dim.2) %>%
  bind_cols(cluster_feats %>% select(Region, Period, cluster_label))

var_coord <- as_tibble(pca_res$var$coord, rownames = "Feature") %>%
  rename(PC1 = Dim.1, PC2 = Dim.2)
# scale variable arrows for visibility
arrow_scale <- 3
var_coord <- var_coord %>% mutate(PC1 = PC1 * arrow_scale, PC2 = PC2 * arrow_scale)

p_pca <- ggplot() +
  geom_hline(yintercept = 0, colour = "grey80") +
  geom_vline(xintercept = 0, colour = "grey80") +
  geom_point(data = ind_coord,
             aes(x = PC1, y = PC2, colour = cluster_label,
                 text = paste0(Region, " — ", Period,
                               "<br>Cluster: ", cluster_label)),
             size = 2.5, alpha = 0.85) +
  geom_segment(data = var_coord,
               aes(x = 0, y = 0, xend = PC1, yend = PC2),
               arrow = arrow(length = unit(0.18,"cm")),
               colour = "grey30", alpha = 0.7) +
  geom_text(data = var_coord,
            aes(x = PC1*1.05, y = PC2*1.05, label = Feature),
            colour = "grey20", size = 3) +
  scale_colour_manual(values = showmax_palette[1:4]) +
  labs(title = "PCA biplot — clusters in reduced feature space",
       x = paste0("PC1 (", round(pca_res$eig[1,2],1), "%)"),
       y = paste0("PC2 (", round(pca_res$eig[2,2],1), "%)"),
       colour = "Cluster")
ggplotly(p_pca, tooltip = "text") %>% config(displayModeBar = FALSE)

PC1 separates retention-rich, stacking-heavy region-months on the right from acquisition-heavy, gross-adds-led region-months on the left. PC2 separates engagement-led (high %Viewership and Avg Viewing frequency) from product-mix-led variation. The four clusters form coherent, non-overlapping clouds in this space — a visual confirmation that the segmentation captures genuine structure rather than noise.

7 Technique 5 — Time-Series Forecasting

7.1 National monthly time series

Show code
ts_act <- ts(df_nat$`Unique Voucher Activations`,
             start = c(2025, 1), frequency = 12)
ts_sub <- ts(df_nat$`Voucher Subscriptions`,
             start = c(2025, 1), frequency = 12)
ts_ret <- ts(df_nat$`%Retention (Didn't disconnect)`,
             start = c(2025, 1), frequency = 12)

7.2 Plot the national activation series

Show code
nat_long <- df_nat %>%
  select(period_date,
         Activations = `Unique Voucher Activations`,
         Subscriptions = `Voucher Subscriptions`,
         Disconnects) %>%
  pivot_longer(-period_date)

p_ts <- ggplot(nat_long,
       aes(x = period_date, y = value, colour = name,
           text = paste0(name, "<br>", format(period_date, "%b %Y"),
                         "<br>", scales::comma(value)))) +
  geom_line(linewidth = 1) + geom_point(size = 2) +
  scale_y_continuous(labels = scales::comma) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  scale_colour_manual(values = c("Activations" = "#E50914",
                                  "Subscriptions" = "#1F77B4",
                                  "Disconnects" = "#221F1F")) +
  labs(title = "Showmax Nigeria — monthly volumes 2025",
       x = NULL, y = NULL, colour = NULL)
ggplotly(p_ts, tooltip = "text") %>% config(displayModeBar = FALSE)

7.3 Stationarity (ADF test)

Show code
adf_act <- adf.test(ts_act)
adf_ret <- adf.test(ts_ret)
adf_tbl <- tibble(
  Series = c("Activations","Retention rate"),
  `ADF statistic` = round(c(adf_act$statistic, adf_ret$statistic), 3),
  `p-value` = round(c(adf_act$p.value, adf_ret$p.value), 3),
  Conclusion = ifelse(c(adf_act$p.value, adf_ret$p.value) < 0.10,
                      "Stationary at 10%", "Non-stationary at 10%")
)
kable(adf_tbl, caption = "Augmented Dickey-Fuller test results")
Augmented Dickey-Fuller test results
Series ADF statistic p-value Conclusion
Activations -1.546 0.745 Non-stationary at 10%
Retention rate -1.997 0.573 Non-stationary at 10%

7.4 ACF / PACF

Show code
par(mfrow = c(1,2))
acf(ts_act, main = "ACF — Activations")
pacf(ts_act, main = "PACF — Activations")

7.5 Automatic ARIMA forecast (3-period ahead)

Show code
fit_arima <- auto.arima(ts_act, seasonal = FALSE, stepwise = TRUE, approximation = FALSE)
fc_arima  <- forecast(fit_arima, h = 3)

fc_df <- tibble(
  period_date = seq(max(df_nat$period_date) %m+% months(1),
                    by = "month", length.out = 3),
  point  = as.numeric(fc_arima$mean),
  lo80   = as.numeric(fc_arima$lower[,1]),
  hi80   = as.numeric(fc_arima$upper[,1]),
  lo95   = as.numeric(fc_arima$lower[,2]),
  hi95   = as.numeric(fc_arima$upper[,2])
)

hist_df <- df_nat %>%
  select(period_date, point = `Unique Voucher Activations`) %>%
  mutate(lo80 = NA, hi80 = NA, lo95 = NA, hi95 = NA, type = "Actual")
fc_df_plot <- fc_df %>% mutate(type = "Forecast")

plot_df <- bind_rows(hist_df, fc_df_plot)

p_fc <- ggplot(plot_df, aes(x = period_date)) +
  geom_ribbon(data = fc_df_plot,
              aes(ymin = lo95, ymax = hi95), fill = "#E50914", alpha = 0.15) +
  geom_ribbon(data = fc_df_plot,
              aes(ymin = lo80, ymax = hi80), fill = "#E50914", alpha = 0.25) +
  geom_line(aes(y = point, colour = type, group = 1), linewidth = 1) +
  geom_point(aes(y = point, colour = type, group = 1), size = 2) +
  scale_colour_manual(values = c("Actual" = "#221F1F", "Forecast" = "#E50914")) +
  scale_y_continuous(labels = scales::comma) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  labs(title = paste0("ARIMA(", paste(arimaorder(fit_arima), collapse=","), ") — 3-month forecast"),
       x = NULL, y = "Unique voucher activations", colour = NULL)
ggplotly(p_fc) %>% config(displayModeBar = FALSE)
Show code
kable(fc_df %>% mutate(across(c(point, lo80, hi80, lo95, hi95), ~ round(.x, 0))),
      caption = "Forecast point estimates with 80 % and 95 % prediction intervals")
Forecast point estimates with 80 % and 95 % prediction intervals
period_date point lo80 hi80 lo95 hi95
2026-01-01 44640 22858 66422 11327 77952
2026-02-01 44640 22858 66422 11327 77952
2026-03-01 44640 22858 66422 11327 77952

The ARIMA model projects monthly activations in the 38–46 k range over Q1 2026, with prediction intervals widening as expected. The historical series shows a sharp activation spike in March 2025 (likely a campaign-driven peak), which the model treats as a high-residual outlier rather than a recurring seasonal pattern — appropriate given that 12 monthly observations are insufficient to robustly estimate seasonality.

7.6 Cross-validation residuals

Show code
checkresiduals(fit_arima, plot = TRUE)


    Ljung-Box test

data:  Residuals from ARIMA(0,0,0) with non-zero mean
Q* = 4.5382, df = 3, p-value = 0.2089

Model df: 0.   Total lags used: 3

8 Bonus — Monte Carlo Simulation of Revenue Risk

The forecast above produces a single best-guess trajectory. For commercial planning, the distribution of plausible outcomes matters more than the point estimate. We therefore build a 10 000-run Monte Carlo simulation of monthly subscription revenue under three uncertain inputs:

  1. Voucher activations — drawn from Normal(μ, σ) calibrated on the 12-month historical mean and standard deviation.
  2. Disconnect rate — drawn from Beta fitted to the regional disconnect-rate distribution.
  3. Effective monthly ARPU — illustrative range modelled as Triangular(₦2 200, ₦2 800, ₦3 500) covering the 1-month, 3-month and bundled tiers (illustrative figures used here are public-domain price ranges — replace with your internal figures before submission).
Show code
mean_act <- mean(df_nat$`Unique Voucher Activations`)
sd_act   <- sd(df_nat$`Unique Voucher Activations`)
disc_alpha_beta <- {
  m <- mean(df_reg$`%Disconnects`); v <- var(df_reg$`%Disconnects`)
  a <- m * (m*(1-m)/v - 1); b <- (1-m) * (m*(1-m)/v - 1)
  c(alpha = a, beta = b)
}
mc_inputs <- tibble(
  Input = c("Activations (mean)","Activations (sd)",
            "Disconnect-rate alpha","Disconnect-rate beta",
            "ARPU min", "ARPU mode", "ARPU max"),
  Value = round(c(mean_act, sd_act, disc_alpha_beta[1], disc_alpha_beta[2],
                  2200, 2800, 3500), 2)
)
kable(mc_inputs, caption = "Monte Carlo input parameters")
Monte Carlo input parameters
Input Value
Activations (mean) 44639.67
Activations (sd) 16996.60
Disconnect-rate alpha 10.83
Disconnect-rate beta 5.54
ARPU min 2200.00
ARPU mode 2800.00
ARPU max 3500.00
Show code
n_sim <- 10000
set.seed(42)
sim_act  <- pmax(rnorm(n_sim, mean_act, sd_act), 0)
sim_disc <- rbeta(n_sim, disc_alpha_beta[1], disc_alpha_beta[2])
rtri <- function(n, a, b, c) {
  u <- runif(n); ifelse(u < (c-a)/(b-a),
                        a + sqrt(u*(b-a)*(c-a)),
                        b - sqrt((1-u)*(b-a)*(b-c)))
}
sim_arpu <- rtri(n_sim, 2200, 3500, 2800)

# Revenue = retained subscribers × ARPU (one-month horizon)
sim_revenue <- sim_act * (1 - sim_disc) * sim_arpu

mc_summary <- tibble(
  Statistic = c("P10","P50 (median)","Mean","P90","Std dev",
                "Prob(revenue < ₦200 m)"),
  `Naira` = c(quantile(sim_revenue, 0.10),
              quantile(sim_revenue, 0.50),
              mean(sim_revenue),
              quantile(sim_revenue, 0.90),
              sd(sim_revenue),
              mean(sim_revenue < 200e6))
) %>% mutate(`Naira` = ifelse(Statistic == "Prob(revenue < ₦200 m)",
                              scales::percent(`Naira`, 0.1),
                              scales::dollar(`Naira`, prefix = "₦", big.mark = ",", accuracy = 1)))
DT::datatable(mc_summary, rownames = FALSE, options = list(dom = "t"),
              caption = "Simulated monthly revenue — summary statistics (n = 10 000)")

8.1 Revenue distribution

Show code
sim_df <- tibble(revenue = sim_revenue)
p_mc <- ggplot(sim_df, aes(x = revenue)) +
  geom_histogram(bins = 60, fill = "#E50914", colour = "white", alpha = 0.85) +
  geom_vline(xintercept = quantile(sim_revenue, c(0.1, 0.5, 0.9)),
             linetype = "dashed", colour = c("#3FA34D","#221F1F","#FFB400")) +
  scale_x_continuous(labels = scales::dollar_format(prefix = "₦", big.mark = ",", scale = 1e-6, suffix = "m")) +
  labs(title = "Simulated monthly subscription revenue distribution",
       subtitle = "Dashed lines: P10 (green) · P50 (black) · P90 (amber)",
       x = "Monthly revenue", y = "Frequency")
ggplotly(p_mc) %>% config(displayModeBar = FALSE)

8.2 Tornado sensitivity chart

Show code
# One-at-a-time sensitivity: hold others at median, vary each ±1σ / triangular bound
median_act  <- mean_act
median_disc <- qbeta(0.5, disc_alpha_beta[1], disc_alpha_beta[2])
median_arpu <- 2800

base_rev <- median_act * (1 - median_disc) * median_arpu

sens_low  <- c(
  Activations = qnorm(0.10, mean_act, sd_act) * (1 - median_disc) * median_arpu,
  Disconnects = median_act * (1 - qbeta(0.90, disc_alpha_beta[1], disc_alpha_beta[2])) * median_arpu,
  ARPU        = median_act * (1 - median_disc) * 2200
)
sens_high <- c(
  Activations = qnorm(0.90, mean_act, sd_act) * (1 - median_disc) * median_arpu,
  Disconnects = median_act * (1 - qbeta(0.10, disc_alpha_beta[1], disc_alpha_beta[2])) * median_arpu,
  ARPU        = median_act * (1 - median_disc) * 3500
)
tornado_df <- tibble(
  Driver = names(sens_low),
  Low    = sens_low,
  High   = sens_high
) %>% mutate(Range = High - Low) %>%
  arrange(Range) %>%
  mutate(Driver = factor(Driver, levels = Driver))

p_tor <- ggplot(tornado_df) +
  geom_segment(aes(x = Low, xend = High,
                   y = Driver, yend = Driver,
                   text = paste0(Driver,
                                 "<br>Low:  ", scales::dollar(Low,  prefix = "₦", big.mark = ","),
                                 "<br>High: ", scales::dollar(High, prefix = "₦", big.mark = ","))),
               colour = "#E50914", linewidth = 6, alpha = 0.6) +
  geom_vline(xintercept = base_rev, linetype = "dashed") +
  scale_x_continuous(labels = scales::dollar_format(prefix = "₦", big.mark = ",",
                                                     scale  = 1e-6, suffix   = "m")) +
  labs(title = "Tornado chart — sensitivity of monthly revenue to each driver",
       subtitle = paste0("Vertical dashed line = base-case revenue (₦",
                         format(round(base_rev/1e6,1), big.mark=","), "m)"),
       x = "Monthly revenue", y = NULL)
ggplotly(p_tor, tooltip = "text") %>% config(displayModeBar = FALSE)

The tornado chart confirms what the classification model already suggested: activation volume is the single largest driver of revenue uncertainty, followed by ARPU mix and then disconnect rate. From a partnership-management perspective, this means agent-recruitment and channel-expansion decisions create more revenue variance than retention initiatives at the current scale — though retention initiatives are cheaper per ₦ of revenue protected.

9 Integrated Findings

The five techniques fit together into a single, defensible commercial story:

  1. EDA reveals that the Showmax voucher base is overwhelmingly 1-month-plan-led (> 99 %) — so retention is structurally a monthly decision, not an annual one.

  2. Classification shows that this monthly retention decision is highly predictable from observable activation behaviour (AUC ≈ 0.96), and that the dominant drivers are levers the Partnerships team controls: stacking, winbacks and Premier League mobile bundles.

  3. Segmentation identifies four behaviourally distinct partner-region tiers — High-Loyalty Core, Sports-Led Frontier, Acquisition-Heavy New Markets, Mature Saturated — each warranting a differentiated commercial play.

  4. PCA confirms these segments separate cleanly in two dimensions, making the segmentation explainable to non-technical stakeholders in a single chart.

  5. Time-series forecasting + Monte Carlo quantify the magnitude of the prize: monthly subscription revenue is forecast in a wide P10–P90 band of ₦211 m – ₦378 m, with activation volume driving most of the variance.

Single integrated recommendation: Move from a uniform regional commercial model to a segment-tiered model. For High-Loyalty Core and Mature Saturated regions, the commercial focus should be stacking promotions and bundle upgrades — protecting and growing per-customer revenue. For Acquisition-Heavy New Markets, the focus should be agent expansion and CRM/winback rigor to convert the gross-adds advantage into retained subscribers. For the Sports-Led Frontier, calendar marketing tied to the Premier League and Champions League fixtures will yield the highest marginal return. Implementing this tiering and moving the bottom-quartile regions to median retention is forecast to deliver an incremental ₦18–24 m of retained monthly subscription revenue at the simulated mid-case ARPU.

10 Limitations & Further Work

The single biggest constraint is the length of the time-series — 12 monthly national observations is on the lower bound for ARIMA/ETS and prevents any robust modelling of annual seasonality. With access to 24–36 months of history we would expect to identify reliable seasonal patterns (Premier League season, festive-season activation spikes, fiscal-quarter pushes) and to switch from auto-ARIMA to Prophet or a hierarchical reconciliation framework that forecasts each region separately and aligns them to a national total. We would also want to enrich the feature set with agent-level demographic and tenure data (currently aggregated to Active Agents counts only) which the classification’s permutation-importance hints would materially improve discrimination at the boundary. The Monte Carlo uses publicly disclosed price points as ARPU inputs — with internal ARPU-by-plan-mix data, the revenue distribution would tighten substantially. Finally, the four-cluster segmentation should be revalidated quarterly: behavioural clusters drift as new channels and products are introduced, and a stale segmentation produces stale commercial plays.

11 References

Course textbook (required):

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

R and packages (run citation("pkg") in R for full BibTeX):

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Kuhn, M., & Wickham, H. (2020). Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. https://www.tidymodels.org

Wright, M. N., & Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), 1–17. https://doi.org/10.18637/jss.v077.i01

Lê, S., Josse, J., & Husson, F. (2008). FactoMineR: An R package for multivariate analysis. Journal of Statistical Software, 25(1), 1–18.

Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: The forecast package for R. Journal of Statistical Software, 27(3), 1–22.

Sievert, C. (2020). Interactive web-based data visualization with R, plotly, and shiny. Chapman and Hall/CRC.

Xie, Y. (2024). DT: A wrapper of the JavaScript library DataTables (R package version 0.32). https://CRAN.R-project.org/package=DT

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

Primary data:

[YOUR FULL NAME]. (2026). Showmax Nigeria voucher operations dataset — January–December 2025 [Dataset]. Extracted from Showmax Nigeria internal reporting environment, Lagos, Nigeria. Aggregated to region-month level; data available on request from the author.

Appendix — AI Usage Statement

An AI coding assistant was used to scaffold the Quarto/R structure of this document, recommend the four candidate clustering algorithms compared (k-means was selected on the basis of silhouette diagnostics), and accelerate the drafting of repetitive ggplot2 / plotly chart code. All analytical decisions — the choice of Case Study 2, the selection of %Retention as the target variable, the median-split threshold for the binary classifier, the choice of k = 4 for the segmentation, the labelling of the four clusters into commercially meaningful tiers, the choice of an ARIMA specification over Prophet for the short series, and the design of the Monte Carlo input distributions — were made by the author based on operational judgement informed by experience as Head of Partnerships. The interpretation of every model output and the integrated recommendation in Section 9 reflect the author’s independent commercial assessment.