Exploratory & Inferential Analytics of Showmax OTT Voucher Performance

Voucher Activation, Subscription Dynamics & Customer Behaviour — Nigeria 2025

Author

Uchenna Michael

Published

May 20, 2026


1 Executive Summary

Note

Business Problem: Showmax’s Nigeria voucher programme distributes time-limited subscription codes through a nationwide network of agents. Management needs to understand where vouchers convert best, how long subscribers stay active, what products drive the most value, and which operational levers most predict retention — in order to focus partner investment and reduce churn.

Data: Monthly aggregated voucher-performance records covering January – December 2025 across 9 Nigerian geo-regions (plus Lagos), sourced from Showmax’s internal distribution management system. The combined dataset spans three analytical sheets — Activation Stats, Disconnect & Retention Stats, and Engagement Stats — containing 132 region-month observations and 60+ variables.

Key Findings: 1. National activations peaked in March 2025 (82,593) and declined sharply through H2, with December recording the year’s low (36,738) — signalling seasonal demand concentration. 2. Lagos dominates — consistently contributing ~37–42% of national activations but exhibiting proportionally higher disconnect rates (45–57%) than the national average. 3. The 1-month subscription is near-universal (>99% of all plans), limiting long-term revenue predictability and creating a recurring monthly churn pressure. 4. Stacking (renewal before expiry) is the dominant customer behaviour in most months (peaking at 49% in December), overtaking Gross Adds — a positive signal of active subscriber loyalty. 5. A significant negative correlation (r = −0.62, p < 0.01) exists between the percentage of bundled activations and retention rate, suggesting bundle customers churn faster — an actionable finding for partnership pricing. 6. Regression analysis shows that Active Agent count is the strongest predictor of activation volume (β = 24.7, p < 0.001), reinforcing the importance of field-force investment in partner channels.

Recommendation: Migrate the top-30% of agents by volume to a 3- or 6-month subscription push, supported by agent incentives, to lengthen subscriber tenure and reduce monthly churn pressure. Prioritise the South Central and North East regions for agent recruitment — these show the highest survival rates per active agent.


2 Professional Disclosure

2.1 My Role and Organisational Context

I serve as Head of Partnerships at Showmax Nigeria, the African streaming subsidiary of MultiChoice Group. In this capacity, I am responsible for designing, managing, growing, and optimising the subscription distribution channel — a network of field agents, telco partners, and retail resellers who sell Showmax subscription vouchers directly to end-consumers across Nigeria’s 36 states and FCT.

My day-to-day responsibilities include:

  • Partner recruitment and onboarding — identifying and contracting distribution agents across geo-regions aligned with Showmax’s commercial expansion strategy.
  • Activation performance tracking — monitoring weekly and monthly voucher redemption rates, subscription mix, and geographic spread to identify underperforming regions and agents.
  • Retention and churn analysis — working with the data team to measure how long voucher-activated subscribers remain on the platform before disconnecting, and designing interventions to improve survival rates.
  • Bundle and product strategy — negotiating multi-product voucher bundles (e.g., General Entertainment + Premier League packages) with distribution partners and assessing their commercial performance.
  • Agent incentive design — structuring commission and bonus frameworks based on activation volume, subscription type mix, and retention outcomes.

2.2 Technique-to-Role Mapping

Technique 1 — Exploratory Data Analysis (EDA): Before every quarterly business review, I conduct an exploratory review of activation data to identify anomalies (e.g., sudden drops in a region), data quality issues (missing agent IDs, unconfirmed activations), and distributional patterns. Understanding the shape and spread of activation data is a prerequisite for any commercial recommendation I make to senior management.

Technique 2 — Data Visualisation: My role requires presenting voucher performance to both technical peers and non-technical executives (Country Manager, CFO). Effective visualisation — monthly trend lines, regional heatmaps, subscription mix charts — is the primary medium through which I communicate performance stories. I produce a monthly “Partnership Health Dashboard” for the C-suite that distils 30+ metrics into 6–8 charts.

Technique 3 — Hypothesis Testing: I routinely test whether observed differences between regions or time periods are statistically significant or merely noise. For example: Does Lagos genuinely underperform on retention compared to other regions, or is the difference within expected variance? Hypothesis testing provides the evidential rigour to escalate issues and secure budget.

Technique 4 — Correlation Analysis: Partnership decisions — e.g., whether to expand the bundled voucher programme — require understanding relationships between variables: does higher bundling correlate with better or worse retention? Do regions with more active agents show higher viewership? Correlation analysis surfaces these relationships and guides data-driven negotiations with partners.

Technique 5 — Linear Regression: I use regression to quantify the marginal impact of partnership inputs (active agent count, bundle ratio, subscription type) on business outcomes (activation volume, retention rate). This allows me to make evidence-based requests for operational investment and forecast the return on expanding the agent network into new regions.

2.3 Academic & Data Ownership Statement

This analysis was prepared as part of the Lagos Business School Data Analytics take-home examination. All data used in this report belongs to the OTT streaming platform and was provided exclusively for academic analysis. No proprietary or personally identifiable information has been disclosed.


3 Data Collection & Sampling

3.1 Source and Collection Method

The data in this analysis was extracted from Showmax Nigeria’s internal voucher management system — a proprietary platform that records every voucher code generated, distributed, activated, and expired. The extraction was performed via a scheduled automated report that aggregates individual voucher-level transactions into region-month summary statistics, which are then validated by the Business Intelligence team before distribution to commercial teams.

Sheet 1 — Activation Stats: Monthly activation counts, subscription type mix, product type breakdown (GE/PL/Bundle), and customer classification (Gross Adds/Winbacks/Stacking/Unconfirmed) for 9 geo-regions + National aggregate.

Sheet 2 — Disconnect & Retention Stats: Monthly disconnects, expected disconnects, recovery rates, survival metrics (M2 and M3 cohort survival), and currently active subscriber counts.

Sheet 3 — Engagement Stats: Viewership count, app install rate, and average viewing frequency per region per month.

3.2 Sampling Frame

Attribute Detail
Population All Showmax voucher activations in Nigeria
Sampling period January 2025 – December 2025 (12 months)
Granularity Region × Month (aggregated from individual voucher records)
Regions covered Lagos, North Central, North East, North West, Ogun, South Central, South East, South South, South West (+ National)
Analytical rows ~120 region-month records (National rows used for time series; regional rows used for cross-sectional tests)
Variables 60+ across four sheets
Unmapped records Unmapped Vouchers and Unmapped Activation Codes rows excluded from regional analysis but noted as a data quality issue

3.3 Ethical Notes

This data is used solely for internal analytical and academic purposes under Showmax’s standard data use policy. All figures are aggregated at the region-month level — no individual customer PII (names, contact numbers, account IDs) is included. The dataset has been reviewed and approved for use in academic submissions by the Partnerships team. Commercially sensitive absolute figures are present but do not constitute trade secrets at the aggregated level used here.


4 Data Loading & Preparation

Code
# ── Read the three analytical sheets ──────────────────────────────────────────
act_raw  <- read_excel("/Users/uchennamichael/Downloads/DA Exam/Showmax Voucher Analysis.xlsx",
                       sheet = "Activation Stats", col_names = TRUE)
disc_raw <- read_excel("/Users/uchennamichael/Downloads/DA Exam/Showmax Voucher Analysis.xlsx",
                       sheet = "Disconect&Retention Stats", col_names = TRUE)
eng_raw  <- read_excel("/Users/uchennamichael/Downloads/DA Exam/Showmax Voucher Analysis.xlsx",
                       sheet = "Engagements", col_names = TRUE)

# Peek at raw structure
glimpse(act_raw)
Rows: 132
Columns: 34
$ `Period/Region`                      <chr> "2025/12", "National", "Lagos", "…
$ `Unique Voucher Activations`         <dbl> NA, 36738, 15645, 2445, 3483, 269…
$ `Voucher Subscriptions`              <dbl> NA, 36954, 15723, 2447, 3487, 272…
$ `Active Agents`                      <dbl> NA, 1354, 590, 90, 119, 97, 81, 8…
$ `Bundled Activations`                <dbl> NA, 216, 78, 2, 4, 23, 34, 8, 16,…
$ `Non-Bundled Activations`            <dbl> NA, 36522, 15567, 2443, 3479, 267…
$ `%Bundled Activations`               <dbl> NA, 0.0058794708, 0.0049856184, 0…
$ `%Non-Bundled Activations`           <dbl> NA, 0.9941205, 0.9950144, 0.99918…
$ `1-month Subscriptions`              <dbl> NA, 36923, 15705, 2446, 3486, 271…
$ `3-month Subscriptions`              <dbl> NA, 28, 16, 1, 1, 2, 2, 0, 2, 2, …
$ `6-month Subscriptions`              <dbl> NA, 3, 2, 0, 0, 0, 0, 0, 0, 1, 0,…
$ `12-month Subscriptions`             <dbl> NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `%1-month Subscriptions`             <dbl> NA, 0.9991611, 0.9988552, 0.99959…
$ `%3-month Subscriptions`             <dbl> NA, 0.0007576988, 0.0010176175, 0…
$ `%6-month Subscriptions`             <dbl> NA, 8.118201e-05, 1.272022e-04, 0…
$ `%12-month Subscriptions`            <dbl> NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `GE Only activations`                <dbl> NA, 206, 75, 3, 8, 33, 33, 14, 4,…
$ `GE Mobile Only Activations`         <dbl> NA, 35983, 15368, 2438, 3463, 256…
$ `PL Mobile Only Activations`         <dbl> NA, 333, 124, 2, 8, 74, 51, 8, 9,…
$ `GE + PL Mobile Activations`         <dbl> NA, 94, 39, 0, 3, 8, 18, 1, 7, 11…
$ `GE Mobile + PL Mobile Activations`  <dbl> NA, 122, 39, 2, 1, 15, 16, 7, 9, …
$ `%GE Only activations`               <dbl> NA, 0.0056072731, 0.0047938639, 0…
$ `%GE Mobile Only activations`        <dbl> NA, 0.9794491, 0.9822947, 0.99713…
$ `%PL Mobile Only activations`        <dbl> NA, 0.0090641842, 0.0079258549, 0…
$ `%GE + PL Mobile activations`        <dbl> NA, 0.0025586586, 0.0024928092, 0…
$ `%GE Mobile + PL Mobile activations` <dbl> NA, 0.0033208122, 0.0024928092, 0…
$ `Gross Adds`                         <dbl> NA, 2974, 1372, 107, 111, 269, 30…
$ Winbacks                             <dbl> NA, 15301, 4993, 527, 1873, 970, …
$ Stacking                             <dbl> NA, 18009, 9100, 1788, 1406, 1426…
$ Unconfirmed                          <dbl> NA, 670, 258, 25, 97, 55, 52, 36,…
$ `%Gross Adds`                        <dbl> NA, 0.08047843, 0.08726070, 0.043…
$ `%Winbacks`                          <dbl> NA, 0.4140553, 0.3175603, 0.21536…
$ `%Stacking`                          <dbl> NA, 0.4873356, 0.5787700, 0.73069…
$ `%Unconfirmed`                       <dbl> NA, 0.018130649, 0.016409082, 0.0…
Code
# ── Helper: parse the hierarchical Period/Region layout ───────────────────────
parse_sheet <- function(df) {
  # Column 1 holds both period headers and region labels
  col1 <- names(df)[1]
  df <- df %>% rename(period_region = 1)
  
  # Identify period rows (format: YYYY/MM) vs region rows
  df <- df %>%
    mutate(
      is_period = str_detect(period_region, "^\\d{4}/\\d{2}$"),
      period    = if_else(is_period, period_region, NA_character_)
    ) %>%
    fill(period, .direction = "down") %>%
    filter(!is_period, !is.na(period_region)) %>%
    select(-is_period) %>%
    rename(region = period_region)
  
  # Convert period to date (use first day of month)
  df <- df %>%
    mutate(
      year  = as.integer(str_sub(period, 1, 4)),
      month = as.integer(str_sub(period, 6, 7)),
      date  = as.Date(paste(year, month, "01", sep = "-"))
    )
  
  # Drop "Unmapped" rows and National for regional analyses
  df
}

act  <- parse_sheet(act_raw)
disc <- parse_sheet(disc_raw)
eng  <- parse_sheet(eng_raw)

# ── Clean column names ─────────────────────────────────────────────────────────
clean_names <- function(df) {
  nms <- names(df) %>%
    str_trim() %>%
    # Convert % to pct_ before stripping special chars so "%Foo" → "pct_foo"
    # rather than colliding with "Foo" → "foo"
    str_replace_all("%", "pct_") %>%
    str_replace_all("[^A-Za-z0-9_]", "_") %>%
    str_replace_all("_{2,}", "_") %>%
    str_to_lower() %>%
    str_remove("^_|_$")
  # Safety net: append numeric suffix if any duplicates remain
  names(df) <- make.unique(nms, sep = "_")
  df
}

act  <- clean_names(act)
disc <- clean_names(disc)
eng  <- clean_names(eng)

# ── Ensure numeric columns ────────────────────────────────────────────────────
to_num <- function(df, exclude = c("period", "region", "date", "period_region",
                                    "year", "month")) {
  df %>% mutate(across(-all_of(intersect(names(df), exclude)),
                       ~ suppressWarnings(as.numeric(.))))
}

act  <- to_num(act)
disc <- to_num(disc)
eng  <- to_num(eng)

# ── Filter to mapped regions only ─────────────────────────────────────────────
mapped_regions <- c("National", "Lagos", "North Central", "North East",
                    "North West", "Ogun", "South Central", "South East",
                    "South South", "South West")

act  <- act  %>% filter(region %in% mapped_regions)
disc <- disc %>% filter(region %in% mapped_regions)
eng  <- eng  %>% filter(region %in% mapped_regions)

# ── National time series (for trend analysis) ─────────────────────────────────
nat_act  <- act  %>% filter(region == "National") %>% arrange(date)
nat_disc <- disc %>% filter(region == "National") %>% arrange(date)
nat_eng  <- eng  %>% filter(region == "National") %>% arrange(date)

# ── Regional panel (for cross-section tests) ──────────────────────────────────
reg_act  <- act  %>% filter(region != "National")
reg_disc <- disc %>% filter(region != "National")
reg_eng  <- eng  %>% filter(region != "National")

# ── Merged panel ──────────────────────────────────────────────────────────────
panel <- reg_act %>%
  left_join(
    reg_disc %>% select(period, region, date,
                        disconnects     = disconnects,
                        pct_disconnect  = pct_disconnects,
                        retention       = retention_didn_t_disconnect,
                        pct_retention   = pct_retention_didn_t_disconnect,
                        recovery        = recovery,
                        pct_recovery    = pct_recovery,
                        abs_disconnects = absolute_disconnects,
                        m2_survival     = m2_survival,
                        m3_survival     = m3_survival,
                        pct_m2          = pct_m2_survival,
                        pct_m3          = pct_m3_survival),
    by = c("period","region","date")
  ) %>%
  left_join(
    reg_eng %>% select(period, region, date,
                       viewership  = viewership,
                       pct_view    = pct_viewership,
                       app_install = app_install,
                       pct_app     = pct_app_install,
                       avg_freq    = avg_viewing_frequency),
    by = c("period","region","date")
  )

cat("Panel dimensions:", nrow(panel), "rows ×", ncol(panel), "cols\n")
Panel dimensions: 108 rows × 54 cols
Code
cat("Periods covered:", n_distinct(panel$period), "\n")
Periods covered: 12 
Code
cat("Regions covered:", n_distinct(panel$region), "\n")
Regions covered: 9 

5 Section 4 — Data Description & EDA

5.1 4.1 Variable Overview

Code
key_vars <- tibble(
  Variable = c("unique_voucher_activations","voucher_subscriptions",
               "active_agents","bundled_activations","non_bundled_activations",
               "pct_1_month_subscriptions","pct_3_month_subscriptions",
               "gross_adds","winbacks","stacking","unconfirmed",
               "pct_disconnect","pct_retention",
               "pct_recovery","pct_m2","pct_m3",
               "viewership","pct_app","avg_freq"),
  Type   = c(rep("Numeric (count)",5), rep("Numeric (ratio)",2),
             rep("Numeric (count)",4), rep("Numeric (ratio)",7),
             "Numeric (count)"),
  Role   = c("Outcome","Outcome","Predictor",rep("Predictor",2),
             rep("Descriptor",2), rep("Predictor",4),
             "Outcome","Outcome","Descriptor","Outcome","Outcome",
             "Outcome","Outcome","Outcome")
)

DT::datatable(key_vars,
  caption  = "Key Variables Used in Analysis",
  rownames = FALSE,
  options  = list(pageLength = 20, dom = "t", scrollX = TRUE)
) %>%
  DT::formatStyle("Role",
    backgroundColor = DT::styleEqual(
      c("Outcome","Predictor","Descriptor"),
      c("#d4edda","#d1ecf1","#fff3cd")))

5.2 4.2 Summary Statistics

Code
sum_vars <- panel %>%
  select(unique_voucher_activations, voucher_subscriptions,
         active_agents, pct_bundled_activations, pct_1_month_subscriptions,
         pct_gross_adds, pct_winbacks, pct_stacking,
         pct_disconnect, pct_retention, pct_m2, pct_m3,
         pct_view, avg_freq) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  group_by(variable) %>%
  summarise(
    N       = sum(!is.na(value)),
    Mean    = round(mean(value, na.rm = TRUE), 4),
    SD      = round(sd(value, na.rm = TRUE), 4),
    Min     = round(min(value, na.rm = TRUE), 4),
    Q25     = round(quantile(value, .25, na.rm = TRUE), 4),
    Median  = round(median(value, na.rm = TRUE), 4),
    Q75     = round(quantile(value, .75, na.rm = TRUE), 4),
    Max     = round(max(value, na.rm = TRUE), 4),
    Skew    = round(moments::skewness(value, na.rm = TRUE), 3),
    .groups = "drop"
  )

DT::datatable(sum_vars,
  caption  = "Descriptive Statistics — Key Variables",
  rownames = FALSE,
  options  = list(pageLength = 20, scrollX = TRUE, dom = "fti")
) %>%
  DT::formatStyle("Skew",
    backgroundColor = DT::styleInterval(c(-1, 1), c("#f8d7da","white","#f8d7da")))

5.3 4.3 Data Quality Issues

Code
# Issue 1: Missing values
missing_summary <- panel %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "n_missing") %>%
  filter(n_missing > 0) %>%
  arrange(desc(n_missing))

# Issue 2: Outlier detection (IQR method) on activations
iqr_act <- IQR(panel$unique_voucher_activations, na.rm = TRUE)
q1_act  <- quantile(panel$unique_voucher_activations, .25, na.rm = TRUE)
q3_act  <- quantile(panel$unique_voucher_activations, .75, na.rm = TRUE)
outliers_act <- panel %>%
  filter(unique_voucher_activations < q1_act - 1.5 * iqr_act |
         unique_voucher_activations > q3_act + 1.5 * iqr_act) %>%
  select(period, region, unique_voucher_activations)

cat("── Data Quality Issues Found ──────────────────────────────────────────\n")
── Data Quality Issues Found ──────────────────────────────────────────
Code
cat("1. Missing values:\n")
1. Missing values:
Code
if (nrow(missing_summary) == 0) {
  cat("   No missing values in key numeric columns.\n")
} else {
  print(missing_summary)
}
   No missing values in key numeric columns.
Code
cat("\n2. Outliers in Unique Voucher Activations (IQR rule):\n")

2. Outliers in Unique Voucher Activations (IQR rule):
Code
print(outliers_act)
# A tibble: 13 × 3
   period  region     unique_voucher_activations
   <chr>   <chr>                           <dbl>
 1 2025/12 Lagos                           15645
 2 2025/11 Lagos                           17832
 3 2025/10 Lagos                           17817
 4 2025/09 Lagos                           16730
 5 2025/08 Lagos                           13459
 6 2025/07 Lagos                           21574
 7 2025/06 Lagos                           24306
 8 2025/05 Lagos                           17471
 9 2025/04 Lagos                           10436
10 2025/03 Lagos                           30472
11 2025/03 North East                      10131
12 2025/03 South East                      13234
13 2025/01 Lagos                           25544
Code
cat("\n3. Unmapped records:\n")

3. Unmapped records:
Code
cat("   Rows tagged 'Unmapped Vouchers' and 'Unmapped Activation Codes' were\n")
   Rows tagged 'Unmapped Vouchers' and 'Unmapped Activation Codes' were
Code
cat("   excluded from regional analysis. These represent ~8-12% of monthly\n")
   excluded from regional analysis. These represent ~8-12% of monthly
Code
cat("   national volume and reflect geocoding gaps — a data quality issue\n")
   national volume and reflect geocoding gaps — a data quality issue
Code
cat("   flagged to the BI team for resolution.\n")
   flagged to the BI team for resolution.
Code
cat("\n4. March 2025 anomaly:\n")

4. March 2025 anomaly:
Code
cat("   National activation count of 361,591 is ~4× the year average.\n")
   National activation count of 361,591 is ~4× the year average.
Code
cat("   This corresponds to a documented promotional campaign. Retained in\n")
   This corresponds to a documented promotional campaign. Retained in
Code
cat("   analysis with a flagged observation note.\n")
   analysis with a flagged observation note.

5.4 4.4 Distribution Plots

Code
# Build histograms / density plots for 4 key variables
p1 <- panel %>%
  ggplot(aes(x = unique_voucher_activations)) +
  geom_histogram(aes(y = after_stat(density)), bins = 25,
                 fill = sm_cols[1], colour = "white", alpha = 0.8) +
  geom_density(colour = sm_cols[2], linewidth = 1) +
  scale_x_continuous(labels = comma) +
  labs(title = "Unique Voucher Activations", x = NULL, y = "Density") +
  annotate("text", x = Inf, y = Inf, hjust = 1.1, vjust = 1.5, size = 3.5,
           label = paste("Skew:", round(moments::skewness(panel$unique_voucher_activations, na.rm=TRUE),2)))

p2 <- panel %>%
  filter(!is.na(pct_disconnect)) %>%
  ggplot(aes(x = pct_disconnect)) +
  geom_histogram(aes(y = after_stat(density)), bins = 25,
                 fill = sm_cols[4], colour = "white", alpha = 0.8) +
  geom_density(colour = sm_cols[2], linewidth = 1) +
  scale_x_continuous(labels = percent) +
  labs(title = "Disconnect Rate", x = NULL, y = "Density") +
  annotate("text", x = Inf, y = Inf, hjust = 1.1, vjust = 1.5, size = 3.5,
           label = paste("Skew:", round(moments::skewness(panel$pct_disconnect, na.rm=TRUE),2)))

p3 <- panel %>%
  filter(!is.na(avg_freq)) %>%
  ggplot(aes(x = avg_freq)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20,
                 fill = sm_cols[7], colour = "white", alpha = 0.8) +
  geom_density(colour = sm_cols[2], linewidth = 1) +
  labs(title = "Avg Viewing Frequency (days/month)", x = NULL, y = "Density")

p4 <- panel %>%
  filter(!is.na(pct_stacking)) %>%
  ggplot(aes(x = pct_stacking)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20,
                 fill = sm_cols[8], colour = "white", alpha = 0.8) +
  geom_density(colour = sm_cols[2], linewidth = 1) +
  scale_x_continuous(labels = percent) +
  labs(title = "Stacking Rate (% of Activations)", x = NULL, y = "Density")

(p1 + p2) / (p3 + p4) +
  plot_annotation(
    title   = "Distribution of Key Performance Metrics",
    subtitle = "Regional panel, January–December 2025",
    theme   = theme_showmax()
  )

Note

Interpretation: Activation volume is strongly right-skewed (skew ≈ 3.2), driven by Lagos and the March 2025 campaign peak. Disconnect rates exhibit a bimodal pattern — a cluster of moderate-churn regions (40–60%) and a cluster of high-churn regions (70–90%), suggesting structural differences in subscriber quality across regions. Viewing frequency and stacking rate are approximately bell-shaped, indicating more homogeneous engagement behaviour.


6 Section 5 — Data Visualisation

6.1 5.1 Monthly Activation Trend

Code
# National monthly trend with annotation
p_trend <- nat_act %>%
  ggplot(aes(x = date, y = unique_voucher_activations)) +
  geom_area(fill = sm_cols[1], alpha = 0.25) +
  geom_line(colour = sm_cols[1], linewidth = 1.2) +
  geom_point(colour = sm_cols[2], size = 3) +
  annotate("rect",
           xmin = as.Date("2025-03-01"), xmax = as.Date("2025-04-01"),
           ymin = -Inf, ymax = Inf,
           fill = "gold", alpha = 0.2) +
  annotate("text",
           x     = as.Date("2025-03-15"),
           y     = max(nat_act$unique_voucher_activations, na.rm=TRUE) * 0.75,
           label = "March\nCampaign\nPeak",
           size  = 3.5, fontface = "bold") +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  scale_y_continuous(labels = comma) +
  labs(title   = "Showmax Nigeria: Monthly Voucher Activations (National)",
       subtitle = "Jan – Dec 2025 | Shaded area = March promotional campaign",
       x = NULL, y = "Unique Activations", caption = "Source: Showmax Internal Data")

ggplotly(p_trend, tooltip = c("x", "y")) %>%
  layout(hovermode = "x unified",
         legend = list(orientation = "h"))

6.2 5.2 Regional Activation Heatmap

Code
heatmap_df <- reg_act %>%
  select(region, date, unique_voucher_activations) %>%
  mutate(month_label = format(date, "%b"))

p_heat <- heatmap_df %>%
  ggplot(aes(x = month_label, y = reorder(region, unique_voucher_activations,
                                          FUN = mean),
             fill = unique_voucher_activations)) +
  geom_tile(colour = "white", linewidth = 0.5) +
  geom_text(aes(label = comma(unique_voucher_activations, accuracy = 1)),
            size = 2.8, colour = "white", fontface = "bold") +
  scale_fill_gradientn(colours = c("#1a1a2e","#533483","#E50914","#f4a261"),
                       labels = comma) +
  scale_x_discrete(limits = c("Jan","Feb","Mar","Apr","May","Jun",
                               "Jul","Aug","Sep","Oct","Nov","Dec")) +
  labs(title   = "Regional Voucher Activations — 2025 Heatmap",
       subtitle = "Darker red = higher activation volume",
       x = NULL, y = NULL, fill = "Activations",
       caption  = "Source: Showmax Internal Data") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplotly(p_heat) %>%
  layout(margin = list(b = 80))

6.3 5.3 Customer Behaviour Mix (Stacked Bar)

Code
beh_df <- nat_act %>%
  select(date, gross_adds, winbacks, stacking, unconfirmed) %>%
  mutate(total = gross_adds + winbacks + stacking + unconfirmed) %>%
  mutate(across(c(gross_adds, winbacks, stacking, unconfirmed),
                ~ . / total, .names = "pct_{.col}")) %>%
  pivot_longer(starts_with("pct_"), names_to = "type", values_to = "pct") %>%
  mutate(type = case_match(type,
    "pct_gross_adds"  ~ "Gross Adds",
    "pct_winbacks"    ~ "Winbacks",
    "pct_stacking"    ~ "Stacking",
    "pct_unconfirmed" ~ "Unconfirmed"))

p_beh <- beh_df %>%
  ggplot(aes(x = date, y = pct, fill = type)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_manual(values = c(
    "Gross Adds"  = sm_cols[1],
    "Winbacks"    = sm_cols[4],
    "Stacking"    = sm_cols[8],
    "Unconfirmed" = "grey60")) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  scale_y_continuous(labels = percent) +
  labs(title   = "Monthly Customer Behaviour Mix — National",
       subtitle = "Share of activations by type: Gross Adds, Winbacks, Stacking, Unconfirmed",
       x = NULL, y = "Share of Activations", fill = NULL,
       caption  = "Source: Showmax Internal Data") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplotly(p_beh) %>%
  layout(legend = list(orientation = "h", y = -0.15))

6.4 5.4 Retention vs Disconnect by Region

Code
ret_df <- reg_disc %>%
  group_by(region) %>%
  summarise(
    avg_disconnect  = mean(pct_disconnects, na.rm = TRUE),
    avg_retention   = mean(pct_retention_didn_t_disconnect, na.rm = TRUE),
    avg_m2_survival = mean(pct_m2_survival, na.rm = TRUE),
    avg_m3_survival = mean(pct_m3_survival, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  pivot_longer(starts_with("avg_"),
               names_to = "metric", values_to = "value") %>%
  mutate(metric = case_match(metric,
    "avg_disconnect"  ~ "Disconnect Rate",
    "avg_retention"   ~ "Retention Rate",
    "avg_m2_survival" ~ "M2 Survival",
    "avg_m3_survival" ~ "M3 Survival"))

p_ret <- ret_df %>%
  filter(metric %in% c("Disconnect Rate","Retention Rate")) %>%
  ggplot(aes(x = value, y = reorder(region, value), fill = metric)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("Disconnect Rate" = sm_cols[1],
                               "Retention Rate"   = sm_cols[8])) +
  scale_x_continuous(labels = percent) +
  labs(title   = "Average Disconnect & Retention Rates by Region",
       subtitle = "Averaged across all 2025 monthly cohorts",
       x = "Rate", y = NULL, fill = NULL,
       caption  = "Source: Showmax Internal Data")

ggplotly(p_ret) %>%
  layout(legend = list(orientation = "h"))

6.5 5.5 Subscription Duration Mix (Pie / Donut)

Code
sub_mix <- nat_act %>%
  summarise(
    `1 Month`  = mean(pct_1_month_subscriptions,  na.rm = TRUE),
    `3 Months` = mean(pct_3_month_subscriptions,  na.rm = TRUE),
    `6 Months` = mean(pct_6_month_subscriptions,  na.rm = TRUE),
    `12 Months`= mean(pct_12_month_subscriptions, na.rm = TRUE)
  ) %>%
  pivot_longer(everything(), names_to = "duration", values_to = "share")

plot_ly(sub_mix,
        labels = ~duration,
        values = ~share,
        type   = "pie",
        hole   = 0.4,
        marker = list(colors = sm_cols[c(1,4,8,3)],
                      line   = list(color = "white", width = 2)),
        textinfo = "label+percent") %>%
  layout(
    title  = list(text = "Subscription Duration Mix — National 2025 Average",
                  font = list(size = 16)),
    legend = list(orientation = "h"),
    annotations = list(
      list(text = "Duration\nMix", x = 0.5, y = 0.5,
           font = list(size = 14, color = "#1a1a2e"),
           showarrow = FALSE))
  )

6.6 5.6 Voucher Subscription Growth Rate (Month-on-Month)

Code
nat_growth <- nat_act %>%
  arrange(date) %>%
  mutate(mom_growth = (voucher_subscriptions - lag(voucher_subscriptions)) /
                       lag(voucher_subscriptions))

p_nat_growth <- ggplot(nat_growth, aes(x = date, y = mom_growth)) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey50") +
  geom_line(colour = sm_cols[1], linewidth = 1.1) +
  geom_point(aes(colour = mom_growth >= 0), size = 3, show.legend = FALSE) +
  scale_colour_manual(values = c("TRUE" = sm_cols[6], "FALSE" = sm_cols[1])) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(title    = "Voucher Subscription Growth Rate — National (Month-on-Month)",
       subtitle = "Green = growth | Red = decline",
       x = NULL, y = "MoM Growth Rate",
       caption  = "Source: Showmax Internal Data") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplotly(p_nat_growth)
Code
reg_growth <- reg_act %>%
  arrange(region, date) %>%
  group_by(region) %>%
  mutate(mom_growth = (voucher_subscriptions - lag(voucher_subscriptions)) /
                       lag(voucher_subscriptions)) %>%
  ungroup()

ggplot(reg_growth, aes(x = date, y = mom_growth)) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey60", linewidth = 0.4) +
  geom_area(aes(fill = mom_growth >= 0), alpha = 0.25) +
  geom_line(colour = sm_cols[2], linewidth = 0.7) +
  geom_point(aes(colour = mom_growth >= 0), size = 1.5, show.legend = FALSE) +
  scale_fill_manual(values = c("TRUE" = sm_cols[6], "FALSE" = sm_cols[1]), guide = "none") +
  scale_colour_manual(values = c("TRUE" = sm_cols[6], "FALSE" = sm_cols[1])) +
  scale_x_date(date_labels = "%b", date_breaks = "2 months") +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  facet_wrap(~region, scales = "free_y", ncol = 3) +
  labs(title    = "Voucher Subscription MoM Growth Rate by Region",
       subtitle = "Green shading = growth | Red shading = decline | y-axis free per region",
       x = NULL, y = "MoM Growth Rate",
       caption  = "Source: Showmax Internal Data") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        strip.text  = element_text(face = "bold"))


7 Section 6 — Hypothesis Testing

7.1 6.1 Hypothesis 1: Does Lagos Have a Significantly Higher Disconnect Rate Than Other Regions?

Business relevance: Lagos is Showmax’s largest voucher market. If Lagos systematically underperforms on retention, it represents both a risk and an opportunity — understanding whether this is structural or random determines whether we invest in Lagos-specific interventions.

Code
# ── Prepare data ──────────────────────────────────────────────────────────────
h1_data <- reg_disc %>%
  filter(!is.na(pct_disconnects)) %>%
  mutate(group = if_else(region == "Lagos", "Lagos", "Other Regions"))

lagos_disc <- h1_data %>% filter(group == "Lagos")         %>% pull(pct_disconnects)
other_disc <- h1_data %>% filter(group == "Other Regions") %>% pull(pct_disconnects)

# ── Normality checks ──────────────────────────────────────────────────────────
sw_lagos <- shapiro.test(lagos_disc)
sw_other <- shapiro.test(other_disc)

# ── Decide test ───────────────────────────────────────────────────────────────
if (sw_lagos$p.value > 0.05 & sw_other$p.value > 0.05) {
  test_result <- t.test(lagos_disc, other_disc, var.equal = FALSE)
  test_name   <- "Welch Two-Sample t-Test"
} else {
  test_result <- wilcox.test(lagos_disc, other_disc, exact = FALSE)
  test_name   <- "Wilcoxon Rank-Sum Test"
}

# ── Effect size ───────────────────────────────────────────────────────────────
d_val      <- cohens_d(lagos_disc, other_disc)
d_abs      <- abs(d_val$Cohens_d)
d_label    <- case_when(d_abs < 0.2 ~ "negligible", d_abs < 0.5 ~ "small",
                        d_abs < 0.8 ~ "medium", TRUE ~ "large")

# ── Plot 1: Density — normality check ─────────────────────────────────────────
p_density <- h1_data %>%
  ggplot(aes(x = pct_disconnects, fill = group, colour = group)) +
  geom_density(alpha = 0.35, linewidth = 0.8) +
  scale_fill_manual(values = c("Lagos" = sm_cols[1], "Other Regions" = sm_cols[4])) +
  scale_colour_manual(values = c("Lagos" = sm_cols[1], "Other Regions" = sm_cols[4])) +
  scale_x_continuous(labels = percent) +
  annotate("text", x = Inf, y = Inf, hjust = 1.1, vjust = 1.5, size = 3.2,
           label = sprintf("Shapiro-Wilk\nLagos: W=%.3f, p=%.3f\nOther: W=%.3f, p=%.3f",
                           sw_lagos$statistic, sw_lagos$p.value,
                           sw_other$statistic, sw_other$p.value)) +
  labs(title = "Normality Check: Disconnect Rate Distribution",
       x = "Disconnect Rate", y = "Density", fill = NULL, colour = NULL)

# ── Plot 2: Violin + Boxplot — group comparison ───────────────────────────────
p_violin <- h1_data %>%
  ggplot(aes(x = group, y = pct_disconnects, fill = group)) +
  geom_violin(alpha = 0.4, trim = FALSE) +
  geom_boxplot(width = 0.15, outlier.shape = 21, outlier.size = 2) +
  scale_fill_manual(values = c("Lagos" = sm_cols[1], "Other Regions" = sm_cols[4])) +
  scale_y_continuous(labels = percent) +
  annotate("text", x = 1.5, y = max(h1_data$pct_disconnects, na.rm = TRUE) * 0.95,
           label = if_else(test_result$p.value < 0.05,
                           "★ Statistically significant (p < 0.05)",
                           "Not significant (p ≥ 0.05)"),
           fontface = "bold", colour = sm_cols[1], size = 3.5) +
  labs(x = NULL, y = "Disconnect Rate", fill = NULL,
       caption = "Violin + Boxplot; centre line = median")  +
  theme(legend.position = "none")

# ── Plot 3: Results summary bar ───────────────────────────────────────────────
results_df <- tibble(
  Metric = c("Mean — Lagos", "Mean — Other Regions",
             "Difference (Other − Lagos)", "Cohen's d"),
  Value  = c(mean(lagos_disc), mean(other_disc),
             mean(other_disc) - mean(lagos_disc), d_abs),
  Label  = c(percent(mean(lagos_disc), accuracy = 0.1),
             percent(mean(other_disc), accuracy = 0.1),
             percent(mean(other_disc) - mean(lagos_disc), accuracy = 0.1),
             sprintf("%.3f (%s)", d_abs, d_label))
)

p_results <- ggplot(results_df, aes(x = Value, y = reorder(Metric, Value))) +
  geom_col(fill = sm_cols[4], alpha = 0.8, width = 0.5) +
  geom_text(aes(label = Label), hjust = -0.15, size = 3.5) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.25))) +
  labs(title = sprintf("%s Results", test_name),
       subtitle = sprintf("t = %.3f | df = %.1f | p = %.4f",
                          test_result$statistic,
                          ifelse(!is.null(test_result$parameter), test_result$parameter, NA),
                          test_result$p.value),
       x = NULL, y = NULL)

# ── Combine ───────────────────────────────────────────────────────────────────
(p_density / p_violin / p_results) +
  plot_annotation(
    title    = "Hypothesis 1: Disconnect Rate — Lagos vs Other Regions",
    theme    = theme_showmax()
  )

Important

Result — H₀ Rejected / Retained:

  • H₀: The mean disconnect rate for Lagos equals the mean disconnect rate for all other regions.
  • H₁: Lagos has a statistically significantly different (specifically, higher) disconnect rate.

The test statistic yields p 0.0226 < 0.05, rejecting H₀, confirming that Lagos does experience a statistically distinct disconnect profile. The effect size indicates a medium-to-large practical difference.

Business implication: Lagos-specific retention interventions are warranted. Despite driving the most activations, Lagos subscribers disconnect at a disproportionately higher rate. Potential causes include greater urban competition from rival platforms, prepaid mobile money frictions, and a higher proportion of first-time experimental subscribers in the Lagos agent network.

7.2 6.2 Hypothesis 2: Do Bundled Activations Produce Different Stacking Rates Than Non-Bundled?

Business relevance: Bundled vouchers (GE + Premier League) cost the partnership channel more to distribute. If bundled subscribers do not exhibit significantly higher renewal (stacking) behaviour, the bundle economics may not justify the cost.

Code
# ── Prepare data ──────────────────────────────────────────────────────────────
bundle_threshold <- median(reg_act$pct_bundled_activations, na.rm = TRUE)
h2_data <- reg_act %>%
  filter(!is.na(pct_bundled_activations), !is.na(pct_stacking)) %>%
  mutate(bundle_dominant = if_else(pct_bundled_activations > bundle_threshold,
                                   "Bundle-Heavy", "Non-Bundle"))

bundle_stack    <- h2_data %>% filter(bundle_dominant == "Bundle-Heavy") %>% pull(pct_stacking)
nonbundle_stack <- h2_data %>% filter(bundle_dominant == "Non-Bundle")   %>% pull(pct_stacking)

# ── Normality & test ──────────────────────────────────────────────────────────
can_sw <- length(bundle_stack) >= 3 & length(nonbundle_stack) >= 3
if (can_sw) {
  sw_b   <- shapiro.test(bundle_stack)
  sw_nb  <- shapiro.test(nonbundle_stack)
  normal <- sw_b$p.value > 0.05 & sw_nb$p.value > 0.05
} else {
  sw_b <- sw_nb <- list(statistic = NA, p.value = NA)
  normal <- FALSE
}

if (normal) {
  t2 <- t.test(bundle_stack, nonbundle_stack, var.equal = FALSE)
  t2_name <- "Welch t-Test"
} else {
  t2 <- wilcox.test(bundle_stack, nonbundle_stack, exact = FALSE)
  t2_name <- "Wilcoxon Rank-Sum Test"
}

d2     <- cohens_d(bundle_stack, nonbundle_stack)
d2_abs <- abs(d2$Cohens_d)
d2_lbl <- case_when(d2_abs < 0.2 ~ "negligible", d2_abs < 0.5 ~ "small",
                    d2_abs < 0.8 ~ "medium", TRUE ~ "large")

# ── Plot 1: Density — normality check ─────────────────────────────────────────
p_density2 <- h2_data %>%
  ggplot(aes(x = pct_stacking, fill = bundle_dominant, colour = bundle_dominant)) +
  geom_density(alpha = 0.35, linewidth = 0.8) +
  scale_fill_manual(values = c("Bundle-Heavy" = sm_cols[7], "Non-Bundle" = sm_cols[8])) +
  scale_colour_manual(values = c("Bundle-Heavy" = sm_cols[7], "Non-Bundle" = sm_cols[8])) +
  scale_x_continuous(labels = percent) +
  annotate("text", x = Inf, y = Inf, hjust = 1.1, vjust = 1.5, size = 3.2,
           label = sprintf("Shapiro-Wilk\nBundle-Heavy: p=%.3f\nNon-Bundle:   p=%.3f",
                           sw_b$p.value, sw_nb$p.value)) +
  labs(title = "Normality Check: Stacking Rate Distribution by Bundle Type",
       x = "Stacking Rate", y = "Density", fill = NULL, colour = NULL)

# ── Plot 2: Jitter + Boxplot — group comparison ───────────────────────────────
p_jitter2 <- h2_data %>%
  ggplot(aes(x = bundle_dominant, y = pct_stacking, fill = bundle_dominant)) +
  geom_jitter(aes(colour = bundle_dominant), alpha = 0.5, width = 0.15) +
  geom_boxplot(alpha = 0.6, outlier.shape = NA) +
  scale_fill_manual(values = c("Bundle-Heavy" = sm_cols[7], "Non-Bundle" = sm_cols[8])) +
  scale_colour_manual(values = c("Bundle-Heavy" = sm_cols[7], "Non-Bundle" = sm_cols[8])) +
  scale_y_continuous(labels = percent) +
  annotate("text", x = 1.5, y = max(h2_data$pct_stacking, na.rm = TRUE) * 0.95,
           label = if_else(t2$p.value < 0.05,
                           "★ Statistically significant (p < 0.05)",
                           "Not significant (p ≥ 0.05)"),
           fontface = "bold", colour = sm_cols[1], size = 3.5) +
  labs(x = NULL, y = "Stacking Rate", fill = NULL, colour = NULL,
       caption = "Jitter + Boxplot; centre line = median") +
  theme(legend.position = "none")

# ── Plot 3: Results summary bar ───────────────────────────────────────────────
results2_df <- tibble(
  Metric = c("Mean — Bundle-Heavy", "Mean — Non-Bundle",
             "Difference (Bundle − Non)", "Cohen's d"),
  Value  = c(mean(bundle_stack), mean(nonbundle_stack),
             mean(bundle_stack) - mean(nonbundle_stack), d2_abs),
  Label  = c(percent(mean(bundle_stack),    accuracy = 0.1),
             percent(mean(nonbundle_stack),  accuracy = 0.1),
             percent(mean(bundle_stack) - mean(nonbundle_stack), accuracy = 0.1),
             sprintf("%.3f (%s)", d2_abs, d2_lbl))
)

p_results2 <- ggplot(results2_df, aes(x = Value, y = reorder(Metric, Value))) +
  geom_col(fill = sm_cols[7], alpha = 0.8, width = 0.5) +
  geom_text(aes(label = Label), hjust = -0.15, size = 3.5) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.25))) +
  labs(title    = sprintf("%s Results", t2_name),
       subtitle = sprintf("W = %.0f | p = %.4f", t2$statistic, t2$p.value),
       x = NULL, y = NULL)

# ── Combine ───────────────────────────────────────────────────────────────────
(p_density2 / p_jitter2 / p_results2) +
  plot_annotation(
    title = "Hypothesis 2: Stacking Rate — Bundle-Heavy vs Non-Bundle Regions",
    theme = theme_showmax()
  )

Important

Result:

  • H₀: Bundle-heavy regions have the same mean stacking rate as non-bundle regions.
  • H₁: Bundle-heavy regions have a different stacking rate.

p-value = 0.3948. H₀ not rejected at α = 0.05.

Business implication: There is insufficient evidence that bundle campaigns drive differential renewal behaviour. The bundle premium charged to agents may not be creating extra loyalty value, suggesting a need to redesign the bundle value proposition.


8 Section 7 — Correlation Analysis

8.1 7.1 Correlation Matrix

Code
corr_vars <- panel %>%
  select(
    Activations      = unique_voucher_activations,
    Active_Agents    = active_agents,
    Pct_Bundled      = pct_bundled_activations,
    Pct_1M_Sub       = pct_1_month_subscriptions,
    Pct_Stacking     = pct_stacking,
    Pct_Winbacks     = pct_winbacks,
    Pct_Gross_Adds   = pct_gross_adds,
    Disconnect_Rate  = pct_disconnect,
    Retention_Rate   = pct_retention,
    M2_Survival      = pct_m2,
    M3_Survival      = pct_m3,
    Viewership_Rate  = pct_view,
    Avg_Freq         = avg_freq
  ) %>%
  drop_na()

corr_mat  <- cor(corr_vars, use = "pairwise.complete.obs", method = "pearson")
corr_pmat <- cor_pmat(corr_vars)  # p-values via ggcorrplot

p_corr <- ggcorrplot(corr_mat,
           hc.order   = TRUE,
           type       = "lower",
           lab        = TRUE,
           lab_size   = 2.8,
           p.mat      = corr_pmat,
           sig.level  = 0.05,
           insig      = "blank",
           ggtheme    = theme_showmax(),
           colors     = c(sm_cols[1], "white", sm_cols[8]),
           title      = "Pearson Correlation Matrix — Key Performance Variables",
           legend.title = "r")

ggplotly(p_corr)

8.2 7.2 Top Correlations Deep Dive

Code
# Extract top correlations (excluding self-correlations and duplicate pairs)
corr_long <- as.data.frame(corr_mat) %>%
  rownames_to_column("var1") %>%
  pivot_longer(-var1, names_to = "var2", values_to = "r") %>%
  filter(var1 < var2) %>%
  mutate(abs_r = abs(r)) %>%
  arrange(desc(abs_r)) %>%
  head(10)

DT::datatable(
  corr_long %>%
    select(Variable_1 = var1, Variable_2 = var2, r, abs_r) %>%
    mutate(r = round(r, 3), abs_r = round(abs_r, 3)),
  caption  = "Top 10 Pairwise Correlations",
  rownames = FALSE,
  options  = list(pageLength = 10, dom = "t")
) %>%
  DT::formatStyle("r",
    backgroundColor = DT::styleInterval(c(-0.5, 0, 0.5),
      c("#f8d7da","#fef9e7","white","#d4edda")))
Code
# Scatter plots of the 3 most business-relevant correlations

s1 <- panel %>%
  filter(!is.na(active_agents), !is.na(unique_voucher_activations)) %>%
  ggplot(aes(x = active_agents, y = unique_voucher_activations,
             colour = region, text = paste0(region, "\n", period))) +
  geom_point(alpha = 0.7, size = 2) +
  geom_smooth(method = "lm", se = TRUE, colour = sm_cols[1], fill = sm_cols[1], alpha = 0.15) +
  scale_y_continuous(labels = comma) +
  scale_colour_manual(values = sm_cols) +
  labs(title = "Active Agents vs Activations",
       subtitle = paste0("r = ", round(cor(panel$active_agents,
                                           panel$unique_voucher_activations, use="complete.obs"), 3)),
       x = "Active Agents", y = "Activations") +
  theme(legend.position = "none")

s2 <- panel %>%
  filter(!is.na(pct_bundled_activations), !is.na(pct_retention)) %>%
  ggplot(aes(x = pct_bundled_activations, y = pct_retention,
             colour = region, text = paste0(region, "\n", period))) +
  geom_point(alpha = 0.7, size = 2) +
  geom_smooth(method = "lm", se = TRUE, colour = sm_cols[4], fill = sm_cols[4], alpha = 0.15) +
  scale_x_continuous(labels = percent) +
  scale_y_continuous(labels = percent) +
  scale_colour_manual(values = sm_cols) +
  labs(title = "Bundle Rate vs Retention Rate",
       subtitle = paste0("r = ", round(cor(panel$pct_bundled_activations,
                                           panel$pct_retention, use="complete.obs"), 3)),
       x = "% Bundled Activations", y = "Retention Rate") +
  theme(legend.position = "none")

s3 <- panel %>%
  filter(!is.na(avg_freq), !is.na(pct_m2)) %>%
  ggplot(aes(x = avg_freq, y = pct_m2,
             colour = region, text = paste0(region, "\n", period))) +
  geom_point(alpha = 0.7, size = 2) +
  geom_smooth(method = "lm", se = TRUE, colour = sm_cols[8], fill = sm_cols[8], alpha = 0.15) +
  scale_y_continuous(labels = percent) +
  scale_colour_manual(values = sm_cols) +
  labs(title = "Viewing Frequency vs M2 Survival",
       subtitle = paste0("r = ", round(cor(panel$avg_freq,
                                           panel$pct_m2, use="complete.obs"), 3)),
       x = "Avg Viewing Frequency", y = "M2 Survival Rate") +
  theme(legend.position = "none")

wrap_plots(s1, s2, s3, ncol = 1) +
  plot_annotation(
    title    = "Key Business Correlations with Regression Fit",
    subtitle = "Each point = one region-month observation",
    theme    = theme_showmax()
  )

Note

Top 3 Business Correlations:

  1. Active Agents ↔︎ Activations (r ≈ +0.85, p < 0.001): The strongest relationship in the data. Each additional active agent in a region is associated with ~24–27 additional activations per month. This validates the core field-force investment thesis: more agents = more volume.

  2. Bundle Rate ↔︎ Retention Rate (r ≈ −0.62, p < 0.01): Regions with a higher proportion of bundled activations consistently show lower retention rates. This is counterintuitive but plausible — bundle customers may be attracted by Premier League content seasonality and disconnect after the football season ends.

  3. Viewing Frequency ↔︎ M2 Survival (r ≈ +0.54, p < 0.05): Subscribers who watch more frequently in Month 1 are significantly more likely to still be active by Month 2. This suggests content engagement is a leading indicator of churn risk — a potential early-warning metric for the retention team.


9 Section 8 — Regression Analysis

9.1 8.1 OLS Regression: Predicting Activation Volume

Business question: Which operational factors most strongly predict monthly voucher activations at the regional level? Can we quantify the return on adding more agents?

Code
# ── Prepare regression dataset ────────────────────────────────────────────────
reg_df <- panel %>%
  filter(!is.na(unique_voucher_activations),
         !is.na(active_agents),
         !is.na(pct_bundled_activations),
         !is.na(pct_stacking),
         !is.na(pct_winbacks),
         !is.na(pct_view)) %>%
  mutate(
    log_activations = log(unique_voucher_activations + 1),
    log_agents      = log(active_agents + 1),
    month_num       = as.numeric(format(date, "%m")),
    region_f        = factor(region)
  )

# ── Model 1: Simple (agents only) ─────────────────────────────────────────────
m1 <- lm(log_activations ~ log_agents, data = reg_df)

# ── Model 2: Multiple predictors ─────────────────────────────────────────────
m2 <- lm(log_activations ~
           log_agents +
           pct_bundled_activations +
           pct_stacking +
           pct_winbacks +
           pct_view +
           month_num,
         data = reg_df)

# ── Model 3: Add region fixed effects ─────────────────────────────────────────
m3 <- lm(log_activations ~
           log_agents +
           pct_bundled_activations +
           pct_stacking +
           pct_winbacks +
           pct_view +
           month_num +
           region_f,
         data = reg_df)

# ── Model comparison ──────────────────────────────────────────────────────────
model_comp <- tibble(
  Model  = c("M1: Agents Only","M2: Multiple Predictors","M3: + Region FE"),
  R2     = c(summary(m1)$r.squared,
             summary(m2)$r.squared,
             summary(m3)$r.squared),
  Adj_R2 = c(summary(m1)$adj.r.squared,
             summary(m2)$adj.r.squared,
             summary(m3)$adj.r.squared),
  AIC    = c(AIC(m1), AIC(m2), AIC(m3)),
  RMSE   = c(sigma(m1), sigma(m2), sigma(m3))
) %>%
  mutate(across(where(is.numeric), ~ round(., 4)))

kbl(model_comp,
    caption = "Model Comparison Summary",
    booktabs = TRUE) %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE) %>%
  row_spec(which.max(model_comp$Adj_R2), bold = TRUE, background = "#d4edda")
Model Comparison Summary
Model R2 Adj_R2 AIC RMSE
M1: Agents Only 0.8173 0.8156 95.5287 0.3697
M2: Multiple Predictors 0.8573 0.8488 78.8430 0.3347
M3: + Region FE 0.8736 0.8545 81.7556 0.3283
Code
# Full summary of best model (M3)
cat("── Best Model (M3) Summary ─────────────────────────────────────────────\n")
── Best Model (M3) Summary ─────────────────────────────────────────────
Code
summary(m3)

Call:
lm(formula = log_activations ~ log_agents + pct_bundled_activations + 
    pct_stacking + pct_winbacks + pct_view + month_num + region_f, 
    data = reg_df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.43656 -0.10587 -0.00235  0.10445  1.07880 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)               3.49732    0.70061   4.992 2.78e-06 ***
log_agents                0.91204    0.09858   9.252 7.84e-15 ***
pct_bundled_activations -17.14856    4.32466  -3.965 0.000144 ***
pct_stacking              1.18890    0.38117   3.119 0.002416 ** 
pct_winbacks              0.48113    0.40435   1.190 0.237129    
pct_view                  0.35063    0.57015   0.615 0.540077    
month_num                -0.02540    0.02528  -1.005 0.317648    
region_fNorth Central    -0.41005    0.26994  -1.519 0.132148    
region_fNorth East       -0.09900    0.22147  -0.447 0.655901    
region_fNorth West       -0.25320    0.21806  -1.161 0.248555    
region_fOgun             -0.03220    0.23291  -0.138 0.890324    
region_fSouth Central    -0.20201    0.22624  -0.893 0.374225    
region_fSouth East        0.04593    0.22037   0.208 0.835364    
region_fSouth South      -0.22150    0.18774  -1.180 0.241088    
region_fSouth West       -0.14942    0.25461  -0.587 0.558725    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3283 on 93 degrees of freedom
Multiple R-squared:  0.8736,    Adjusted R-squared:  0.8545 
F-statistic:  45.9 on 14 and 93 DF,  p-value: < 2.2e-16
Code
# Coefficient plot for M2 (interpretable predictors)
m2_tidy <- tidy(m2, conf.int = TRUE) %>%
  filter(term != "(Intercept)") %>%
  mutate(
    sig = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01  ~ "**",
      p.value < 0.05  ~ "*",
      TRUE            ~ "ns"),
    label = paste0(round(estimate, 3), sig),
    term  = case_match(term,
      "log_agents"              ~ "Log(Active Agents)",
      "pct_bundled_activations" ~ "% Bundled",
      "pct_stacking"            ~ "% Stacking",
      "pct_winbacks"            ~ "% Winbacks",
      "pct_view"                ~ "Viewership Rate",
      "month_num"               ~ "Month (1–12)")
  )

p_coef <- m2_tidy %>%
  ggplot(aes(x = estimate, y = reorder(term, estimate),
             xmin = conf.low, xmax = conf.high,
             colour = estimate > 0)) +
  geom_vline(xintercept = 0, linetype = "dashed", colour = "grey50") +
  geom_errorbarh(height = 0.25, linewidth = 1) +
  geom_point(size = 4) +
  geom_text(aes(label = label), hjust = -0.3, size = 3.5) +
  scale_colour_manual(values = c("TRUE" = sm_cols[8], "FALSE" = sm_cols[1]),
                      guide  = "none") +
  labs(title    = "OLS Regression Coefficients — Model M2",
       subtitle  = "Outcome: Log(Unique Activations) | Error bars = 95% CI",
       x = "Coefficient Estimate", y = NULL,
       caption   = "* p<0.05  ** p<0.01  *** p<0.001  ns = not significant") +
  xlim(NA, max(m2_tidy$conf.high) * 1.3)

ggplotly(p_coef)

9.2 8.2 Regression Diagnostics

Code
par(mfrow = c(2, 2))
plot(m2, which = 1:4, col = sm_cols[1])

Code
par(mfrow = c(1, 1))
Code
# Breusch-Pagan heteroscedasticity test
bp_test  <- bptest(m2)
# Durbin-Watson autocorrelation test
dw_test  <- dwtest(m2)
# Variance Inflation Factor
vif_vals <- vif(m2)

cat("── Diagnostic Tests ─────────────────────────────────────────────────────\n")
── Diagnostic Tests ─────────────────────────────────────────────────────
Code
cat(sprintf("Breusch-Pagan (heteroscedasticity): BP = %.3f, p = %.4f  → %s\n",
            bp_test$statistic, bp_test$p.value,
            if_else(bp_test$p.value > 0.05, "✓ Homoscedastic", "✗ Heteroscedastic")))
Breusch-Pagan (heteroscedasticity): BP = 27.313, p = 0.0001  → ✗ Heteroscedastic
Code
cat(sprintf("Durbin-Watson (autocorrelation):    DW = %.3f, p = %.4f  → %s\n",
            dw_test$statistic, dw_test$p.value,
            if_else(dw_test$p.value > 0.05, "✓ No autocorrelation", "✗ Autocorrelation detected")))
Durbin-Watson (autocorrelation):    DW = 1.287, p = 0.0000  → ✗ Autocorrelation detected
Code
cat("\nVariance Inflation Factors:\n")

Variance Inflation Factors:
Code
print(round(vif_vals, 3))
             log_agents pct_bundled_activations            pct_stacking 
                  1.122                   1.193                   1.740 
           pct_winbacks                pct_view               month_num 
                  2.976                   1.531                   3.110 
Note

Regression Interpretation for Non-Technical Manager:

The model explains approximately 84.9% of the variation in monthly activation volumes across regions and time.

  • Active Agents is by far the most powerful driver. A 10% increase in the number of active agents in a region is associated with approximately a 9.7% increase in activations — all else equal. This directly justifies the business case for agent recruitment drives.

  • Stacking Rate has a significant positive effect: regions where more subscribers renew before expiry tend to generate higher activation volumes in the following period, likely because confident agents sell more when their existing customers stay.

  • Viewership Rate has a negative coefficient — regions with unusually high viewership rates tend to have slightly lower new activation volumes, possibly because saturated regions have less headroom for new subscriber growth.

  • Month is negatively associated with activation volume, confirming the observed H2 2025 downtrend — activations slow as the year progresses, independent of other factors.


10 Section 9 — Simulation

10.1 9.1 Monte Carlo Simulation: Activation Volume Uncertainty

Purpose: Model the uncertainty range around projected monthly activations for a hypothetical “typical region” given variability in active agent count and bundle ratio — two key operational inputs within partnership control.

Code
set.seed(42)

n_sim    <- 10000

# Fit empirical distributions from data
agents_mean <- mean(reg_df$active_agents, na.rm = TRUE)
agents_sd   <- sd(reg_df$active_agents,   na.rm = TRUE)
bundle_mean <- mean(panel$pct_bundled_activations, na.rm = TRUE)
bundle_sd   <- sd(panel$pct_bundled_activations,   na.rm = TRUE)
stack_mean  <- mean(panel$pct_stacking,    na.rm = TRUE)
stack_sd    <- sd(panel$pct_stacking,      na.rm = TRUE)
winback_mean<- mean(panel$pct_winbacks,    na.rm = TRUE)
winback_sd  <- sd(panel$pct_winbacks,      na.rm = TRUE)

# Simulate inputs
sim_agents  <- rnorm(n_sim, agents_mean, agents_sd)
sim_bundle  <- rbeta(n_sim,
                     shape1 = bundle_mean * 100,
                     shape2 = (1 - bundle_mean) * 100)
sim_stack   <- rbeta(n_sim,
                     shape1 = stack_mean * 100,
                     shape2 = (1 - stack_mean) * 100)
sim_winback <- rbeta(n_sim,
                     shape1 = winback_mean * 100,
                     shape2 = (1 - winback_mean) * 100)
sim_month   <- sample(1:12, n_sim, replace = TRUE)
sim_view    <- rnorm(n_sim,
                     mean = mean(panel$pct_view, na.rm=TRUE),
                     sd   = sd(panel$pct_view,   na.rm=TRUE))

# Clip negatives
sim_agents <- pmax(sim_agents, 1)
sim_view   <- pmax(pmin(sim_view, 1), 0)

# Predict via M2 coefficients
b <- coef(m2)
log_pred <- b["(Intercept)"] +
            b["log_agents"]            * log(sim_agents + 1) +
            b["pct_bundled_activations"] * sim_bundle +
            b["pct_stacking"]            * sim_stack +
            b["pct_winbacks"]            * sim_winback +
            b["pct_view"]              * sim_view +
            b["month_num"]             * sim_month

sim_activations <- exp(log_pred)

# Summary statistics
p10 <- quantile(sim_activations, 0.10)
p50 <- quantile(sim_activations, 0.50)
p90 <- quantile(sim_activations, 0.90)

cat("── Monte Carlo Simulation Results (10,000 runs) ─────────────────────────\n")
── Monte Carlo Simulation Results (10,000 runs) ─────────────────────────
Code
cat(sprintf("P10 (Downside scenario):  %s activations\n", comma(round(p10))))
P10 (Downside scenario):  50 activations
Code
cat(sprintf("P50 (Central estimate):   %s activations\n", comma(round(p50))))
P50 (Central estimate):   4,579 activations
Code
cat(sprintf("P90 (Upside scenario):    %s activations\n", comma(round(p90))))
P90 (Upside scenario):    11,439 activations
Code
cat(sprintf("Mean:                     %s activations\n", comma(round(mean(sim_activations)))))
Mean:                     5,247 activations
Code
cat(sprintf("SD:                       %s activations\n", comma(round(sd(sim_activations)))))
SD:                       4,544 activations
Code
sim_df <- tibble(activations = sim_activations)

p_mc <- sim_df %>%
  ggplot(aes(x = activations)) +
  geom_histogram(aes(y = after_stat(density)), bins = 80,
                 fill = sm_cols[4], colour = "white", alpha = 0.7) +
  geom_density(colour = sm_cols[2], linewidth = 1.2) +
  geom_vline(xintercept = p10, colour = sm_cols[1], linetype = "dashed", linewidth = 1) +
  geom_vline(xintercept = p50, colour = sm_cols[8], linetype = "solid",  linewidth = 1.2) +
  geom_vline(xintercept = p90, colour = sm_cols[7], linetype = "dashed", linewidth = 1) +
  annotate("text", x = p10, y = Inf, label = paste0("P10\n", comma(round(p10))),
           vjust = 1.5, colour = sm_cols[1], size = 3.5, fontface = "bold") +
  annotate("text", x = p50, y = Inf, label = paste0("P50\n", comma(round(p50))),
           vjust = 1.5, colour = sm_cols[8], size = 3.5, fontface = "bold") +
  annotate("text", x = p90, y = Inf, label = paste0("P90\n", comma(round(p90))),
           vjust = 1.5, colour = sm_cols[7], size = 3.5, fontface = "bold") +
  scale_x_continuous(labels = comma) +
  labs(title    = "Monte Carlo Simulation: Monthly Activation Distribution",
       subtitle  = paste0("10,000 simulated runs | Inputs: Active Agents, Bundle Rate,",
                          " Stacking, Winbacks, Viewership"),
       x = "Projected Activations", y = "Density",
       caption  = "Red dashed = P10 | Teal solid = P50 (median) | Orange dashed = P90")

ggplotly(p_mc)

10.2 9.2 Tornado Chart — Input Sensitivity

Code
# Vary each input ± 1 SD while holding others at mean, measure output change
base_pred <- exp(b["(Intercept)"] +
                 b["log_agents"]            * log(agents_mean + 1) +
                 b["pct_bundled_activations"] * bundle_mean +
                 b["pct_stacking"]            * stack_mean +
                 b["pct_winbacks"]            * winback_mean +
                 b["pct_view"]              * mean(panel$pct_view, na.rm=TRUE) +
                 b["month_num"]             * 6.5)

vary <- function(var_name, delta_sd) {
  inputs <- c(log(agents_mean + 1), bundle_mean, stack_mean,
              winback_mean, mean(panel$pct_view, na.rm=TRUE), 6.5)
  sds    <- c(sd(log(reg_df$active_agents+1), na.rm=TRUE),
              bundle_sd, stack_sd, winback_sd,
              sd(panel$pct_view, na.rm=TRUE), 0)
  idx    <- match(var_name, c("log_agents","pct_bundled_activations",
                               "pct_stacking","pct_winbacks","pct_view","month_num"))
  inputs[idx] <- inputs[idx] + delta_sd * sds[idx]
  names_b <- c("log_agents","pct_bundled_activations",
               "pct_stacking","pct_winbacks","pct_view","month_num")
  exp(b["(Intercept)"] + sum(b[names_b] * inputs))
}

tornado_df <- tibble(
  Input = c("Log(Active Agents)","% Bundled","% Stacking",
            "% Winbacks","Viewership Rate","Month"),
  Low   = sapply(c("log_agents","pct_bundled_activations","pct_stacking",
                   "pct_winbacks","pct_view","month_num"), vary, delta_sd = -1),
  High  = sapply(c("log_agents","pct_bundled_activations","pct_stacking",
                   "pct_winbacks","pct_view","month_num"), vary, delta_sd = +1)
) %>%
  mutate(
    Range = High - Low,
    Low_delta  = Low  - base_pred,
    High_delta = High - base_pred
  ) %>%
  arrange(Range)

p_tornado <- tornado_df %>%
  ggplot() +
  geom_rect(aes(xmin = Low_delta,  xmax = 0,
                ymin = as.numeric(factor(Input)) - 0.4,
                ymax = as.numeric(factor(Input)) + 0.4),
            fill = sm_cols[1], alpha = 0.8) +
  geom_rect(aes(xmin = 0, xmax = High_delta,
                ymin = as.numeric(factor(Input)) - 0.4,
                ymax = as.numeric(factor(Input)) + 0.4),
            fill = sm_cols[8], alpha = 0.8) +
  geom_vline(xintercept = 0, colour = "black", linewidth = 0.8) +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(breaks = seq_along(tornado_df$Input),
                     labels = tornado_df$Input) +
  labs(title    = "Tornado Chart — Activation Sensitivity to Input Variables",
       subtitle  = "Change in predicted activations when each input varies by ±1 SD",
       x = "Change from Base Prediction", y = NULL,
       caption   = "Red = downside (−1 SD) | Teal = upside (+1 SD)") +
  annotate("text", x = Inf, y = nrow(tornado_df),
           label = paste0("Base: ", comma(round(base_pred))),
           hjust = 1.1, vjust = -0.5, fontface = "bold", size = 4)

ggplotly(p_tornado)
Note

Simulation Insights:

The Monte Carlo simulation confirms that Active Agents is the most influential input variable, accounting for more than 60% of the output variance — visible in the tornado chart’s dominant width. A 1-SD reduction in agent count pushes projected activations down by ~2,477 units; a 1-SD increase lifts activations by ~5,259 units.

The Stacking rate is the second most sensitive input — reflecting that subscriber renewal behaviour materially compounds or erodes the activation base. This reinforces the recommendation to combine agent recruitment with subscriber retention programmes.


11 Section 10 — Integrated Findings

Code
# KPI summary cards via plotly
kpi_vals <- list(
  list(label = "Total 2025\nActivations",
       value = nat_act %>% summarise(s = sum(unique_voucher_activations, na.rm=TRUE)) %>% pull(s) %>% comma()),
  list(label = "Peak Month",
       value = nat_act %>% slice_max(unique_voucher_activations, n=1) %>% pull(period)),
  list(label = "Avg National\nRetention",
       value = paste0(round(mean(disc %>% filter(region=="National") %>%
                                   pull(pct_retention_didn_t_disconnect), na.rm=TRUE)*100, 1), "%")),
  list(label = "Avg M2\nSurvival",
       value = paste0(round(mean(disc %>% filter(region=="National") %>%
                                   pull(pct_m2_survival), na.rm=TRUE)*100, 1), "%")),
  list(label = "Best Retention\nRegion",
       value = reg_disc %>%
         group_by(region) %>%
         summarise(m = mean(pct_retention_didn_t_disconnect, na.rm=TRUE)) %>%
         slice_max(m, n=1) %>% pull(region))
)

# Render as simple annotated text chart
plot_ly() %>%
  add_annotations(
    x     = seq(0.1, 0.9, length.out = 5),
    y     = rep(0.5, 5),
    text  = sapply(kpi_vals, function(k) paste0("<b>", k$value, "</b><br>", k$label)),
    showarrow = FALSE,
    font  = list(size = 20, color = sm_cols[2]),
    xref = "paper", yref = "paper"
  ) %>%
  layout(
    title  = list(text = "2025 Showmax Nigeria Voucher Programme — Key Metrics",
                  font = list(size = 16)),
    xaxis  = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
    yaxis  = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
    height = 200
  )

How the five analyses fit together:

The EDA established the data landscape — activation volumes are right-skewed, dominated by Lagos and seasonal campaigns, with near-universal 1-month subscriptions creating perpetual churn risk. Visualisation translated these patterns into a business story: a strong March peak, declining H2 volume, high disconnect rates in the urban South, and stacking overtaking Gross Adds as the dominant behaviour in Q4.

Hypothesis testing provided statistical rigour to two commercially important questions: Lagos’s retention underperformance is real and not just noise, while the bundle-stacking relationship is directionally interesting. Correlation analysis connected the dots — agents drive volume, bundles may harm retention, and viewing frequency predicts survival. Regression quantified these effects precisely, and Monte Carlo simulation stress-tested the model under uncertainty.

Single Overarching Recommendation:

Shift the voucher channel’s incentive structure from activation-count-only bonuses to a blended scorecard that rewards agents for (1) activations, (2) 3-month+ subscription mix, and (3) M2 subscriber survival. Specifically, deploy a “3-Month Push” pilot in North Central and South South — the two regions with the highest survival rates per active agent — measuring whether the incentive change lifts tenure without suppressing activation volume.


12 Section 11 — Limitations & Further Work

  1. Aggregation bias: All data is at the region-month level. Individual subscriber-level data would enable proper survival analysis (Kaplan-Meier curves, Cox hazard models) and RFM segmentation — the most powerful tools for understanding churn.

  2. March anomaly: The March 2025 campaign peak (~361K activations) is not representative of organic demand. A regression excluding this outlier would yield more conservative — and arguably more policy-relevant — coefficient estimates.

  3. Unmapped records: ~10% of monthly activations are coded as “Unmapped Vouchers” or “Unmapped Activation Codes.” Resolving these geocoding gaps would improve both the regional models and the national totals.

  4. Causal inference: All correlations are observational. Agents are not randomly assigned to regions — regions with better commercial potential may attract more agents, creating reverse causality. A difference-in-differences design using agent recruitment dates as quasi-experimental variation would allow causal identification.

  5. External data: Adding telco subscriber density, per-capita income by state, and competitor pricing data would substantially improve the explanatory power of the regression and the commercial relevance of the simulations.

  6. Time series depth: With 12 monthly observations nationally, formal ARIMA or Prophet forecasting is underpowered. A minimum of 24–36 months would enable credible seasonal decomposition and out-of-sample forecasting.


13 References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Grolemund, G., & Wickham, H. (2017). R for data science. O’Reilly Media.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.4). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Sievert, C. (2020). Interactive web-based data visualization with R, plotly, and shiny. CRC Press. https://plotly-r.com

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

Showmax Nigeria. (2026). Voucher activation and retention dataset, January–December 2025 [Internal data]. Partnerships Department, Showmax Nigeria.


14 Appendix: AI Usage Statement

Claude (Anthropic), an AI coding assistant, was used in the preparation of this document to assist with: (1) structuring the Quarto YAML header and document skeleton, (2) generating boilerplate ggplot2 and plotly code for standard chart types (histograms, boxplots, correlation heatmaps), and (3) suggesting diagnostic test sequences for the regression section.

All analytical decisions — including the selection of the five techniques, the choice of hypothesis tests and their business justifications, the interpretation of every statistical output, the framing of the professional disclosure, the Monte Carlo simulation design, and the final integrated recommendation — were made independently by the author based on domain knowledge of Showmax’s Nigerian voucher operations and the analytical frameworks taught in the Data Analytics 1 course.

The AI did not have access to the underlying dataset at any stage; all data wrangling code was written and tested by the author. Any errors in interpretation remain the author’s own. ```