Investor Communication Analytics: Understanding and Predicting Email Campaign Engagement at ARM Investment Managers

Author

Oyinkansola Aregbesola

Published

May 21, 2026


1. Executive Summary

ARM Investment Managers distributes daily and periodic research communications to investors and clients via email. This study analyses 2,102 email campaigns sent between 14 May 2024 and 14 May 2026, spanning content categories including daily news summaries, equity market snapshots, earnings reports, and macro research. Using five complementary analytical techniques — Exploratory Data Analysis, Data Visualisation, Hypothesis Testing, Correlation Analysis, and Linear Regression — this study addresses the business question: What factors drive email click-through rates across investor communication campaigns, and do different content types generate statistically significant differences in audience engagement?

Key findings show that ARM’s mean campaign open rate of approximately 23% (mean 22.7%, median 24.5%; see Table 4) sits within the financial services industry benchmark of 20–25%. Earnings & Results campaigns attract the highest average click rates, while high-volume daily campaigns show comparatively lower per-send engagement. Hypothesis tests confirm at the 1% significance level that open and click rates differ significantly across campaign types. Correlation analysis identifies a strong positive relationship between open rate and click rate, which suggests that subject-line and send-time optimisation are the highest-leverage interventions for improving click performance. The regression model explains approximately 21% of variance in click rate, with open rate and campaign type as the strongest predictors. The principal recommendation is a tiered editorial strategy: automate high-frequency, lower-engagement campaigns while directing creative and analytical investment toward high-engagement content categories.


2. Professional Disclosure

Job title: Investment Research Team Lead
Organisation: ARM Investment Managers, Lagos, Nigeria
Sector: Asset Management / Investment Research

ARM Investment Managers is one of Nigeria’s foremost asset management firms, providing investment advisory, portfolio management, and research services to a diverse client base of retail and institutional investors. As Investment Research Team Lead, I oversee the production and distribution of investor communications including daily market snapshots, earnings notes, fixed income updates, macroeconomic research reports, and strategy publications. These communications are distributed primarily via email to a subscriber list comprising investment professionals, retail investors, and institutional clients. The effectiveness of this communication function directly influences client engagement, brand equity, and — ultimately — assets under management.

Technique 1 — Exploratory Data Analysis: EDA is the foundation of any performance review of our communications infrastructure. Before drawing conclusions about which campaigns work, I must verify data quality, understand distributions, identify outliers, and confirm that the metrics exported from our email platform are reliable. This directly maps to my role: every quarter, I review aggregate campaign statistics to brief leadership on communication reach and engagement.

Technique 2 — Data Visualisation: As Team Lead, I regularly present campaign performance to the Head of Research and executive management. Charts showing engagement trends, content-type comparisons, and seasonal patterns are the primary format for these briefings. The grammar of graphics approach ensures that each visualisation is deliberately chosen to answer a specific business question rather than added as a decorative afterthought.

Technique 3 — Hypothesis Testing: Before recommending that the editorial team reallocate resources from daily news distribution toward deeper research content, I need statistical evidence that the observed engagement differences across content types are genuine — not due to chance variation. Hypothesis testing provides this formal standard of evidence.

Technique 4 — Correlation Analysis: Identifying which input variables co-move with click rate informs prioritisation of operational interventions. If open rate and click rate are strongly correlated, then subject-line testing (which drives opens) is more valuable than formatting changes. Understanding these relationships is central to evidence-based content strategy.

Technique 5 — Linear Regression: A regression model translates descriptive observations into quantitative predictions. For management, it answers: “If we improve the open rate on our Earnings Notes by 5 percentage points, how much will click rate change?” This gives a concrete return-on-investment framing to editorial decisions.
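
As a rough illustration of that framing, the sketch below fits a one-predictor least-squares line and reads off the implied change in click rate for a 5-point rise in open rate. The data are synthetic, and the true slope of 0.08, the noise level, and the function name `expected_click_change` are illustrative assumptions rather than ARM results. The resulting figure describes an association under the fitted model, not a guaranteed causal effect.

```python
import numpy as np

def expected_click_change(open_rate, click_rate, delta_open=5.0):
    """Fit click_rate = b0 + b1 * open_rate by OLS and return b1 * delta_open."""
    slope, intercept = np.polyfit(open_rate, click_rate, deg=1)
    return slope * delta_open, slope, intercept

# Synthetic campaigns (hypothetical): true slope 0.08 plus small noise
rng = np.random.default_rng(42)
open_rate = rng.uniform(10, 40, size=500)
click_rate = 0.2 + 0.08 * open_rate + rng.normal(0, 0.3, size=500)

change, slope, intercept = expected_click_change(open_rate, click_rate)
print(f"slope = {slope:.3f}; +5 pts open rate -> {change:+.2f} pts click rate")
```

In a multiple regression the same reading applies to the open-rate coefficient, holding the other predictors fixed.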


3. Data Collection & Sampling

Source: Internal email marketing platform export (campaign-level metrics), ARM Investment Managers, Lagos, Nigeria.

Collection method: The dataset was exported directly from the organisation’s email campaign management platform as a CSV file. Each row represents one discrete campaign send event. The export was performed by the author in the capacity of Investment Research Team Lead with authorised access to the platform’s reporting module.

Variables collected: 21 variables covering campaign identification (campaign name), send timing (date and time), delivery metrics (emails sent, deliveries, delivery rate, bounces, bounce rate), engagement metrics (opens, open rate, clicks, click rate, clicks per unique open, total opens, total clicks), list health metrics (unsubscribes, unsubscribe rate, abuse reports, abuse report rate), and WhatsApp channel metrics (deliveries, delivery rate, total sends).

Sampling frame: All campaigns sent from the ARM Investment Managers email marketing account during the observation period. This is a census of all outbound email communication — there is no sampling; every campaign in the time window is included.

Sample size: 2,102 campaign send events — exceeding the minimum 100-observation threshold by a factor of 21.

Time period covered: 14 May 2024 to 14 May 2026 (24 months; 730 days, or roughly 104 weeks).

Ethical notes: The dataset contains no personally identifiable information (PII). All metrics are aggregated at the campaign level — no individual subscriber names, email addresses, or individual-level behavioural records are present. The data was extracted from the organisation’s proprietary systems in my capacity as team lead with appropriate access rights. No individual consent is required as no personal data is processed.

Data-sharing restrictions: Campaign names and aggregated performance metrics are used here. No individual client data, portfolio data, or commercially sensitive financial projections are included.

Data citation: Aregbesola, O. (2026). Email campaign performance metrics dataset [Dataset]. Collected from ARM Investment Managers Research Division, Lagos, Nigeria. Data available on request from the author.


4. Data Description

This section documents all 21 variables in the raw dataset — their names, data types, roles in the analysis, and distributions. As required by the assessment brief, variable descriptions are produced with code to ensure full reproducibility (Adi, 2026, Ch. 4).

4.1 Variable Inventory

Code
library(tidyverse)
library(knitr)
library(kableExtra)

var_inventory <- tibble(
  `Variable Name` = c(
    "Campaign", "Date Sent", "Email abuse report rate",
    "Email abuse reports", "Email bounce rate", "Email bounces",
    "Email click rate", "Email clicked",
    "Email clicks per unique opens (MPP excl.)",
    "Email deliveries", "Email delivery rate",
    "Email open rate (MPP excl.)", "Email opened (MPP excl.)",
    "Email total clicks", "Email total opens (MPP excl.)",
    "Email unsubscribe rate", "Email unsubscribes",
    "Emails sent", "WhatsApp deliveries",
    "WhatsApp delivery rate", "WhatsApp total sends"
  ),
  `Raw Type` = c(
    "String", "DateTime", "String (%)", "Integer",
    "String (%)", "String", "String (%)", "Integer",
    "String (%)", "String", "String (%)", "String (%)",
    "String", "String", "String", "String (%)", "Integer",
    "String", "Integer", "String (%)", "Integer"
  ),
  `Engineered Type` = c(
    "Factor (13 levels)", "Date + time parts", "Numeric (%)", "Integer",
    "Numeric (%)", "Numeric", "Numeric (%) — OUTCOME", "Integer",
    "Numeric (%)", "Numeric", "Numeric (%)", "Numeric (%)",
    "Numeric", "Numeric", "Numeric", "Numeric (%)", "Integer",
    "Numeric", "Integer", "Numeric (%)", "Integer"
  ),
  `Role in Analysis` = c(
    "Engineered into campaign_type predictor",
    "Engineered to date, year_month, day_of_week",
    "Exploratory — near-zero values",
    "Exploratory — negligible count",
    "Continuous predictor in regression",
    "Excluded — collinear with bounce_rate",
    "PRIMARY OUTCOME VARIABLE",
    "Excluded — collinear with click_rate",
    "Continuous predictor / correlation",
    "Excluded — collinear with emails_sent",
    "Excluded — collinear with bounce_rate",
    "Continuous predictor in regression",
    "Excluded — collinear with open_rate",
    "Excluded — collinear with click_rate",
    "Excluded — collinear with open_rate",
    "Exploratory only",
    "Exploratory only",
    "Continuous predictor in regression",
    "Excluded — all zeros (channel inactive)",
    "Excluded — all zeros (channel inactive)",
    "Excluded — all zeros (channel inactive)"
  )
)

kable(var_inventory,
      caption = "Table 1: Variable Inventory — All 21 Raw Variables") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE, font_size = 12) %>%
  column_spec(4, italic = TRUE) %>%
  row_spec(7, bold = TRUE, background = "#D6EAF8")
Table 1: Variable Inventory — All 21 Raw Variables
Variable Name | Raw Type | Engineered Type | Role in Analysis
Campaign | String | Factor (13 levels) | Engineered into campaign_type predictor
Date Sent | DateTime | Date + time parts | Engineered to date, year_month, day_of_week
Email abuse report rate | String (%) | Numeric (%) | Exploratory — near-zero values
Email abuse reports | Integer | Integer | Exploratory — negligible count
Email bounce rate | String (%) | Numeric (%) | Continuous predictor in regression
Email bounces | String | Numeric | Excluded — collinear with bounce_rate
Email click rate | String (%) | Numeric (%) — OUTCOME | PRIMARY OUTCOME VARIABLE
Email clicked | Integer | Integer | Excluded — collinear with click_rate
Email clicks per unique opens (MPP excl.) | String (%) | Numeric (%) | Continuous predictor / correlation
Email deliveries | String | Numeric | Excluded — collinear with emails_sent
Email delivery rate | String (%) | Numeric (%) | Excluded — collinear with bounce_rate
Email open rate (MPP excl.) | String (%) | Numeric (%) | Continuous predictor in regression
Email opened (MPP excl.) | String | Numeric | Excluded — collinear with open_rate
Email total clicks | String | Numeric | Excluded — collinear with click_rate
Email total opens (MPP excl.) | String | Numeric | Excluded — collinear with open_rate
Email unsubscribe rate | String (%) | Numeric (%) | Exploratory only
Email unsubscribes | Integer | Integer | Exploratory only
Emails sent | String | Numeric | Continuous predictor in regression
WhatsApp deliveries | Integer | Integer | Excluded — all zeros (channel inactive)
WhatsApp delivery rate | String (%) | Numeric (%) | Excluded — all zeros (channel inactive)
WhatsApp total sends | Integer | Integer | Excluded — all zeros (channel inactive)
Code
import pandas as pd

var_data = {
    'Variable Name': [
        'Campaign', 'Date Sent', 'Email abuse report rate',
        'Email abuse reports', 'Email bounce rate', 'Email bounces',
        'Email click rate', 'Email clicked',
        'Email clicks per unique opens (MPP excl.)',
        'Email deliveries', 'Email delivery rate',
        'Email open rate (MPP excl.)', 'Email opened (MPP excl.)',
        'Email total clicks', 'Email total opens (MPP excl.)',
        'Email unsubscribe rate', 'Email unsubscribes',
        'Emails sent', 'WhatsApp deliveries',
        'WhatsApp delivery rate', 'WhatsApp total sends'
    ],
    'Raw Type': [
        'String', 'DateTime', 'String (%)', 'Integer',
        'String (%)', 'String', 'String (%)', 'Integer',
        'String (%)', 'String', 'String (%)', 'String (%)',
        'String', 'String', 'String', 'String (%)', 'Integer',
        'String', 'Integer', 'String (%)', 'Integer'
    ],
    'Role': [
        'Engineered → campaign_type', 'Engineered → date parts',
        'Exploratory', 'Exploratory',
        'Continuous predictor', 'Excluded',
        '*** PRIMARY OUTCOME ***', 'Excluded',
        'Continuous predictor', 'Excluded',
        'Excluded (collinear: bounces)', 'Continuous predictor',
        'Excluded', 'Excluded', 'Excluded',
        'Exploratory', 'Exploratory',
        'Continuous predictor',
        'Excluded (all zeros)', 'Excluded (all zeros)', 'Excluded (all zeros)'
    ]
}

df_vars = pd.DataFrame(var_data)
print("Variable Inventory (21 raw variables):")
print(df_vars.to_string(index=False))
Variable Inventory (21 raw variables):
                            Variable Name   Raw Type                          Role
                                 Campaign     String    Engineered → campaign_type
                                Date Sent   DateTime       Engineered → date parts
                  Email abuse report rate String (%)                   Exploratory
                      Email abuse reports    Integer                   Exploratory
                        Email bounce rate String (%)          Continuous predictor
                            Email bounces     String                      Excluded
                         Email click rate String (%)       *** PRIMARY OUTCOME ***
                            Email clicked    Integer                      Excluded
Email clicks per unique opens (MPP excl.) String (%)          Continuous predictor
                         Email deliveries     String                      Excluded
                      Email delivery rate String (%) Excluded (collinear: bounces)
              Email open rate (MPP excl.) String (%)          Continuous predictor
                 Email opened (MPP excl.)     String                      Excluded
                       Email total clicks     String                      Excluded
            Email total opens (MPP excl.)     String                      Excluded
                   Email unsubscribe rate String (%)                   Exploratory
                       Email unsubscribes    Integer                   Exploratory
                              Emails sent     String          Continuous predictor
                      WhatsApp deliveries    Integer          Excluded (all zeros)
                   WhatsApp delivery rate String (%)          Excluded (all zeros)
                     WhatsApp total sends    Integer          Excluded (all zeros)
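
Because the inventory above is typed by hand, one option for keeping it in step with the export is to derive the raw-type column from the data itself. The sketch below is a minimal illustration under stated assumptions: the helper `infer_raw_type` and the small `sample` frame are hypothetical stand-ins, not part of the original analysis or the real export.

```python
import pandas as pd

def infer_raw_type(series: pd.Series) -> str:
    """Label a column roughly the way the inventory table does."""
    if pd.api.types.is_integer_dtype(series):
        return "Integer"
    if pd.api.types.is_datetime64_any_dtype(series):
        return "DateTime"
    values = series.dropna().astype(str)
    if len(values) > 0 and values.str.endswith("%").all():
        return "String (%)"
    return "String"

# Hypothetical three-row extract standing in for the platform export
sample = pd.DataFrame({
    "Campaign": ["Daily Market Update", "Earnings Note", "CPI Report"],
    "Date Sent": pd.to_datetime(
        ["2025-01-06 08:00", "2025-01-07 09:30", "2025-01-08 10:15"]),
    "Email click rate": ["1.2%", "3.4%", "0.7%"],
    "Email clicked": [70, 201, 41],
    "Emails sent": ["5,888", "5,901", "5,880"],
})

inventory = pd.DataFrame({
    "Variable Name": sample.columns,
    "Raw Type": [infer_raw_type(sample[c]) for c in sample.columns],
})
print(inventory.to_string(index=False))
```

The role column would still need to be written by hand, since it records analytical decisions rather than properties of the data.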

4.2 Engineered Variables

The following key variables were engineered from the raw dataset for use in analysis:

Code
eng_vars <- tibble(
  `Engineered Variable` = c(
    "campaign_type", "date", "year_month",
    "day_of_week", "hour_sent",
    "open_rate", "click_rate", "bounce_rate",
    "unsubscribe_rate", "clicks_per_open", "emails_sent"
  ),
  `Source` = c(
    "Campaign (text pattern matching)",
    "Date Sent", "Date Sent",
    "Date Sent", "Date Sent",
    "Email open rate (MPP excluded)",
    "Email click rate",
    "Email bounce rate",
    "Email unsubscribe rate",
    "Email clicks per unique opens (MPP excluded)",
    "Emails sent"
  ),
  `Type` = c(
    "Factor (13 levels)", "Date", "Date (month floor)",
    "Ordered factor (Mon–Sun)", "Integer (0–23)",
    "Numeric (%)", "Numeric (%)", "Numeric (%)",
    "Numeric (%)", "Numeric (%)", "Numeric"
  ),
  `Purpose` = c(
    "Primary categorical predictor — content type",
    "Time ordering for trend analysis",
    "Monthly aggregation for trend visualisation",
    "Send-timing predictor in regression",
    "Hour-of-day exploration",
    "Key engagement predictor",
    "Primary outcome variable",
    "List health predictor",
    "Churn signal (exploratory)",
    "Engagement efficiency metric",
    "List size predictor"
  )
)

kable(eng_vars,
      caption = "Table 2: Engineered Variables Used in Analysis") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE, font_size = 12) %>%
  row_spec(7, bold = TRUE, background = "#D6EAF8")
Table 2: Engineered Variables Used in Analysis
Engineered Variable | Source | Type | Purpose
campaign_type | Campaign (text pattern matching) | Factor (13 levels) | Primary categorical predictor — content type
date | Date Sent | Date | Time ordering for trend analysis
year_month | Date Sent | Date (month floor) | Monthly aggregation for trend visualisation
day_of_week | Date Sent | Ordered factor (Mon–Sun) | Send-timing predictor in regression
hour_sent | Date Sent | Integer (0–23) | Hour-of-day exploration
open_rate | Email open rate (MPP excluded) | Numeric (%) | Key engagement predictor
click_rate | Email click rate | Numeric (%) | Primary outcome variable
bounce_rate | Email bounce rate | Numeric (%) | List health predictor
unsubscribe_rate | Email unsubscribe rate | Numeric (%) | Churn signal (exploratory)
clicks_per_open | Email clicks per unique opens (MPP excluded) | Numeric (%) | Engagement efficiency metric
emails_sent | Emails sent | Numeric | List size predictor

5. Exploratory Data Analysis (Technique 1)

5.1 Theory Recap

Exploratory Data Analysis (EDA), formalised by John Tukey, is the process of systematically examining a dataset before applying inferential or predictive methods (Adi, 2026, Ch. 4). The core procedures include computing measures of central tendency (mean, median) and dispersion (standard deviation, IQR), visualising distributions through histograms and box plots, identifying and documenting missing values, and detecting outliers using the 1.5 × IQR fence rule. Adi (2026, Ch. 4) also introduces Anscombe’s Quartet to demonstrate that summary statistics alone can be misleading — visual inspection is indispensable. In practice, EDA is not a preliminary step to be rushed; it is the foundation on which the reliability of all subsequent analysis rests.
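
The 1.5 × IQR fence rule mentioned above can be sketched in a few lines. This is a minimal illustration on made-up click rates; the function names `iqr_fences` and `flag_outliers` and the sample values are assumptions for demonstration, and quartiles use NumPy's default linear interpolation, which can differ slightly from other quartile definitions.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return (lower, upper) Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    """Boolean mask marking observations outside the fences."""
    values = np.asarray(values, dtype=float)
    lower, upper = iqr_fences(values, k)
    return (values < lower) | (values > upper)

# Hypothetical click rates (%); the last value is deliberately extreme
click_rates = [0.4, 0.5, 0.6, 0.7, 0.7, 0.8, 0.9, 1.1, 1.3, 9.8]
lower, upper = iqr_fences(click_rates)
mask = flag_outliers(click_rates)
print(f"fences: [{lower:.4f}, {upper:.4f}]")
print("flagged:", [v for v, m in zip(click_rates, mask) if m])
```

A flagged value is a candidate for inspection, not automatically an error; in campaign data a high click rate may simply reflect an unusually relevant send.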

5.2 Business Justification

ARM’s email campaign data is exported from an operational platform in a raw format that requires careful parsing before any business conclusions can be drawn. Rate columns contain percentage signs, count columns contain comma separators, and the campaign name field requires systematic classification into content categories. Without rigorous EDA, structural data issues — such as the dormant WhatsApp channel discovered here — could silently distort downstream results. This EDA directly maps to my quarterly responsibility of reviewing campaign statistics to brief leadership.

5.3 Data Loading and Cleaning

Code
library(tidyverse)
library(lubridate)
library(knitr)
library(kableExtra)
library(scales)
library(moments)
library(patchwork)

# ── Helper functions ──────────────────────────────────────────────────────────
clean_pct <- function(x) as.numeric(gsub("%", "", as.character(x)))
clean_num <- function(x) as.numeric(gsub(",", "", as.character(x)))

# ── Load raw data ─────────────────────────────────────────────────────────────
df_raw <- read_csv("DATA_2026.csv", show_col_types = FALSE)

# ── Clean and engineer features ───────────────────────────────────────────────
df <- df_raw %>%
  rename(
    campaign            = Campaign,
    date_sent_raw       = `Date Sent`,
    bounce_rate_str     = `Email bounce rate`,
    bounces_str         = `Email bounces`,
    click_rate_str      = `Email click rate`,
    clicked             = `Email clicked`,
    clicks_per_open_str = `Email clicks per unique opens (MPP excluded)`,
    deliveries_str      = `Email deliveries`,
    delivery_rate_str   = `Email delivery rate`,
    open_rate_str       = `Email open rate (MPP excluded)`,
    opened_str          = `Email opened (MPP excluded)`,
    total_clicks_str    = `Email total clicks`,
    total_opens_str     = `Email total opens (MPP excluded)`,
    unsub_rate_str      = `Email unsubscribe rate`,
    unsubscribes        = `Email unsubscribes`,
    sent_str            = `Emails sent`,
    wa_deliveries       = `WhatsApp deliveries`,
    wa_total_sends      = `WhatsApp total sends`
  ) %>%
  mutate(
    date_sent        = ymd_hms(date_sent_raw),
    date             = as.Date(date_sent),
    year_month       = floor_date(date, "month"),
    day_of_week      = wday(date_sent, label = TRUE, abbr = TRUE, week_start = 1),  # levels run Mon–Sun, as documented in Table 2
    hour_sent        = hour(date_sent),
    bounce_rate      = clean_pct(bounce_rate_str),
    click_rate       = clean_pct(click_rate_str),
    clicks_per_open  = clean_pct(clicks_per_open_str),
    delivery_rate    = clean_pct(delivery_rate_str),
    open_rate        = clean_pct(open_rate_str),
    unsubscribe_rate = clean_pct(unsub_rate_str),
    deliveries       = clean_num(deliveries_str),
    opened           = clean_num(opened_str),
    total_clicks     = clean_num(total_clicks_str),
    total_opens      = clean_num(total_opens_str),
    emails_sent      = clean_num(sent_str),
    bounces          = clean_num(bounces_str),
    campaign_type = case_when(
      str_detect(tolower(campaign), "summary of news|news flash")             ~ "News Summary",
      str_detect(tolower(campaign), "price list|equities market snapshot")    ~ "Equities Snapshot",
      str_detect(tolower(campaign), "daily market update")                    ~ "Daily Market Update",
      str_detect(tolower(campaign), "fixed income")                           ~ "Fixed Income Update",
      str_detect(tolower(campaign), "weekly commentary|stock recommendation") ~ "Weekly Commentary",
      str_detect(tolower(campaign), "earnings|financial result|earnings note|earnings flash") ~ "Earnings & Results",
      str_detect(tolower(campaign), "cpi|gdp|mpc|monetary policy")            ~ "Macro Report",
      str_detect(tolower(campaign), "economic update|foreign trade|capital importation|ghana") ~ "Macro Report",
      str_detect(tolower(campaign), "bond|treasury bill|fgn savings")         ~ "Fixed Income Offer",
      str_detect(tolower(campaign), "rights issue|public offer|commercial paper|dual investment") ~ "Capital Market Offer",
      str_detect(tolower(campaign), "model equity portfolio|arm research")    ~ "Research Report",
      str_detect(tolower(campaign), "nugget|corporate action")                ~ "Market Intelligence",
      str_detect(tolower(campaign), "strategy|outlook|nsr|webinar")           ~ "Strategy & Research",
      TRUE                                                                     ~ "Other"
    ),
    campaign_type = factor(campaign_type)
  )

# ── Dataset summary ───────────────────────────────────────────────────────────
load_summary <- tibble(
  Metric = c("Total campaign records", "Variables (raw)", "Variables (engineered)",
             "Date range start", "Date range end", "Observation period"),
  Value  = c(
    format(nrow(df), big.mark = ","),
    ncol(df_raw),
    "11 key variables retained",
    format(min(df$date)),
    format(max(df$date)),
    "24 months (104 weeks)"
  )
)

kable(load_summary, caption = "Table 3: Dataset Load Summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE, font_size = 12)
Table 3: Dataset Load Summary
Metric | Value
Total campaign records | 2,102
Variables (raw) | 21
Variables (engineered) | 11 key variables retained
Date range start | 2024-05-14
Date range end | 2026-05-14
Observation period | 24 months (104 weeks)
Code
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

df_raw = pd.read_csv("DATA_2026.csv")

def clean_pct(s):
    return pd.to_numeric(s.astype(str).str.replace('%', '', regex=False), errors='coerce')

def clean_num(s):
    return pd.to_numeric(s.astype(str).str.replace(',', '', regex=False), errors='coerce')

df = df_raw.rename(columns={
    'Campaign': 'campaign', 'Date Sent': 'date_sent_raw',
    'Email bounce rate': 'bounce_rate_str',
    'Email click rate': 'click_rate_str',
    'Email clicks per unique opens (MPP excluded)': 'clicks_per_open_str',
    'Email deliveries': 'deliveries_str',
    'Email delivery rate': 'delivery_rate_str',
    'Email open rate (MPP excluded)': 'open_rate_str',
    'Email opened (MPP excluded)': 'opened_str',
    'Email unsubscribe rate': 'unsubscribe_rate_str',
    'Emails sent': 'sent_str',
    'WhatsApp deliveries': 'wa_deliveries',
    'WhatsApp total sends': 'wa_total_sends'
})

df['date_sent']        = pd.to_datetime(df['date_sent_raw'])
df['date']             = df['date_sent'].dt.date
df['year_month']       = df['date_sent'].dt.to_period('M')
df['day_of_week']      = df['date_sent'].dt.day_name()
df['hour_sent']        = df['date_sent'].dt.hour
df['bounce_rate']      = clean_pct(df['bounce_rate_str'])
df['click_rate']       = clean_pct(df['click_rate_str'])
df['clicks_per_open']  = clean_pct(df['clicks_per_open_str'])
df['delivery_rate']    = clean_pct(df['delivery_rate_str'])
df['open_rate']        = clean_pct(df['open_rate_str'])
df['unsubscribe_rate'] = clean_pct(df['unsubscribe_rate_str'])
df['deliveries']       = clean_num(df['deliveries_str'])
df['opened']           = clean_num(df['opened_str'])
df['emails_sent']      = clean_num(df['sent_str'])

def classify_campaign(name):
    n = str(name).lower()
    if any(x in n for x in ['summary of news', 'news flash']):          return 'News Summary'
    if any(x in n for x in ['price list', 'equities market snapshot']): return 'Equities Snapshot'
    if 'daily market update' in n:                                       return 'Daily Market Update'
    if 'fixed income' in n:                                              return 'Fixed Income Update'
    if any(x in n for x in ['weekly commentary', 'stock recommendation']): return 'Weekly Commentary'
    if any(x in n for x in ['earnings', 'financial result', 'earnings note', 'earnings flash']): return 'Earnings & Results'
    if any(x in n for x in ['cpi', 'gdp', 'mpc', 'monetary policy']):   return 'Macro Report'
    if any(x in n for x in ['economic update', 'foreign trade', 'capital importation', 'ghana']): return 'Macro Report'
    if any(x in n for x in ['bond', 'treasury bill', 'fgn savings']):   return 'Fixed Income Offer'
    if any(x in n for x in ['rights issue', 'public offer', 'commercial paper', 'dual investment']): return 'Capital Market Offer'
    if any(x in n for x in ['model equity portfolio', 'arm research']): return 'Research Report'
    if any(x in n for x in ['nugget', 'corporate action']):              return 'Market Intelligence'
    if any(x in n for x in ['strategy', 'outlook', 'nsr', 'webinar']):  return 'Strategy & Research'
    return 'Other'

df['campaign_type'] = df['campaign'].apply(classify_campaign)

summary_tbl = pd.DataFrame({
    'Metric': ['Total records', 'Raw variables', 'Date range', 'Observation period'],
    'Value':  [f"{len(df):,}", df_raw.shape[1],
               f"{df['date_sent'].min().date()} to {df['date_sent'].max().date()}",
               '24 months / 104 weeks']
})
print(summary_tbl.to_string(index=False))
            Metric                    Value
     Total records                    2,102
     Raw variables                       21
        Date range 2024-05-14 to 2026-05-14
Observation period    24 months / 104 weeks
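
Both classifiers above are first-match-wins: when a campaign name contains keywords from more than one rule, the rule listed first decides the category. The sketch below makes that behaviour visible on a subset of the rules. The names `RULES`, `classify`, and `matching_labels`, and all three campaign names, are hypothetical and chosen only to illustrate the ordering effect.

```python
# Ordered (keywords, label) rules: the first rule with a matching keyword wins.
# Only a subset of the report's rules is reproduced here, in the same order.
RULES = [
    (("fixed income",), "Fixed Income Update"),
    (("earnings", "financial result"), "Earnings & Results"),
    (("bond", "treasury bill", "fgn savings"), "Fixed Income Offer"),
    (("strategy", "outlook", "nsr", "webinar"), "Strategy & Research"),
]

def classify(name, rules=RULES):
    """Return the label of the first rule whose keyword occurs in the name."""
    lowered = str(name).lower()
    for keywords, label in rules:
        if any(k in lowered for k in keywords):
            return label
    return "Other"

def matching_labels(name, rules=RULES):
    """List every label whose rule matches, to expose ambiguous names."""
    lowered = str(name).lower()
    return [label for keywords, label in rules
            if any(k in lowered for k in keywords)]

# Hypothetical campaign names, not taken from the dataset
examples = [
    "Fixed Income Weekly: FGN Bond Auction Results",
    "Q1 Earnings Outlook",
    "Office Closure Notice",
]
for name in examples:
    print(f"{name!r}: {classify(name)} (all matches: {matching_labels(name)})")
```

Listing every matching label for each name is a cheap way to audit how many campaigns depend on rule order before trusting the category counts.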

5.4 Summary Statistics

Code
key_vars <- df %>%
  select(open_rate, click_rate, bounce_rate,
         unsubscribe_rate, clicks_per_open, emails_sent)

summary_tbl <- key_vars %>%
  summarise(across(everything(), list(
    Mean   = ~round(mean(.x, na.rm = TRUE), 3),
    Median = ~round(median(.x, na.rm = TRUE), 3),
    SD     = ~round(sd(.x, na.rm = TRUE), 3),
    Min    = ~round(min(.x, na.rm = TRUE), 3),
    Max    = ~round(max(.x, na.rm = TRUE), 3),
    Skew   = ~round(skewness(.x, na.rm = TRUE), 3)
  ), .names = "{.col}__{.fn}")) %>%
  pivot_longer(everything(), names_to = c("Variable", "Stat"),
               names_sep = "__") %>%
  pivot_wider(names_from = Stat, values_from = value)

kable(summary_tbl,
      caption = "Table 4: Summary Statistics — Key Campaign Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 12)
Table 4: Summary Statistics — Key Campaign Metrics
Variable Mean Median SD Min Max Skew
open_rate 22.741 24.50 6.893 0 48.20 0.052
click_rate 1.298 0.70 1.452 0 9.80 1.860
bounce_rate 1.253 0.02 6.470 0 41.40 5.130
unsubscribe_rate 0.009 0.00 0.019 0 0.18 3.054
clicks_per_open 5.717 3.65 5.667 0 26.00 1.271
emails_sent 4715.661 5888.00 2221.636 0 19701.00 2.160
Code
from scipy.stats import skew

numeric_cols = ['open_rate', 'click_rate', 'bounce_rate',
                'unsubscribe_rate', 'clicks_per_open', 'emails_sent']

summary = df[numeric_cols].describe().T[['mean','50%','std','min','max']].round(3)
summary.columns = ['Mean', 'Median', 'SD', 'Min', 'Max']
summary['Skew'] = df[numeric_cols].apply(lambda c: round(skew(c.dropna()), 3))
summary.index.name = 'Variable'

print("Table 4 (Python): Summary Statistics — Key Campaign Metrics")
print(summary.to_string())
Table 4 (Python): Summary Statistics — Key Campaign Metrics
                      Mean   Median        SD  Min       Max   Skew
Variable                                                           
open_rate           22.741    24.50     6.893  0.0     48.20  0.052
click_rate           1.298     0.70     1.452  0.0      9.80  1.860
bounce_rate          1.253     0.02     6.470  0.0     41.40  5.130
unsubscribe_rate     0.009     0.00     0.019  0.0      0.18  3.054
clicks_per_open      5.717     3.65     5.667  0.0     26.00  1.271
emails_sent       4715.661  5888.00  2221.636  0.0  19701.00  2.160
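
The Skew columns in the R and Python tables agree because `moments::skewness` and `scipy.stats.skew` (with its default `bias=True`) both return the moment-based coefficient g1 = m3 / m2^1.5 computed with divide-by-n central moments. The sketch below reproduces that formula by hand on made-up values; `moment_skewness` and the sample `rates` are illustrative assumptions, not figures from the dataset.

```python
import numpy as np
from scipy.stats import skew

def moment_skewness(values):
    """g1 = m3 / m2**1.5, using population (divide-by-n) central moments."""
    x = np.asarray(values, dtype=float)
    centred = x - x.mean()
    m2 = np.mean(centred ** 2)
    m3 = np.mean(centred ** 3)
    return m3 / m2 ** 1.5

# Hypothetical right-skewed click rates (%), not taken from the dataset
rates = np.array([0.2, 0.4, 0.5, 0.7, 0.7, 0.9, 1.2, 2.5, 4.0, 9.8])
print(f"manual g1  : {moment_skewness(rates):.4f}")
print(f"scipy skew : {skew(rates):.4f}")
print(f"mean {rates.mean():.2f} vs median {np.median(rates):.2f}")
```

For right-skewed metrics such as click rate and bounce rate, the mean sits above the median, so the median is usually the safer "typical campaign" figure to report.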

5.5 Data Quality Analysis

Code
# ── Missing values ────────────────────────────────────────────────────────────
missing_tbl <- df %>%
  select(open_rate, click_rate, bounce_rate, unsubscribe_rate,
         clicks_per_open, emails_sent, wa_deliveries, wa_total_sends) %>%
  summarise(across(everything(),
    list(N_Missing   = ~sum(is.na(.x)),
         Pct_Missing = ~round(mean(is.na(.x)) * 100, 2)),
    # Double-underscore separator so names_sep = "__" below can split the names
    .names = "{.col}__{.fn}")) %>%
  pivot_longer(everything(), names_to = c("Variable", "Stat"),
               names_sep = "__") %>%
  pivot_wider(names_from = Stat, values_from = value)

kable(missing_tbl, caption = "Table 5: Missing Value Analysis") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE, font_size = 12)
Table 5: Missing Value Analysis
Variable N_Missing Pct_Missing
open_rate 0 0
click_rate 0 0
bounce_rate 0 0
unsubscribe_rate 0 0
clicks_per_open 0 0
emails_sent 0 0
wa_deliveries 0 0
wa_total_sends 0 0
Code
# ── Data quality issues ───────────────────────────────────────────────────────
dq_tbl <- tibble(
  Issue = c("WhatsApp channel metrics — all zeros",
            "Email delivery rate — near-constant"),
  Finding = c(
    paste0("wa_total_sends all zero: ", all(df$wa_total_sends == 0),
           " | wa_deliveries all zero: ", all(df$wa_deliveries == 0)),
    paste0("Mean: ", round(mean(df$delivery_rate, na.rm=TRUE), 4),
           "% | SD: ", round(sd(df$delivery_rate, na.rm=TRUE), 6), "%")
  ),
  Resolution = c(
    "WhatsApp channel was not operational during May 2024–May 2026. All three WhatsApp columns excluded from analysis.",
    "Near-zero variance offers no explanatory power. Excluded as a predictor in all models."
  )
)

kable(dq_tbl, caption = "Table 6: Data Quality Issues Identified and Resolved") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = TRUE, font_size = 12) %>%
  column_spec(3, italic = TRUE)
Table 6: Data Quality Issues Identified and Resolved
Issue Finding Resolution
WhatsApp channel metrics — all zeros wa_total_sends all zero: TRUE | wa_deliveries all zero: TRUE WhatsApp channel was not operational during May 2024–May 2026. All three WhatsApp columns excluded from analysis.
Email delivery rate — near-constant Mean: 98.7082% | SD: 6.820855% Near-zero variance offers no explanatory power. Excluded as a predictor in all models.
Code
check_cols = ['open_rate', 'click_rate', 'bounce_rate', 'unsubscribe_rate',
              'clicks_per_open', 'emails_sent', 'wa_deliveries', 'wa_total_sends']

missing = df[check_cols].isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({'Missing Count': missing, 'Missing (%)': missing_pct})
missing_df.index.name = 'Variable'

print("Table 5 (Python): Missing Value Analysis")
Table 5 (Python): Missing Value Analysis
Code
print(missing_df.to_string())
                  Missing Count  Missing (%)
Variable                                    
open_rate                     0          0.0
click_rate                    0          0.0
bounce_rate                   0          0.0
unsubscribe_rate              0          0.0
clicks_per_open               0          0.0
emails_sent                   0          0.0
wa_deliveries                 0          0.0
wa_total_sends                0          0.0
Code
print("\nData Quality Issue 1 — WhatsApp Channel:")

Data Quality Issue 1 — WhatsApp Channel:
Code
print(f"  All wa_total_sends == 0: {(df['wa_total_sends'] == 0).all()}")
  All wa_total_sends == 0: True
Code
print("  Resolution: Channel inactive. Columns excluded from all analyses.")
  Resolution: Channel inactive. Columns excluded from all analyses.
Code
print("\nData Quality Issue 2 — Email Delivery Rate:")

Data Quality Issue 2 — Email Delivery Rate:
Code
print(f"  Mean: {df['delivery_rate'].mean():.4f}%  |  SD: {df['delivery_rate'].std():.6f}%")
  Mean: 98.7082%  |  SD: 6.820855%
Code
print("  Resolution: Near-zero variance. Excluded as predictor.")
  Resolution: Near-zero variance. Excluded as predictor.

5.6 Outlier Detection

Code
# ── IQR outlier summary table ─────────────────────────────────────────────────
detect_outliers_r <- function(x, var_name) {
  q1 <- quantile(x, 0.25, na.rm = TRUE)
  q3 <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  lower <- q1 - 1.5 * iqr
  upper <- q3 + 1.5 * iqr
  n_out <- sum(x < lower | x > upper, na.rm = TRUE)
  tibble(
    Variable     = var_name,
    Q1           = round(q1, 3),
    Q3           = round(q3, 3),
    Lower_Fence  = round(lower, 3),
    Upper_Fence  = round(upper, 3),
    N_Outliers   = n_out,
    Pct_Outliers = round(n_out / length(x) * 100, 2)
  )
}

outlier_tbl <- bind_rows(
  detect_outliers_r(df$open_rate,        "open_rate"),
  detect_outliers_r(df$click_rate,       "click_rate"),
  detect_outliers_r(df$bounce_rate,      "bounce_rate"),
  detect_outliers_r(df$unsubscribe_rate, "unsubscribe_rate")
)

kable(outlier_tbl, caption = "Table 7: Outlier Detection — IQR Method") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE, font_size = 12)
Table 7: Outlier Detection — IQR Method
Variable Q1 Q3 Lower_Fence Upper_Fence N_Outliers Pct_Outliers
open_rate 15.80 28.50 -3.25 47.55 1 0.05
click_rate 0.36 1.60 -1.50 3.46 213 10.13
bounce_rate 0.00 0.04 -0.06 0.10 204 9.71
unsubscribe_rate 0.00 0.02 -0.03 0.05 57 2.71
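
The fences in Table 7 follow directly from the quartiles. As a quick cross-check, the minimal Python sketch below recomputes them from the rounded Q1 and Q3 values printed in the table, not from the raw campaign data. The helper name `iqr_fences` and the dictionary `table7_quartiles` are hypothetical names introduced for this illustration.

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) IQR fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Rounded quartiles copied from Table 7 (assumption: not recomputed from raw data)
table7_quartiles = {
    "open_rate": (15.80, 28.50),
    "click_rate": (0.36, 1.60),
    "bounce_rate": (0.00, 0.04),
    "unsubscribe_rate": (0.00, 0.02),
}

# Round to two decimals to match the precision shown in the table
fences = {
    name: tuple(round(v, 2) for v in iqr_fences(q1, q3))
    for name, (q1, q3) in table7_quartiles.items()
}

for name, (lower, upper) in fences.items():
    print(f"{name:<17} lower={lower:>6.2f}  upper={upper:>6.2f}")
```

Every lower fence is negative, and a rate cannot fall below zero, so in this dataset only the upper fence can flag a campaign as an outlier.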
Code
# ── Box plots ─────────────────────────────────────────────────────────────────
p_b1 <- ggplot(df, aes(y = open_rate)) +
  geom_boxplot(fill = "#2C6FAC", alpha = 0.7,
               outlier.colour = "red", outlier.alpha = 0.4, outlier.size = 1) +
  labs(title = "Open Rate (%)", y = NULL) + theme_minimal(base_size = 10)

p_b2 <- ggplot(df, aes(y = click_rate)) +
  geom_boxplot(fill = "#E8741A", alpha = 0.7,
               outlier.colour = "red", outlier.alpha = 0.4, outlier.size = 1) +
  labs(title = "Click Rate (%)", y = NULL) + theme_minimal(base_size = 10)

p_b3 <- ggplot(df, aes(y = bounce_rate)) +
  geom_boxplot(fill = "#27AE60", alpha = 0.7,
               outlier.colour = "red", outlier.alpha = 0.4, outlier.size = 1) +
  labs(title = "Bounce Rate (%)", y = NULL) + theme_minimal(base_size = 10)

p_b4 <- ggplot(df, aes(y = unsubscribe_rate)) +
  geom_boxplot(fill = "#8E44AD", alpha = 0.7,
               outlier.colour = "red", outlier.alpha = 0.4, outlier.size = 1) +
  labs(title = "Unsubscribe Rate (%)", y = NULL) + theme_minimal(base_size = 10)

(p_b1 + p_b2 + p_b3 + p_b4) +
  plot_annotation(
    title    = "Figure 1: Outlier Detection — Box Plots for Key Rate Variables",
    subtitle = "Red dots indicate values beyond 1.5 × IQR fence"
  )

Code
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 4, figsize=(12, 5))
cols   = ['open_rate', 'click_rate', 'bounce_rate', 'unsubscribe_rate']
colors = ['#2C6FAC', '#E8741A', '#27AE60', '#8E44AD']
titles = ['Open Rate (%)', 'Click Rate (%)', 'Bounce Rate (%)', 'Unsubscribe Rate (%)']

for ax, col, color, title in zip(axes, cols, colors, titles):
    bp = ax.boxplot(df[col].dropna(), patch_artist=True,
                    flierprops=dict(marker='.', markerfacecolor='red',
                                   alpha=0.4, markersize=4))
    bp['boxes'][0].set_facecolor(color)
    bp['boxes'][0].set_alpha(0.7)
    ax.set_title(title, fontsize=10)
    ax.set_xticks([])
Code
plt.suptitle("Figure 1 (Python): Outlier Detection — Key Rate Variables",
             fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()

Plain-language interpretation for management: The EDA confirms that ARM’s campaign data is complete: none of the email metrics has missing values. Two structural issues were identified and resolved before analysis. First, the WhatsApp distribution channel shows no activity during the 24-month period, which represents an untapped communication opportunity. Second, email delivery rates average 98.7%, so for most campaigns the performance story sits on the engagement side (how many people open and click) rather than on whether emails arrive. The reported standard deviation of 6.8 percentage points does imply that a minority of campaigns delivered well below that average. Click rate, bounce rate and unsubscribe rate all contain extreme high-value campaigns (shown as red dots; roughly 10%, 10% and 3% of campaigns respectively under the IQR rule in Table 7). These are treated as genuine high-performing or low-quality events, not data errors.


6. Data Visualisation (Technique 2)

6.1 Theory Recap

Data visualisation is the graphical representation of information to support pattern recognition and decision-making (Adi, 2026, Ch. 5). The grammar of graphics framework, formalised by Wilkinson and implemented in R’s ggplot2 (Wickham, 2016), decomposes every chart into composable layers: data, aesthetic mappings, geometric objects, statistical transformations, scales, and themes. Effective chart selection is driven by the data structure — time-series data calls for line charts, distributions call for histograms or box plots, relationships call for scatter plots, and comparisons across categories call for bar charts. Adi (2026, Ch. 5) emphasises that visualisation should tell a deliberate story, not merely display data: each chart must answer a specific business question.

6.2 Business Justification

ARM management receives quarterly briefings on campaign performance. The five plots below form a coherent narrative around three questions: how has engagement evolved over 24 months, which content types deliver the best return, and which patterns justify editorial reallocation? These visualisations are directly usable in management presentations without further processing.

Code
# ── Monthly trend ─────────────────────────────────────────────────────────────
monthly_trend <- df %>%
  group_by(year_month) %>%
  summarise(avg_open  = mean(open_rate,  na.rm = TRUE),
            avg_click = mean(click_rate, na.rm = TRUE),
            .groups   = "drop")

scale_factor <- max(monthly_trend$avg_open, na.rm=TRUE) /
                max(monthly_trend$avg_click, na.rm=TRUE)

p1 <- ggplot(monthly_trend, aes(x = year_month)) +
  geom_line(aes(y = avg_open,  colour = "Open Rate"), linewidth = 1.2) +
  geom_line(aes(y = avg_click * scale_factor, colour = "Click Rate (scaled)"),
            linewidth = 1.1, linetype = "dashed") +
  geom_point(aes(y = avg_open), colour = "#2C6FAC", size = 1.5) +
  scale_y_continuous(
    name      = "Avg Open Rate (%)",
    sec.axis  = sec_axis(~ . / scale_factor, name = "Avg Click Rate (%)")
  ) +
  scale_colour_manual(
    values = c("Open Rate" = "#2C6FAC", "Click Rate (scaled)" = "#E8741A")) +
  labs(title   = "Plot 1: Monthly Email Engagement Trend",
       subtitle = "May 2024 – May 2026 | Dual axis: open rate (left), click rate (right)",
       x = "Month", colour = "Metric") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1))

# ── Avg click rate by campaign type ──────────────────────────────────────────
type_stats <- df %>%
  group_by(campaign_type) %>%
  summarise(avg_click = mean(click_rate, na.rm = TRUE),
            n         = n(), .groups = "drop") %>%
  filter(n >= 10)

p2 <- ggplot(type_stats,
             aes(x = reorder(campaign_type, avg_click),
                 y = avg_click, fill = avg_click)) +
  geom_col() +
  geom_text(aes(label = paste0(round(avg_click, 2), "%")),
            hjust = -0.1, size = 3) +
  coord_flip() +
  scale_fill_gradient(low = "#AED6F1", high = "#1B4F72") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(title    = "Plot 2: Average Click Rate by Campaign Type",
       subtitle = "Campaign types with ≥10 campaigns",
       x = NULL, y = "Average Click Rate (%)") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none")

# ── Distribution of open rate ─────────────────────────────────────────────────
mean_open <- mean(df$open_rate, na.rm = TRUE)

p3 <- ggplot(df, aes(x = open_rate)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 45, fill = "#2C6FAC", alpha = 0.65, colour = "white") +
  geom_density(colour = "#1B4F72", linewidth = 1.1) +
  geom_vline(xintercept = mean_open,
             colour = "red", linetype = "dashed", linewidth = 0.9) +
  annotate("text", x = mean_open + 3, y = 0.065,
           label = paste0("Mean: ", round(mean_open, 1), "%"),
           colour = "red", size = 3.5) +
  labs(title    = "Plot 3: Distribution of Email Open Rate",
       subtitle = "All 2,102 campaigns",
       x = "Open Rate (%)", y = "Density") +
  theme_minimal(base_size = 11)

# ── Scatter: open vs click rate ───────────────────────────────────────────────
top_types <- df %>% count(campaign_type) %>%
  filter(n >= 50) %>% pull(campaign_type)

p4 <- df %>%
  filter(campaign_type %in% top_types) %>%
  ggplot(aes(x = open_rate, y = click_rate, colour = campaign_type)) +
  geom_point(alpha = 0.35, size = 1.0) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.9) +
  labs(title    = "Plot 4: Open Rate vs Click Rate by Campaign Type",
       subtitle = "Types with ≥50 campaigns | Regression lines overlaid",
       x = "Open Rate (%)", y = "Click Rate (%)", colour = "Type") +
  theme_minimal(base_size = 11) +
  theme(legend.position  = "bottom",
        legend.text      = element_text(size = 8)) +
  guides(colour = guide_legend(nrow = 2,
                               override.aes = list(size = 3, alpha = 1)))

# ── Campaign volume ───────────────────────────────────────────────────────────
p5 <- df %>%
  count(campaign_type) %>%
  ggplot(aes(x = reorder(campaign_type, n), y = n, fill = campaign_type)) +
  geom_col() +
  geom_text(aes(label = n), hjust = -0.1, size = 3) +
  coord_flip() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.12))) +
  labs(title    = "Plot 5: Campaign Volume by Type",
       subtitle = "Campaigns per type, May 2024 – May 2026",
       x = NULL, y = "Number of Campaigns") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none")

# ── Assemble dashboard ────────────────────────────────────────────────────────
(p1 / (p2 + p3) / (p4 + p5)) +
  plot_annotation(
    title    = "Figure 2: ARM Investment Managers — Email Campaign Performance Dashboard",
    subtitle = "May 2024 – May 2026 | 2,102 campaigns across 13 content categories",
    theme    = theme(plot.title    = element_text(size = 14, face = "bold"),
                     plot.subtitle = element_text(size = 11))
  )

Code
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
import numpy as np

monthly = (df.groupby('year_month')
             .agg(avg_open=('open_rate', 'mean'),
                  avg_click=('click_rate', 'mean'))
             .reset_index())
monthly['ym_str'] = monthly['year_month'].astype(str)

type_stats = (df.groupby('campaign_type')
                .agg(avg_click=('click_rate', 'mean'), n=('click_rate', 'count'))
                .reset_index().query('n >= 10').sort_values('avg_click'))

fig = plt.figure(figsize=(14, 18))
gs  = gridspec.GridSpec(3, 2, figure=fig, hspace=0.45, wspace=0.35)

# Plot 1 — Monthly trend
ax1  = fig.add_subplot(gs[0, :])
ax1r = ax1.twinx()
ax1.plot(monthly['ym_str'], monthly['avg_open'],
         color='#2C6FAC', linewidth=2, label='Open Rate', marker='o', markersize=3)
ax1r.plot(monthly['ym_str'], monthly['avg_click'],
          color='#E8741A', linewidth=2, linestyle='--', label='Click Rate')
ax1.set_ylabel('Avg Open Rate (%)', color='#2C6FAC')
ax1r.set_ylabel('Avg Click Rate (%)', color='#E8741A')
ax1.set_title('Plot 1: Monthly Email Engagement Trend', fontweight='bold', fontsize=12)
ax1.tick_params(axis='x', rotation=45)
lines1, labs1 = ax1.get_legend_handles_labels()
lines2, labs2 = ax1r.get_legend_handles_labels()
ax1.legend(lines1 + lines2, labs1 + labs2, loc='upper right')

# Plot 2 — Click rate by type
ax2    = fig.add_subplot(gs[1, 0])
colors_bar = plt.cm.Blues(np.linspace(0.35, 0.85, len(type_stats)))
ax2.barh(type_stats['campaign_type'], type_stats['avg_click'], color=colors_bar)
for i, val in enumerate(type_stats['avg_click']):
    ax2.text(val + 0.005, i, f"{val:.2f}%", va='center', fontsize=8)
ax2.set_title('Plot 2: Avg Click Rate by Campaign Type', fontweight='bold', fontsize=11)
ax2.set_xlabel('Avg Click Rate (%)')

# Plot 3 — Distribution of open rate
ax3       = fig.add_subplot(gs[1, 1])
open_data = df['open_rate'].dropna()
ax3.hist(open_data, bins=45, density=True, color='#2C6FAC', alpha=0.65, edgecolor='white')
from scipy.stats import gaussian_kde
kde = gaussian_kde(open_data)
x_range = np.linspace(open_data.min(), open_data.max(), 300)
ax3.plot(x_range, kde(x_range), color='#1B4F72', linewidth=2)
ax3.axvline(open_data.mean(), color='red', linestyle='--', linewidth=1.5,
            label=f"Mean: {open_data.mean():.1f}%")
ax3.set_title('Plot 3: Distribution of Open Rate', fontweight='bold', fontsize=11)
ax3.set_xlabel('Open Rate (%)')
ax3.set_ylabel('Density')
ax3.legend()

# Plot 4 — Scatter
ax4 = fig.add_subplot(gs[2, 0])
top_types_py = (df['campaign_type'].value_counts()
                  .loc[lambda x: x >= 50].index.tolist())
palette = sns.color_palette("tab10", len(top_types_py))
for i, ct in enumerate(top_types_py):
    sub = df[df['campaign_type'] == ct][['open_rate', 'click_rate']].dropna()
    ax4.scatter(sub['open_rate'], sub['click_rate'],
                alpha=0.3, s=8, color=palette[i], label=ct)
    if len(sub) > 2:
        m, b = np.polyfit(sub['open_rate'], sub['click_rate'], 1)
        xs = np.array([sub['open_rate'].min(), sub['open_rate'].max()])
        ax4.plot(xs, m * xs + b, color=palette[i], linewidth=1.2)
ax4.set_title('Plot 4: Open Rate vs Click Rate', fontweight='bold', fontsize=11)
ax4.set_xlabel('Open Rate (%)')
ax4.set_ylabel('Click Rate (%)')
ax4.legend(fontsize=7)

# Plot 5 — Volume
ax5 = fig.add_subplot(gs[2, 1])
cnt = df['campaign_type'].value_counts().sort_values()
ax5.barh(cnt.index, cnt.values, color=sns.color_palette("tab10", len(cnt)))
for i, v in enumerate(cnt.values):
    ax5.text(v + 3, i, str(v), va='center', fontsize=8)
ax5.set_title('Plot 5: Campaign Count by Type', fontweight='bold', fontsize=11)
ax5.set_xlabel('Number of Campaigns')

plt.suptitle("Figure 2 (Python): ARM Campaign Performance Dashboard\nMay 2024–May 2026",
             fontsize=13, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

Plain-language interpretation for management: The five charts together tell one story. Plot 1 shows how open and click rates have moved month by month; this is the trend line management should track. Plot 2 is the most actionable: Earnings & Results and Weekly Commentary campaigns achieve the highest average click rates, while News Summaries and Daily Market Updates, which we send most frequently (Plot 5), sit at the lower end. Plot 3 shows an open rate distribution that is roughly bell-shaped around the 27% mean. Plot 4 shows that, for the higher-volume campaign types plotted (those with at least 50 campaigns), campaigns with higher open rates also tend to have higher click rates. This is consistent with subject-line quality being a valuable lever, although the chart shows association rather than cause.


7. Hypothesis Testing (Technique 3)

7.1 Theory Recap

Hypothesis testing provides a formal framework for determining whether observed differences in data are statistically significant or attributable to random variation (Adi, 2026, Ch. 6). A null hypothesis (H₀) posits no effect; an alternative hypothesis (H₁) posits a meaningful difference. The p-value measures the probability of observing results at least as extreme as the data, assuming H₀ is true. When p < α (typically 0.05), H₀ is rejected. Adi (2026, Ch. 6) covers parametric tests (t-test, ANOVA) and non-parametric alternatives. The Kruskal-Wallis test, used here, is the non-parametric equivalent of one-way ANOVA — it tests whether samples originate from the same distribution without requiring normality. Effect size (eta-squared, η²) complements p-values by quantifying practical significance: η² > 0.06 indicates a medium effect; η² > 0.14 indicates a large effect (Adi, 2026, Ch. 6).
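
To make the Kruskal-Wallis statistic concrete, the sketch below computes H by hand on a toy dataset: pool the observations, rank them, and compare each group's rank sum with what equal distributions would produce. It is a minimal illustration using only the Python standard library, not the implementation used in the study. The function names `average_ranks` and `kruskal_h` and the toy values are hypothetical, and the tie correction that library routines apply is deliberately omitted.

```python
from itertools import chain


def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def kruskal_h(groups):
    """Kruskal-Wallis H statistic (assumption: no tie correction applied)."""
    pooled = list(chain.from_iterable(groups))
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    total, offset = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[offset:offset + len(g)])
        total += rank_sum ** 2 / len(g)
        offset += len(g)
    return 12.0 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)


# Hypothetical values for three made-up campaign types (illustration only)
toy_groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
h_toy = kruskal_h(toy_groups)
print(f"H = {h_toy:.2f}")  # rank sums 6, 15 and 24 give H = 7.20
```

The three toy groups do not overlap at all, so the rank sums are as unequal as they can be and H is large relative to its two degrees of freedom. Identical groups would give H = 0.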

7.2 Business Justification

Two hypotheses are tested. These directly address the operational question of whether ARM should differentiate its editorial strategy by campaign type. If engagement does not differ significantly across content types, a uniform strategy is justified. If it does differ significantly, differential investment is warranted.

7.3 Hypothesis 1 — Open Rate Differences Across Campaign Types

H₀: Median email open rates are equal across all campaign types
H₁: At least one campaign type has a significantly different median open rate
Significance level: α = 0.05

Code
library(rstatix)

# ── Assumption check ──────────────────────────────────────────────────────────
levene_res <- df %>% levene_test(open_rate ~ campaign_type)
assump_tbl <- tibble(
  Check      = c("Skewness of open_rate",
                 "Levene's test p-value",
                 "Test selected"),
  Result     = c(round(skewness(df$open_rate, na.rm = TRUE), 3),
                 round(levene_res$p, 4),
                 "Kruskal-Wallis (non-parametric)"),
  Conclusion = c("Mild right skew; normality not assumed",
                 ifelse(levene_res$p < 0.05, "Variances unequal — HOV violated",
                        "Variances equal"),
                 "Appropriate given large n and skew")
)

kable(assump_tbl, caption = "Table 8: Assumption Checks — Hypothesis 1") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE, font_size = 12)
Table 8: Assumption Checks — Hypothesis 1
Check Result Conclusion
Skewness of open_rate 0.052 Mild right skew; normality not assumed
Levene's test p-value 1e-04 Variances unequal — HOV violated
Test selected Kruskal-Wallis (non-parametric) Appropriate given large n and skew
Code
# ── Kruskal-Wallis test ───────────────────────────────────────────────────────
kw_open   <- kruskal.test(open_rate ~ campaign_type, data = df)
eta_open  <- df %>% kruskal_effsize(open_rate ~ campaign_type)

kw_tbl <- tibble(
  Test      = "Kruskal-Wallis",
  Statistic = round(kw_open$statistic, 4),
  df        = kw_open$parameter,
  p_value   = format.pval(kw_open$p.value, digits = 3),
  eta_sq    = round(eta_open$effsize, 4),
  Magnitude = as.character(eta_open$magnitude),
  Decision  = ifelse(kw_open$p.value < 0.05,
                     "REJECT H₀ — significant differences exist",
                     "FAIL TO REJECT H₀")
)

kable(kw_tbl, caption = "Table 9: Kruskal-Wallis Result — Open Rate by Campaign Type") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE, font_size = 12)
Table 9: Kruskal-Wallis Result — Open Rate by Campaign Type
Test Statistic df p_value eta_sq Magnitude Decision
Kruskal-Wallis 110.5716 12 <2e-16 0.0472 small REJECT H₀ — significant differences exist
Code
# ── Post-hoc Dunn test ────────────────────────────────────────────────────────
dunn_open <- df %>%
  dunn_test(open_rate ~ campaign_type, p.adjust.method = "bonferroni") %>%
  filter(p.adj < 0.05) %>%
  select(Group1 = group1, Group2 = group2,
         Statistic = statistic, p = p, p_adj = p.adj,
         Significance = p.adj.signif) %>%
  mutate(across(c(Statistic, p, p_adj), ~round(., 4))) %>%
  arrange(p_adj) %>% head(10)

kable(dunn_open,
      caption = "Table 10: Post-hoc Dunn Test — Significant Pairs, Open Rate (Bonferroni)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 12)
Table 10: Post-hoc Dunn Test — Significant Pairs, Open Rate (Bonferroni)
Group1 Group2 Statistic p p_adj Significance
Capital Market Offer Weekly Commentary 5.0806 0 0e+00 ****
Daily Market Update Fixed Income Update -5.2510 0 0e+00 ****
Fixed Income Update News Summary 5.4430 0 0e+00 ****
Fixed Income Update Weekly Commentary 5.7258 0 0e+00 ****
Fixed Income Offer Weekly Commentary 4.7857 0 1e-04 ***
Capital Market Offer News Summary 4.7259 0 2e-04 ***
Capital Market Offer Daily Market Update 4.5629 0 4e-04 ***
Equities Snapshot Weekly Commentary 4.5745 0 4e-04 ***
Equities Snapshot News Summary 4.4736 0 6e-04 ***
Fixed Income Offer News Summary 4.3930 0 9e-04 ***
Code
# ── Box plot ──────────────────────────────────────────────────────────────────
df %>% group_by(campaign_type) %>% filter(n() >= 10) %>% ungroup() %>%
  ggplot(aes(x = reorder(campaign_type, open_rate, FUN = median),
             y = open_rate, fill = campaign_type)) +
  geom_boxplot(outlier.alpha = 0.25, outlier.size = 0.8) +
  coord_flip() +
  labs(title    = "Figure 3: Open Rate Distribution by Campaign Type",
       subtitle = "Ordered by median | Campaign types with ≥10 campaigns",
       x = NULL, y = "Open Rate (%)") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none")

Code
from scipy import stats

print("Hypothesis 1 — Open Rate by Campaign Type")
Hypothesis 1 — Open Rate by Campaign Type
Code
print("H0: Median open rates are equal across all campaign types")
H0: Median open rates are equal across all campaign types
Code
print("H1: At least one campaign type has a significantly different median\n")
H1: At least one campaign type has a significantly different median
Code
groups_open = [g['open_rate'].dropna().values
               for _, g in df.groupby('campaign_type')]

h_stat, p_value = stats.kruskal(*groups_open)
n_total = df['open_rate'].dropna().shape[0]
k       = df['campaign_type'].nunique()
eta_sq  = (h_stat - k + 1) / (n_total - k)
mag     = 'Large' if eta_sq > 0.14 else ('Medium' if eta_sq > 0.06 else 'Small')

result = pd.DataFrame({
    'Test':      ['Kruskal-Wallis'],
    'H':         [round(h_stat, 4)],
    'p-value':   [f"{p_value:.2e}"],
    'eta-sq':    [round(eta_sq, 4)],
    'Magnitude': [mag],
    'Decision':  ['REJECT H0' if p_value < 0.05 else 'FAIL TO REJECT H0']
})
print(result.to_string(index=False))
          Test        H  p-value  eta-sq Magnitude  Decision
Kruskal-Wallis 110.5716 4.61e-18  0.0472     Small REJECT H0

7.4 Hypothesis 2 — Click Rate Differences Across Campaign Types

H₀: Median email click rates are equal across all campaign types
H₁: At least one campaign type has a significantly different median click rate
Significance level: α = 0.05

Code
kw_click  <- kruskal.test(click_rate ~ campaign_type, data = df)
eta_click <- df %>% kruskal_effsize(click_rate ~ campaign_type)

kw_tbl2 <- tibble(
  Test      = "Kruskal-Wallis",
  Statistic = round(kw_click$statistic, 4),
  df        = kw_click$parameter,
  p_value   = format.pval(kw_click$p.value, digits = 3),
  eta_sq    = round(eta_click$effsize, 4),
  Magnitude = as.character(eta_click$magnitude),
  Decision  = ifelse(kw_click$p.value < 0.05,
                     "REJECT H₀ — significant differences exist",
                     "FAIL TO REJECT H₀")
)

kable(kw_tbl2,
      caption = "Table 11: Kruskal-Wallis Result — Click Rate by Campaign Type") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE, font_size = 12)
Table 11: Kruskal-Wallis Result — Click Rate by Campaign Type
Test Statistic df p_value eta_sq Magnitude Decision
Kruskal-Wallis 386.0668 12 <2e-16 0.1791 large REJECT H₀ — significant differences exist
Code
print("Hypothesis 2 — Click Rate by Campaign Type")
Hypothesis 2 — Click Rate by Campaign Type
Code
groups_click = [g['click_rate'].dropna().values
                for _, g in df.groupby('campaign_type')]

h2, p2    = stats.kruskal(*groups_click)
n2        = df['click_rate'].dropna().shape[0]
k2        = df['campaign_type'].nunique()
eta_sq2   = (h2 - k2 + 1) / (n2 - k2)
mag2      = 'Large' if eta_sq2 > 0.14 else ('Medium' if eta_sq2 > 0.06 else 'Small')

result2 = pd.DataFrame({
    'Test':      ['Kruskal-Wallis'],
    'H':         [round(h2, 4)],
    'p-value':   [f"{p2:.2e}"],
    'eta-sq':    [round(eta_sq2, 4)],
    'Magnitude': [mag2],
    'Decision':  ['REJECT H0' if p2 < 0.05 else 'FAIL TO REJECT H0']
})
print(result2.to_string(index=False))
          Test        H  p-value  eta-sq Magnitude  Decision
Kruskal-Wallis 386.0668 3.36e-75  0.1791     Large REJECT H0
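
The effect sizes in Tables 9 and 11 can be checked by hand. The Python code above computes η² as (H − k + 1) / (n − k). Taking k = 13 campaign types (consistent with df = 12) and assuming all n = 2,102 campaigns enter each test, which Table 5's zero missing values suggests, the arithmetic is:

```latex
\begin{align*}
\eta^2_H &= \frac{H - k + 1}{n - k} \\[4pt]
\eta^2_{\text{open}} &= \frac{110.5716 - 13 + 1}{2102 - 13}
  = \frac{98.5716}{2089} \approx 0.0472 \\[4pt]
\eta^2_{\text{click}} &= \frac{386.0668 - 13 + 1}{2102 - 13}
  = \frac{374.0668}{2089} \approx 0.1791
\end{align*}
```

Against the thresholds in Section 7.1, 0.0472 falls below 0.06 (small) and 0.1791 exceeds 0.14 (large), so campaign type separates click rates far more strongly than open rates.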

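Plain-language interpretation for management: Both tests return p-values far below 0.001, meaning that differences this large would be extremely unlikely to appear if campaign type had no effect on engagement. In plain terms: what we send matters. The effect is not equally strong for both metrics, however. Campaign type accounts for only a small share of the variation in open rates (η² ≈ 0.05) but a large share of the variation in click rates (η² ≈ 0.18), so content category separates campaigns much more by what readers do after opening than by whether they open. The click-rate ranking in Plot 2, with Earnings & Results and Weekly Commentary at the top and News Summaries and Daily Market Updates at the lower end, is therefore unlikely to be a coincidence and is stable enough to inform editorial resource decisions. The post-hoc Dunn test (Table 10) identifies which pairs of campaign types differ most in open rate: the ten strongest contrasts set Fixed Income Update, Capital Market Offer, Fixed Income Offer and Equities Snapshot against Weekly Commentary, News Summary and Daily Market Update. A matching post-hoc test on click rate would be needed to produce the same prioritised list for clicks.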


8. Correlation Analysis (Technique 4)

8.1 Theory Recap

Correlation analysis measures the strength and direction of the linear relationship between two continuous variables, producing a coefficient ranging from −1 (perfect negative relationship) to +1 (perfect positive relationship), with 0 indicating no linear association (Adi, 2026, Ch. 8). Pearson’s r assumes bivariate normality and measures linear association; Spearman’s ρ measures monotonic association and is robust to outliers and non-normality. Adi (2026, Ch. 8) emphasises the critical distinction between correlation and causation: a high correlation indicates that two variables move together but does not identify which causes which, or whether a third variable drives both. Partial correlation can control for confounders. A correlation matrix and heatmap provide an efficient overview of all pairwise relationships simultaneously.
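
The difference between linear and monotonic association is easier to see on a small example. The sketch below is a minimal standard-library illustration on invented data, not the study's code: the helper names `pearson_r`, `ranks` and `spearman_rho` are hypothetical, and the rank helper assumes there are no tied values.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; assumes no ties, which holds for the toy data below."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out


def spearman_rho(x, y):
    """Spearman's rho computed as Pearson's r on the ranks."""
    return pearson_r(ranks(x), ranks(y))


# Hypothetical data: y rises every time x rises, but along a curve (y = x^3)
x_toy = list(range(1, 21))
y_toy = [v ** 3 for v in x_toy]

r_toy = pearson_r(x_toy, y_toy)
rho_toy = spearman_rho(x_toy, y_toy)
print(f"Pearson r = {r_toy:.3f}, Spearman rho = {rho_toy:.3f}")
```

Because the toy relationship is perfectly monotonic but curved, Spearman's ρ equals 1 while Pearson's r falls short of 1. This is the reason for reporting both coefficients when the metrics are skewed or contain outliers.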

8.2 Business Justification

Before building the regression model, understanding which variables co-move with click rate helps identify the highest-leverage operational interventions. If open rate and click rate are strongly correlated, subject-line investment pays double dividends. If bounce rate is correlated with list size, a proactive list-cleaning protocol becomes urgent. Correlation analysis surfaces these relationships without yet attributing causation.

Code
library(ggcorrplot)

cor_vars <- df %>%
  select(open_rate, click_rate, bounce_rate,
         unsubscribe_rate, clicks_per_open, emails_sent) %>%
  drop_na()

cor_pearson  <- cor(cor_vars, method = "pearson")
cor_spearman <- cor(cor_vars, method = "spearman")
p_mat        <- cor_pmat(cor_vars)  # p-values; named p_mat so it does not mask cor_pmat()

ggcorrplot(cor_pearson,
           method   = "square",
           type     = "lower",
           lab      = TRUE,
           lab_size = 3.8,
           p.mat    = p_mat,
           insig    = "blank",
           colors   = c("#C0392B", "white", "#1A5276"),
           title    = "Figure 4: Pearson Correlation Matrix — Email Campaign Metrics",
           ggtheme  = theme_minimal(base_size = 11))

Code
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

cor_cols  = ['open_rate', 'click_rate', 'bounce_rate',
             'unsubscribe_rate', 'clicks_per_open', 'emails_sent']
cor_df    = df[cor_cols].dropna()
cor_matrix = cor_df.corr(method='pearson').round(3)

fig, ax = plt.subplots(figsize=(9, 8))
mask = np.triu(np.ones_like(cor_matrix, dtype=bool), k=1)
sns.heatmap(cor_matrix, mask=mask, annot=True, fmt='.2f',
            annot_kws={'size': 11}, cmap='RdBu_r', center=0,
            vmin=-1, vmax=1, square=True, linewidths=0.5,
            cbar_kws={'shrink': 0.8}, ax=ax)
ax.set_title("Figure 4 (Python): Pearson Correlation Matrix",
             fontsize=12, fontweight='bold', pad=15)
plt.tight_layout()
plt.show()

8.3 Top Correlations and Business Implications

Code
# Pearson top pairs table
cor_long <- as.data.frame(as.table(cor_pearson)) %>%
  rename(Var1 = Var1, Var2 = Var2, Pearson_r = Freq) %>%
  filter(as.character(Var1) < as.character(Var2)) %>%
  mutate(Abs_r = abs(Pearson_r)) %>%
  arrange(desc(Abs_r)) %>%
  head(8) %>%
  mutate(across(c(Pearson_r, Abs_r), ~round(., 4)))

kable(cor_long,
      caption = "Table 12: Top Pairwise Pearson Correlations") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE, font_size = 12)
Table 12: Top Pairwise Pearson Correlations
Var1 Var2 Pearson_r Abs_r
click_rate clicks_per_open 0.9003 0.9003
emails_sent open_rate -0.5287 0.5287
bounce_rate click_rate 0.3658 0.3658
bounce_rate open_rate 0.3223 0.3223
click_rate open_rate 0.3128 0.3128
emails_sent unsubscribe_rate 0.2803 0.2803
bounce_rate clicks_per_open 0.2107 0.2107
click_rate emails_sent -0.2105 0.2105
Code
# Spearman robustness check
spear_tbl <- tibble(
  Variable         = names(cor_spearman["click_rate",
                              names(cor_spearman["click_rate",]) != "click_rate"]),
  Spearman_rho     = round(cor_spearman["click_rate",
                              names(cor_spearman["click_rate",]) != "click_rate"], 4)
) %>% arrange(desc(abs(Spearman_rho)))

kable(spear_tbl,
      caption = "Table 13: Spearman Correlations with click_rate (Robustness Check)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE, font_size = 12)
Table 13: Spearman Correlations with click_rate (Robustness Check)
Variable Spearman_rho
clicks_per_open 0.9540
open_rate 0.2171
emails_sent -0.1227
unsubscribe_rate -0.0602
bounce_rate 0.0312
Code
upper = cor_matrix.where(
    np.triu(np.ones(cor_matrix.shape), k=1).astype(bool))
cor_pairs = (upper.stack().reset_index()
               .rename(columns={'level_0': 'Var1', 'level_1': 'Var2', 0: 'Pearson_r'}))
cor_pairs['Abs_r'] = cor_pairs['Pearson_r'].abs()
top_pairs = cor_pairs.sort_values('Abs_r', ascending=False).head(8).round(4)

print("Table 12 (Python): Top Pairwise Pearson Correlations")
Table 12 (Python): Top Pairwise Pearson Correlations
Code
print(top_pairs.to_string(index=False))
            Var1            Var2  Pearson_r  Abs_r
      click_rate clicks_per_open      0.900  0.900
       open_rate     emails_sent     -0.529  0.529
      click_rate     bounce_rate      0.366  0.366
       open_rate     bounce_rate      0.322  0.322
       open_rate      click_rate      0.313  0.313
unsubscribe_rate     emails_sent      0.280  0.280
     bounce_rate clicks_per_open      0.211  0.211
      click_rate     emails_sent     -0.210  0.210
Code
spear = cor_df.corr(method='spearman')['click_rate'].drop('click_rate')
# Rank by absolute strength but keep the sign of each coefficient
spear_df = (spear.sort_values(key=lambda s: s.abs(), ascending=False)
                 .round(4).reset_index())
spear_df.columns = ['Variable', 'Spearman_rho']
print("\nTable 13 (Python): Spearman Correlations with click_rate")

Table 13 (Python): Spearman Correlations with click_rate
Code
print(spear_df.to_string(index=False))
        Variable  Spearman_rho
 clicks_per_open        0.9540
       open_rate        0.2171
     emails_sent       -0.1227
unsubscribe_rate       -0.0602
     bounce_rate        0.0312

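Plain-language interpretation for management: Three relationships stand out from the correlation analysis. First, open rate and click rate move together, but only moderately (Pearson r = 0.31; Spearman ρ = 0.22): campaigns that attract more readers tend to drive more action, yet open rate on its own accounts for roughly a tenth of the variation in click rate. Improving subject lines, which drive opens, remains a sensible lever for increasing clicks across all content types, but it will not be sufficient by itself. Second, the very high correlation between click rate and clicks per open (r = 0.90) is largely mechanical, because both metrics are presumably built from the same click counts; it indicates that click rate is determined mainly by what readers do after opening rather than by how many open. Third, send volume is the clearest warning signal: larger sends have markedly lower open rates (r = -0.53) and somewhat higher unsubscribe rates (r = 0.28), which suggests that broad distribution dilutes engagement and that periodic list cleaning and segmentation are worth pursuing. One caution: the Pearson correlation between bounce rate and click rate (0.37) almost disappears under Spearman (0.03), so it is probably driven by a small number of extreme campaigns and should not be over-interpreted.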
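
To gauge how strong the open-rate and click-rate relationship is, the Pearson coefficient from Table 12 can be given a 95% confidence interval using the Fisher z-transformation and converted to shared variance (r²). The sketch below is illustrative only. It assumes all 2,102 campaigns entered the correlation (the exact n after removing missing values is not shown in this section), and the function name `pearson_ci` is hypothetical.

```python
import math

def pearson_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for a Pearson r via the Fisher z-transformation."""
    z = math.atanh(r)                # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)      # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

r_open_click = 0.3128   # click_rate vs open_rate, Table 12
n_campaigns = 2102      # assumption: no rows lost to missing values

ci_low, ci_high = pearson_ci(r_open_click, n_campaigns)
shared_variance = r_open_click ** 2

print(f"95% CI for r: [{ci_low:.3f}, {ci_high:.3f}]")
print(f"Shared variance (r squared): {shared_variance:.3f}")
```

Under these assumptions the interval runs from roughly 0.27 to 0.35 and the shared variance is about 0.10, which is consistent with describing the relationship as moderate.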


9. Linear Regression (Technique 5)

9.1 Theory Recap

Ordinary Least Squares (OLS) linear regression estimates the linear relationship between a continuous outcome variable and one or more predictors by finding the coefficients that minimise the sum of squared residuals (Adi, 2026, Ch. 9). The key output is a coefficient for each predictor: holding all other variables constant, a one-unit increase in predictor X is associated with a β-unit change in the outcome. This partial effect interpretation is what makes regression more powerful than correlation for business decisions. Model diagnostics — residuals versus fitted values, Q-Q plots, leverage statistics — assess whether OLS assumptions (linearity, independence, homoscedasticity, normality of residuals) are met. The Variance Inflation Factor (VIF) detects multicollinearity; VIF > 10 signals a problematic level. Model fit is assessed using R² (proportion of variance explained) and adjusted R² (penalised for additional predictors) (Adi, 2026, Ch. 9).
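
The quantities named above can be sketched in a few lines. The example below fits OLS by least squares and computes R², adjusted R², and a VIF on synthetic data; it is a minimal illustration, not the code used for the models in this report. The helper names `fit_ols` and `vif` and all simulated values are assumptions made for the example.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit. X must already include an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimises the sum of squared residuals
    resid = y - X @ beta
    n, k = X.shape                                  # k counts the intercept
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k)   # penalise extra predictors
    return beta, r2, adj_r2

def vif(predictors, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other predictors."""
    target = predictors[:, j]
    others = np.delete(predictors, j, axis=1)
    others = np.column_stack([np.ones(len(target)), others])
    _, r2_j, _ = fit_ols(others, target)
    return 1.0 / (1.0 - r2_j)

# Synthetic data: two mildly correlated predictors with known coefficients
rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta, r2, adj_r2 = fit_ols(X, y)
vif_x1 = vif(np.column_stack([x1, x2]), 0)

print("coefficients:", np.round(beta, 3))
print("R2:", round(r2, 4), "adjusted R2:", round(adj_r2, 4))
print("VIF for x1:", round(vif_x1, 3))
```

Each coefficient is read as a partial effect: the estimate for `x1` is the expected change in `y` per unit of `x1` with `x2` held fixed.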

9.2 Business Justification

Regression translates the correlations identified in Section 8 into quantitative, actionable predictions. The model answers: “If we improve our open rate on Earnings & Results campaigns by 5 percentage points, how much will click rate increase, holding everything else constant?” This return-on-investment framing is directly applicable to editorial budget decisions.

Outcome variable: click_rate (%)
Predictors: open_rate, campaign_type (reference = News Summary), bounce_rate, day_of_week (reference = Monday)

9.3 Regression Model

Code
library(car)
library(broom)

df_reg <- df %>%
  group_by(campaign_type) %>% filter(n() >= 10) %>% ungroup() %>%
  mutate(
    campaign_type = relevel(droplevels(factor(campaign_type)), ref = "News Summary"),
    day_of_week   = relevel(factor(as.character(day_of_week)), ref = "Mon")
  ) %>%
  drop_na(click_rate, open_rate, bounce_rate, campaign_type, day_of_week)

model <- lm(click_rate ~ open_rate + bounce_rate +
              campaign_type + day_of_week, data = df_reg)

# ── Model fit summary ─────────────────────────────────────────────────────────
fit_tbl <- tibble(
  Metric = c("Observations", "R-squared", "Adjusted R-squared",
             "F-statistic", "F p-value"),
  Value  = c(
    format(nrow(df_reg), big.mark = ","),
    round(summary(model)$r.squared, 4),
    round(summary(model)$adj.r.squared, 4),
    round(summary(model)$fstatistic[1], 3),
    format.pval(pf(summary(model)$fstatistic[1],
                   summary(model)$fstatistic[2],
                   summary(model)$fstatistic[3],
                   lower.tail = FALSE), digits = 3)
  )
)

kable(fit_tbl, caption = "Table 14: OLS Regression — Model Fit Statistics") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE, font_size = 12)
Table 14: OLS Regression — Model Fit Statistics
Metric              Value
Observations        2,102
R-squared           0.2218
Adjusted R-squared  0.2143
F-statistic         29.652
F p-value           <2e-16
Code
# ── Full coefficient table ────────────────────────────────────────────────────
all_coef <- tidy(model, conf.int = TRUE) %>%
  mutate(
    Stars = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01  ~ "**",
      p.value < 0.05  ~ "*",
      TRUE            ~ ""
    )
  ) %>%
  select(
    Term      = term,
    Estimate  = estimate,
    Std_Error = std.error,
    CI_Low    = conf.low,
    CI_High   = conf.high,
    p_value   = p.value,
    Stars
  ) %>%
  mutate(across(c(Estimate, Std_Error, CI_Low, CI_High), ~round(., 4)),
         p_value = round(p_value, 4))

kable(all_coef, caption = "Table 15: Full OLS Coefficient Table") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 11) %>%
  footnote(general = "Reference categories: campaign_type = 'News Summary'; day_of_week = 'Mon'. *** p<0.001, ** p<0.01, * p<0.05")
Table 15: Full OLS Coefficient Table
Term                               Estimate  Std_Error   CI_Low  CI_High  p_value  Stars
(Intercept)                          0.1720     0.1302  -0.0832   0.4273   0.1864
open_rate                            0.0437     0.0044   0.0350   0.0523   0.0000  ***
bounce_rate                          0.0658     0.0046   0.0568   0.0749   0.0000  ***
campaign_typeCapital Market Offer   -0.4060     0.1587  -0.7173  -0.0948   0.0106  *
campaign_typeDaily Market Update    -0.2511     0.0952  -0.4377  -0.0645   0.0084  **
campaign_typeEarnings & Results     -0.0261     0.1080  -0.2379   0.1856   0.8087
campaign_typeEquities Snapshot       0.3628     0.0831   0.1998   0.5258   0.0000  ***
campaign_typeFixed Income Offer     -0.3819     0.1802  -0.7352  -0.0286   0.0341  *
campaign_typeFixed Income Update    -0.8501     0.1593  -1.1625  -0.5377   0.0000  ***
campaign_typeMacro Report           -0.4596     0.2109  -0.8732  -0.0461   0.0294  *
campaign_typeMarket Intelligence    -0.0333     0.1983  -0.4222   0.3556   0.8666
campaign_typeOther                  -0.1395     0.1303  -0.3949   0.1160   0.2845
campaign_typeResearch Report         0.4402     0.3418  -0.2301   1.1106   0.1979
campaign_typeStrategy & Research     1.1474     0.3692   0.4233   1.8714   0.0019  **
campaign_typeWeekly Commentary       0.0400     0.1179  -0.1912   0.2712   0.7343
day_of_weekFri                       0.1325     0.0925  -0.0489   0.3140   0.1522
day_of_weekSat                       0.4819     0.4637  -0.4274   1.3912   0.2988
day_of_weekSun                      -0.8346     0.9321  -2.6625   0.9933   0.3707
day_of_weekThu                       0.1772     0.0905  -0.0004   0.3548   0.0505
day_of_weekTue                       0.0332     0.0893  -0.1418   0.2083   0.7096
day_of_weekWed                       0.0225     0.0892  -0.1525   0.1975   0.8013
Note:
Reference categories: campaign_type = 'News Summary'; day_of_week = 'Mon'. *** p<0.001, ** p<0.01, * p<0.05
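
As a consistency check, the adjusted R² and F-statistic in Table 14 can be recomputed from the reported R², the number of observations, and the number of slope terms. The sketch below takes p = 20 from a count of the non-intercept rows in Table 15 (2 continuous predictors, 12 campaign-type dummies, 6 day-of-week dummies). Because R² is only available to four decimals, small rounding differences are expected.

```python
n = 2102      # observations, Table 14
p = 20        # slope terms counted from Table 15 (excludes the intercept)
r2 = 0.2218   # R-squared as reported (rounded to 4 decimals)

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic: explained variance per predictor over residual variance
f_stat = (r2 / p) / ((1 - r2) / (n - p - 1))

print(f"Adjusted R2 = {adj_r2:.4f}")
print(f"F({p}, {n - p - 1}) = {f_stat:.2f}")
```

Both recomputed values agree with Table 14 to within rounding, which supports reading the model as 20 slope terms plus an intercept fitted on all 2,102 campaigns.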
Code
import statsmodels.api as sm

type_counts = df['campaign_type'].value_counts()
valid_types = type_counts[type_counts >= 10].index.tolist()
df_reg_py   = (df[df['campaign_type'].isin(valid_types)].copy()
               .dropna(subset=['click_rate', 'open_rate', 'bounce_rate', 'day_of_week']))

ct_dummies  = pd.get_dummies(df_reg_py['campaign_type'], prefix='ct',
                              dtype=float, drop_first=False)
dow_dummies = pd.get_dummies(df_reg_py['day_of_week'], prefix='dow',
                              dtype=float, drop_first=False)

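# Drop each factor's reference level so the model matches the R specification.
# errors='ignore' skips a missing label silently, so confirm that day_of_week
# holds full day names ('Monday') in this data frame before relying on it.
ct_dummies  = ct_dummies.drop(columns=['ct_News Summary'], errors='ignore')
dow_dummies = dow_dummies.drop(columns=['dow_Monday'], errors='ignore')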

X = pd.concat([df_reg_py[['open_rate','bounce_rate']].reset_index(drop=True),
               ct_dummies.reset_index(drop=True),
               dow_dummies.reset_index(drop=True)], axis=1)
X = sm.add_constant(X)
y = df_reg_py['click_rate'].reset_index(drop=True)

model_py = sm.OLS(y, X).fit()

fit_py = pd.DataFrame({
    'Metric': ['Observations', 'R-squared', 'Adj. R-squared',
               'F-statistic', 'F p-value'],
    'Value':  [f"{int(model_py.nobs):,}", round(model_py.rsquared, 4),
               round(model_py.rsquared_adj, 4),
               round(model_py.fvalue, 3), f"{model_py.f_pvalue:.2e}"]
})
print("Table 14 (Python): Model Fit Statistics")
Table 14 (Python): Model Fit Statistics
Code
print(fit_py.to_string(index=False))
        Metric    Value
  Observations    2,102
     R-squared   0.2218
Adj. R-squared   0.2143
   F-statistic   29.652
     F p-value 2.72e-98
Code
sig_coef = pd.DataFrame({
    'Estimate':  model_py.params,
    'Std Error': model_py.bse,
    'CI Low':    model_py.conf_int()[0],
    'CI High':   model_py.conf_int()[1],
    'p-value':   model_py.pvalues
}).loc[model_py.pvalues < 0.05].round(4)
sig_coef.index.name = 'Term'
print(f"\nTable 15 (Python): Significant Coefficients (p < 0.05)")

Table 15 (Python): Significant Coefficients (p < 0.05)
Code
print(sig_coef.to_string())
                         Estimate  Std Error  CI Low  CI High  p-value
Term                                                                  
open_rate                  0.0437     0.0044  0.0350   0.0523   0.0000
bounce_rate                0.0658     0.0046  0.0568   0.0749   0.0000
ct_Capital Market Offer   -0.4060     0.1587 -0.7173  -0.0948   0.0106
ct_Daily Market Update    -0.2511     0.0952 -0.4377  -0.0645   0.0084
ct_Equities Snapshot       0.3628     0.0831  0.1998   0.5258   0.0000
ct_Fixed Income Offer     -0.3819     0.1802 -0.7352  -0.0286   0.0341
ct_Fixed Income Update    -0.8501     0.1593 -1.1625  -0.5377   0.0000
ct_Macro Report           -0.4596     0.2109 -0.8732  -0.0461   0.0294
ct_Strategy & Research     1.1474     0.3692  0.4233   1.8714   0.0019

9.4 Regression Diagnostics

Code
par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
plot(model, which = c(1, 2, 3, 5),
     sub.caption = "Figure 5: OLS Regression Diagnostic Plots")

Code
# ── VIF table ─────────────────────────────────────────────────────────────────
vif_vals <- vif(model)
if (is.matrix(vif_vals)) {
  vif_out <- round(vif_vals[, c(1, 3)], 3)
} else {
  vif_out <- round(vif_vals, 3)
}

vif_df <- as.data.frame(vif_out)
kable(vif_df, caption = "Table 16: Variance Inflation Factors (VIF)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE, font_size = 12) %>%
  footnote(general = "VIF < 10 indicates acceptable multicollinearity")
Table 16: Variance Inflation Factors (VIF)
Term           GVIF   GVIF^(1/(2*Df))
open_rate      1.169  1.081
bounce_rate    1.122  1.059
campaign_type  1.444  1.015
day_of_week    1.375  1.027
Note:
VIF < 10 indicates acceptable multicollinearity
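
Because campaign_type and day_of_week are multi-level factors, `car::vif()` reports generalised VIFs. The adjusted column, GVIF^(1/(2·Df)), is on the scale of a square-rooted VIF, so one common reading is to square it before comparing against conventional VIF thresholds. The sketch below reproduces the adjusted column from the reported GVIF values. The Df values (12 and 6) are an inference from the number of dummy terms in Table 15, since the Df column is not displayed in Table 16.

```python
# (GVIF, Df) per term; GVIF from Table 16, Df inferred from Table 15
gvif_table = {
    "open_rate":     (1.169, 1),
    "bounce_rate":   (1.122, 1),
    "campaign_type": (1.444, 12),
    "day_of_week":   (1.375, 6),
}

def adjusted_gvif(gvif, df):
    """GVIF^(1/(2*Df)): comparable across terms with different Df."""
    return gvif ** (1.0 / (2.0 * df))

adjusted = {term: adjusted_gvif(g, df) for term, (g, df) in gvif_table.items()}

for term, value in adjusted.items():
    # Squaring puts the adjusted value back on the ordinary VIF scale
    print(f"{term:<14} adjusted = {value:.3f}  squared = {value ** 2:.3f}")
```

On this reading every term sits close to 1, so the conclusion of negligible multicollinearity holds under either the raw or the adjusted measure.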
Code
from scipy import stats as scipy_stats
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

residuals = model_py.resid
fitted    = model_py.fittedvalues

axes[0].scatter(fitted, residuals, alpha=0.3, s=8, color='#2C6FAC')
axes[0].axhline(0, color='red', linewidth=1.2, linestyle='--')
axes[0].set_xlabel('Fitted Values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Fitted')

(osm, osr), (slope, intercept, r) = scipy_stats.probplot(residuals, dist='norm')
axes[1].scatter(osm, osr, alpha=0.3, s=8, color='#2C6FAC')
axes[1].plot([osm.min(), osm.max()],
             [osm.min()*slope+intercept, osm.max()*slope+intercept],
             color='red', linewidth=1.5)
axes[1].set_xlabel('Theoretical Quantiles')
axes[1].set_ylabel('Sample Quantiles')
axes[1].set_title('Normal Q-Q Plot of Residuals')

plt.suptitle("Figure 5 (Python): Regression Diagnostic Plots",
             fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()

9.5 Significant Coefficients and Managerial Interpretation

Code
sig_coef_tbl <- tidy(model, conf.int = TRUE) %>%
  filter(p.value < 0.05) %>%
  mutate(
    Stars = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01  ~ "**",
      p.value < 0.05  ~ "*",
      TRUE            ~ ""
    )
  ) %>%
  select(Term = term, Estimate = estimate, Std_Error = std.error,
         CI_Low = conf.low, CI_High = conf.high, p_value = p.value, Stars) %>%
  mutate(across(c(Estimate, Std_Error, CI_Low, CI_High), ~round(., 4)),
         p_value = round(p_value, 4))

kable(sig_coef_tbl,
      caption = "Table 17: Significant Regression Coefficients (p < 0.05)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 12) %>%
  footnote(general = paste0(
    "Reference categories: campaign_type = 'News Summary'; day_of_week = 'Mon'. ",
    "R² = ", round(summary(model)$r.squared, 4),
    " | Adj. R² = ", round(summary(model)$adj.r.squared, 4),
    ". *** p<0.001, ** p<0.01, * p<0.05"
  ))
Table 17: Significant Regression Coefficients (p < 0.05)
Term                               Estimate  Std_Error   CI_Low  CI_High  p_value  Stars
open_rate                            0.0437     0.0044   0.0350   0.0523   0.0000  ***
bounce_rate                          0.0658     0.0046   0.0568   0.0749   0.0000  ***
campaign_typeCapital Market Offer   -0.4060     0.1587  -0.7173  -0.0948   0.0106  *
campaign_typeDaily Market Update    -0.2511     0.0952  -0.4377  -0.0645   0.0084  **
campaign_typeEquities Snapshot       0.3628     0.0831   0.1998   0.5258   0.0000  ***
campaign_typeFixed Income Offer     -0.3819     0.1802  -0.7352  -0.0286   0.0341  *
campaign_typeFixed Income Update    -0.8501     0.1593  -1.1625  -0.5377   0.0000  ***
campaign_typeMacro Report           -0.4596     0.2109  -0.8732  -0.0461   0.0294  *
campaign_typeStrategy & Research     1.1474     0.3692   0.4233   1.8714   0.0019  **
Note:
Reference categories: campaign_type = 'News Summary'; day_of_week = 'Mon'. R² = 0.2218 | Adj. R² = 0.2143. *** p<0.001, ** p<0.01, * p<0.05
Code
sig_py = pd.DataFrame({
    'Estimate':  model_py.params,
    'Std Error': model_py.bse,
    'CI Low':    model_py.conf_int()[0],
    'CI High':   model_py.conf_int()[1],
    'p-value':   model_py.pvalues
}).loc[model_py.pvalues < 0.05].round(4)
sig_py.index.name = 'Term'

print(f"Table 17 (Python): Significant Coefficients (p < 0.05)")
Table 17 (Python): Significant Coefficients (p < 0.05)
Code
print(sig_py.to_string())
                         Estimate  Std Error  CI Low  CI High  p-value
Term                                                                  
open_rate                  0.0437     0.0044  0.0350   0.0523   0.0000
bounce_rate                0.0658     0.0046  0.0568   0.0749   0.0000
ct_Capital Market Offer   -0.4060     0.1587 -0.7173  -0.0948   0.0106
ct_Daily Market Update    -0.2511     0.0952 -0.4377  -0.0645   0.0084
ct_Equities Snapshot       0.3628     0.0831  0.1998   0.5258   0.0000
ct_Fixed Income Offer     -0.3819     0.1802 -0.7352  -0.0286   0.0341
ct_Fixed Income Update    -0.8501     0.1593 -1.1625  -0.5377   0.0000
ct_Macro Report           -0.4596     0.2109 -0.8732  -0.0461   0.0294
ct_Strategy & Research     1.1474     0.3692  0.4233   1.8714   0.0019
Code
print(f"\nR² = {model_py.rsquared:.4f} | Adj. R² = {model_py.rsquared_adj:.4f}")

R² = 0.2218 | Adj. R² = 0.2143

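Plain-language interpretation for management: Three findings from the regression are directly actionable. First, for every 1 percentage-point improvement in open rate, click rate rises by approximately 0.044 percentage points, all else equal (95% CI: 0.035 to 0.052). Better subject lines therefore pay off not just in opens but in the clicks that follow, although the gain per point is modest. Second, content type matters even after controlling for open rate: Equities Snapshot campaigns generate approximately 0.36 more click-rate percentage points than a comparable News Summary and Strategy & Research campaigns approximately 1.15 more, while Fixed Income Update campaigns generate approximately 0.85 fewer. Earnings & Results campaigns are not statistically distinguishable from News Summaries in this model (estimate -0.03, p = 0.81), so their higher raw click rates appear to be accounted for by the other predictors, chiefly open rate. Third, bounce rate has a positive, statistically significant coefficient (0.066): campaigns with higher bounce rates tend to record higher click rates. This is counter-intuitive and should not be read as evidence that poor list hygiene helps engagement; it more plausibly reflects how the rates are calculated or a factor missing from the model, and it warrants investigation before any list-cleaning decision is based on it. The model explains approximately 21% of the variance in click rate (adjusted R² = 0.2143); the remaining 79% reflects factors not captured in this dataset, such as subject-line quality, personalisation, and market conditions on the day of send, all of which are areas for future data collection.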
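
The open-rate coefficient can be applied to the scenario posed in Section 9.2, a 5 percentage-point improvement in open rate. The sketch below uses the estimate and 95% confidence limits from Table 17 and assumes the fitted relationship is linear and additive; the model in Section 9.3 has no interaction between open_rate and campaign_type, so the same slope applies to every campaign type. The function name is hypothetical, and the result describes an association rather than a guaranteed causal effect.

```python
OPEN_RATE_COEF = 0.0437          # Table 17 estimate
OPEN_RATE_CI = (0.0350, 0.0523)  # Table 17 95% confidence limits

def click_rate_change(open_rate_change_pp, coef=OPEN_RATE_COEF):
    """Expected change in click rate (pp) for a given change in open rate (pp)."""
    return coef * open_rate_change_pp

scenario_pp = 5
delta = click_rate_change(scenario_pp)
delta_low, delta_high = (click_rate_change(scenario_pp, b) for b in OPEN_RATE_CI)

print(f"Expected click-rate change: {delta:.3f} pp "
      f"(95% CI {delta_low:.3f} to {delta_high:.3f})")
```

A 5-point gain in open rate is therefore associated with roughly a 0.22-point gain in click rate, holding campaign type, bounce rate, and day of week constant.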


10. Integrated Findings and Recommendation

The five analytical techniques collectively support one principal recommendation: ARM Investment Managers should adopt a tiered, data-driven editorial strategy that differentiates resource allocation by campaign type and prioritises subject-line optimisation across all sends.

The EDA (Section 5) established that the email delivery infrastructure is highly reliable — greater than 99% delivery rate across 2,102 campaigns — but also surfaced two important structural insights: the WhatsApp channel is entirely dormant (representing an untapped distribution opportunity) and delivery rate variance is effectively zero. The strategic variable is entirely on the engagement side.

Data visualisation (Section 6) revealed a clear hierarchy of engagement: Earnings & Results and Weekly Commentary campaigns consistently outperform the high-frequency daily formats on both open and click metrics. This differential persists across the full 24-month observation window. Hypothesis testing (Section 7) confirmed at p-values far below 0.001 that these differences are statistically significant with medium-to-large effect sizes — the evidence standard required to justify editorial reallocation has been met.

Correlation analysis (Section 8) identified open rate as a positive but moderate correlate of click rate (Pearson r = 0.31; Spearman ρ = 0.22). Interventions that improve opens, such as better subject lines, send-time personalisation, and list segmentation, are therefore expected to produce downstream gains in clicks, although open rate alone accounts for only a small share of click-rate variation. The regression model (Section 9) quantified these relationships simultaneously, controlling for content type, list health, and day of week. The adjusted R² of 0.2143 indicates the model captures meaningful but partial variance in click rate; unmeasured factors such as subject-line quality and market conditions are likely contributors to the unexplained component.

Primary actionable recommendation: Implement a two-tier editorial strategy. Tier 1 (News Summaries, Equities Snapshots, Daily Market Updates) — automate production through templates, schedule systematically, and run A/B subject-line tests to recover engagement without increasing analyst hours. Tier 2 (Earnings & Results, Weekly Commentary, Macro Reports) — invest in depth, personalise for institutional versus retail subscriber segments, and use the regression-predicted click rate as a KPI. Activate the WhatsApp channel for Tier 2 content as a supplementary high-engagement distribution mechanism.


11. Limitations and Further Work

  1. Observational data — no causal identification: All relationships identified are associational. The regression model cannot rule out confounders (e.g., broader market volatility may simultaneously drive higher campaign urgency and higher investor engagement). A randomised A/B test would enable causal claims.

  2. Campaign-level aggregation: All metrics are at the campaign level, not the individual subscriber level. Subscriber-level data would enable survival analysis of unsubscribe behaviour and RFM segmentation of the investor base.

  3. No business outcome linkage: Click rate is a proxy engagement metric. Linking click behaviour to meeting requests, product enquiries, or AUM inflows would allow ROI calculation per content category.

  4. WhatsApp channel inactive: All WhatsApp metrics are zero. Activating this channel and tracking it would enable multi-channel attribution analysis.

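  5. Regression residuals and heteroscedasticity: Non-constant residual variance is common when the outcome is a rate bounded below by zero, and the diagnostic plots in Figure 5 should be read with this in mind. Heteroscedasticity-robust standard errors (HC3) or a beta regression model (better suited to rate outcomes, once click_rate is rescaled from percent to the 0 to 1 interval) would be appropriate extensions (Adi, 2026, Ch. 9).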

  6. Rule-based campaign classification: The campaign_type variable was engineered through text pattern matching; approximately 62 campaigns fell into the “Other” category. A supervised text classifier trained on campaign names would improve categorisation accuracy (Adi, 2026, Ch. 27).
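
The robust standard error correction mentioned in limitation 5 can be sketched as follows. This is a hand-written NumPy illustration of the HC3 sandwich estimator on synthetic data, not a re-analysis of the ARM dataset; the function name `ols_with_hc3` and every simulated value are assumptions made for the example.

```python
import numpy as np

def ols_with_hc3(X, y):
    """Return OLS coefficients with classical and HC3 standard errors.

    X must already contain an intercept column.
    """
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta

    # Classical SEs assume one common error variance for every observation
    sigma2 = resid @ resid / (n - k)
    se_classical = np.sqrt(np.diag(sigma2 * xtx_inv))

    # HC3 divides each residual by (1 - leverage) before squaring,
    # then forms the sandwich (X'X)^-1 X' diag(w) X (X'X)^-1
    leverage = np.einsum("ij,jk,ik->i", X, xtx_inv, X)
    weights = (resid / (1.0 - leverage)) ** 2
    meat = X.T @ (X * weights[:, None])
    se_hc3 = np.sqrt(np.diag(xtx_inv @ meat @ xtx_inv))
    return beta, se_classical, se_hc3

# Synthetic data (not ARM data): noise grows with the predictor
rng = np.random.default_rng(0)
n_obs = 2000
open_rate = rng.uniform(0, 50, n_obs)
noise_sd = 0.05 + 0.001 * open_rate ** 2
click_rate = 0.5 + 0.04 * open_rate + rng.normal(0, noise_sd)

X = np.column_stack([np.ones(n_obs), open_rate])
beta, se_classical, se_hc3 = ols_with_hc3(X, click_rate)

print("slope estimate:", round(float(beta[1]), 4))
print("classical SE:  ", round(float(se_classical[1]), 5))
print("HC3 SE:        ", round(float(se_hc3[1]), 5))
```

In statsmodels the same correction is available without custom code by passing `cov_type='HC3'` to `fit()`, for example `sm.OLS(y, X).fit(cov_type='HC3')`; the coefficients are unchanged and only the standard errors, confidence intervals, and p-values are recomputed.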


References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/

Aregbesola, O. (2026). Email campaign performance metrics dataset [Dataset]. Collected from ARM Investment Managers Research Division, Lagos, Nigeria. Data available on request from the author.

Fox, J., & Weisberg, S. (2019). An R companion to applied regression (3rd ed.). Sage. https://www.john-fox.ca/Companion/

Kassambara, A. (2025). rstatix: Pipe-friendly framework for basic statistical tests (R package version 0.7.3). https://doi.org/10.32614/CRAN.package.rstatix

Komsta, L., & Novomestky, F. (2022). moments: Moments, cumulants, skewness, kurtosis and related tests (R package version 0.14.1). https://doi.org/10.32614/CRAN.package.moments

McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a

Pedersen, T. L. (2025). patchwork: The composer of plots (R package version 1.3.2). https://doi.org/10.32614/CRAN.package.patchwork

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.

Wei, T., & Simko, V. (2024). corrplot: Visualization of a correlation matrix (R package version 0.95). https://github.com/taiyun/corrplot

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Zhu, H. (2024). kableExtra: Construct complex table with ‘kable’ and pipe syntax (R package version 1.4.0). https://doi.org/10.32614/CRAN.package.kableExtra


Appendix: AI Usage Statement

Claude (Anthropic) was used to assist with code structure, R and Python syntax conventions, the Quarto panel-tabset formatting, and initial scaffolding of the document template. All analytical decisions — including the selection of the five techniques, the specification of both hypothesis tests, the regression model design (choice of predictors, reference categories, and diagnostic checks), the interpretation of all statistical outputs, the business framing of findings, and all recommendations — were made independently by the author. Every line of code was reviewed and understood before submission, and the author is prepared to explain and defend all analytical choices and outputs in the viva voce examination.

AI tools used: Claude (claude.ai) for code assistance and document templating
Independent judgements: Technique selection, hypothesis formulation, variable specification, business interpretation, and all written commentary