Investor Communication Analytics: Understanding and Predicting Email Campaign Engagement at ARM Investment Managers
Author
Oyinkansola Aregbesola
Published
May 21, 2026
1. Executive Summary
ARM Investment Managers distributes daily and periodic research communications to investors and clients via email. This study analyses 2,102 email campaigns sent between 14 May 2024 and 14 May 2026, spanning content categories including daily news summaries, equity market snapshots, earnings reports, and macro research. Using five complementary analytical techniques — Exploratory Data Analysis, Data Visualisation, Hypothesis Testing, Correlation Analysis, and Linear Regression — this study addresses the business question: What factors drive email click-through rates across investor communication campaigns, and do different content types generate statistically significant differences in audience engagement?
Key findings show that ARM’s average email open rate of approximately 27% is competitive against the financial services industry benchmark of 20–25%. Earnings & Results campaigns attract the highest average click rates, while high-volume daily campaigns show comparatively lower per-send engagement. Hypothesis tests confirm at the 1% significance level that campaign type is a significant driver of both open and click rates. Correlation analysis identifies a strong positive relationship between open rate and click rate, indicating that subject-line and send-time optimisation are the highest-leverage interventions for improving click performance. The regression model explains approximately 21% of variance in click rate, with open rate and campaign type as the strongest predictors. The principal recommendation is a tiered editorial strategy: automate high-frequency, lower-engagement campaigns while directing creative and analytical investment toward high-engagement content categories.
2. Professional Disclosure
Job title: Investment Research Team Lead
Organisation: ARM Investment Managers, Lagos, Nigeria
Sector: Asset Management / Investment Research
ARM Investment Managers is one of Nigeria’s foremost asset management firms, providing investment advisory, portfolio management, and research services to a diverse client base of retail and institutional investors. As Investment Research Team Lead, I oversee the production and distribution of investor communications including daily market snapshots, earnings notes, fixed income updates, macroeconomic research reports, and strategy publications. These communications are distributed primarily via email to a subscriber list comprising investment professionals, retail investors, and institutional clients. The effectiveness of this communication function directly influences client engagement, brand equity, and — ultimately — assets under management.
Technique 1 — Exploratory Data Analysis: EDA is the foundation of any performance review of our communications infrastructure. Before drawing conclusions about which campaigns work, I must verify data quality, understand distributions, identify outliers, and confirm that the metrics exported from our email platform are reliable. This directly maps to my role: every quarter, I review aggregate campaign statistics to brief leadership on communication reach and engagement.
Technique 2 — Data Visualisation: As Team Lead, I regularly present campaign performance to the Head of Research and executive management. Charts showing engagement trends, content-type comparisons, and seasonal patterns are the primary format for these briefings. The grammar of graphics approach ensures that visualisations are deliberately chosen to answer specific business questions, not decorative afterthoughts.
Technique 3 — Hypothesis Testing: Before recommending that the editorial team reallocate resources from daily news distribution toward deeper research content, I need statistical evidence that the observed engagement differences across content types are genuine — not due to chance variation. Hypothesis testing provides this formal standard of evidence.
Technique 4 — Correlation Analysis: Identifying which input variables co-move with click rate informs prioritisation of operational interventions. If open rate and click rate are strongly correlated, then subject-line testing (which drives opens) is more valuable than formatting changes. Understanding these relationships is central to evidence-based content strategy.
Technique 5 — Linear Regression: A regression model translates descriptive observations into quantitative predictions. For management, it answers: “If we improve the open rate on our Earnings Notes by 5 percentage points, how much will click rate change?” This gives a concrete return-on-investment framing to editorial decisions.
3. Data Collection & Sampling
Source: Internal email marketing platform export (campaign-level metrics), ARM Investment Managers, Lagos, Nigeria.
Collection method: The dataset was exported directly from the organisation’s email campaign management platform as a CSV file. Each row represents one discrete campaign send event. The export was performed by the author in the capacity of Investment Research Team Lead with authorised access to the platform’s reporting module.
Variables collected: 21 variables covering campaign identification (campaign name), send timing (date and time), delivery metrics (emails sent, deliveries, delivery rate, bounces, bounce rate), engagement metrics (opens, open rate, clicks, click rate, clicks per unique open, total opens, total clicks), list health metrics (unsubscribes, unsubscribe rate, abuse reports, abuse report rate), and WhatsApp channel metrics (deliveries, delivery rate, total sends).
Sampling frame: All campaigns sent from the ARM Investment Managers email marketing account during the observation period. This is a census of all outbound email communication — there is no sampling; every campaign in the time window is included.
Sample size: 2,102 campaign send events — exceeding the minimum 100-observation threshold by a factor of 21.
Time period covered: 14 May 2024 to 14 May 2026 (24 months, approximately 104 weeks).
Ethical notes: The dataset contains no personally identifiable information (PII). All metrics are aggregated at the campaign level — no individual subscriber names, email addresses, or individual-level behavioural records are present. The data was extracted from the organisation’s proprietary systems in my capacity as team lead with appropriate access rights. No individual consent is required as no personal data is processed.
Data-sharing restrictions: Campaign names and aggregated performance metrics are used here. No individual client data, portfolio data, or commercially sensitive financial projections are included.
Data citation: Aregbesola, O. (2026). Email campaign performance metrics dataset [Dataset]. Collected from ARM Investment Managers Research Division, Lagos, Nigeria. Data available on request from the author.
4. Data Description
This section documents all 21 variables in the raw dataset — their names, data types, roles in the analysis, and distributions. As required by the assessment brief, variable descriptions are produced with code to ensure full reproducibility (Adi, 2026, Ch. 4).
5. Exploratory Data Analysis (Technique 1)
5.1 Theory Recap
Exploratory Data Analysis (EDA), formalised by John Tukey, is the process of systematically examining a dataset before applying inferential or predictive methods (Adi, 2026, Ch. 4). The core procedures include computing measures of central tendency (mean, median) and dispersion (standard deviation, IQR), visualising distributions through histograms and box plots, identifying and documenting missing values, and detecting outliers using the 1.5 × IQR fence rule. Adi (2026, Ch. 4) also introduces Anscombe’s Quartet to demonstrate that summary statistics alone can be misleading, so visual inspection is indispensable. In practice, EDA is not a preliminary step to be rushed; it is the foundation on which the reliability of all subsequent analysis rests.
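The 1.5 × IQR fence rule mentioned above can be sketched in a few lines of Python. This is an illustration only: the function name `iqr_outliers` and the sample values are assumptions made up for the example, not figures from the ARM dataset.

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, boolean outlier mask) for the k x IQR rule."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return lower, upper, (x < lower) | (x > upper)


# Hypothetical click rates (%), invented purely for illustration.
sample_click_rates = [0.8, 1.1, 0.9, 1.3, 1.0, 1.2, 0.7, 6.5]
lower, upper, mask = iqr_outliers(sample_click_rates)
flagged = [v for v, is_out in zip(sample_click_rates, mask) if is_out]
print(f"fences: [{lower:.3f}, {upper:.3f}] | flagged: {flagged}")
```

Values outside the fences are candidates for inspection rather than automatic deletion; whether they are errors or genuine extreme campaigns still has to be judged case by case.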
5.2 Business Justification
ARM’s email campaign data is exported from an operational platform in a raw format that requires careful parsing before any business conclusions can be drawn. Rate columns contain percentage signs, count columns contain comma separators, and the campaign name field requires systematic classification into content categories. Without rigorous EDA, structural data issues — such as the dormant WhatsApp channel discovered here — could silently distort downstream results. This EDA directly maps to my quarterly responsibility of reviewing campaign statistics to brief leadership.
import warnings

import numpy as np
import pandas as pd

warnings.filterwarnings('ignore')

# Load the raw platform export (one row per campaign send).
df_raw = pd.read_csv("DATA_2026.csv")


def clean_pct(s):
    """Strip '%' signs and convert to numeric; unparseable values become NaN."""
    return pd.to_numeric(s.astype(str).str.replace('%', '', regex=False), errors='coerce')


def clean_num(s):
    """Strip thousands separators and convert to numeric; unparseable values become NaN."""
    return pd.to_numeric(s.astype(str).str.replace(',', '', regex=False), errors='coerce')


df = df_raw.rename(columns={
    'Campaign': 'campaign',
    'Date Sent': 'date_sent_raw',
    'Email bounce rate': 'bounce_rate_str',
    'Email click rate': 'click_rate_str',
    'Email clicks per unique opens (MPP excluded)': 'clicks_per_open_str',
    'Email deliveries': 'deliveries_str',
    'Email delivery rate': 'delivery_rate_str',
    'Email open rate (MPP excluded)': 'open_rate_str',
    'Email opened (MPP excluded)': 'opened_str',
    'Email unsubscribe rate': 'unsubscribe_rate_str',
    'Emails sent': 'sent_str',
    'WhatsApp deliveries': 'wa_deliveries',
    'WhatsApp total sends': 'wa_total_sends',
})

# Derive send-time features.
df['date_sent'] = pd.to_datetime(df['date_sent_raw'])
df['date'] = df['date_sent'].dt.date
df['year_month'] = df['date_sent'].dt.to_period('M')
df['day_of_week'] = df['date_sent'].dt.day_name()
df['hour_sent'] = df['date_sent'].dt.hour

# Parse rate columns (percent strings) and count columns (comma-separated strings).
df['bounce_rate'] = clean_pct(df['bounce_rate_str'])
df['click_rate'] = clean_pct(df['click_rate_str'])
df['clicks_per_open'] = clean_pct(df['clicks_per_open_str'])
df['delivery_rate'] = clean_pct(df['delivery_rate_str'])
df['open_rate'] = clean_pct(df['open_rate_str'])
df['unsubscribe_rate'] = clean_pct(df['unsubscribe_rate_str'])
df['deliveries'] = clean_num(df['deliveries_str'])
df['opened'] = clean_num(df['opened_str'])
df['emails_sent'] = clean_num(df['sent_str'])


def classify_campaign(name):
    """Map a campaign name to a content category; the first matching rule wins."""
    n = str(name).lower()
    if any(x in n for x in ['summary of news', 'news flash']):
        return 'News Summary'
    if any(x in n for x in ['price list', 'equities market snapshot']):
        return 'Equities Snapshot'
    if 'daily market update' in n:
        return 'Daily Market Update'
    if 'fixed income' in n:
        return 'Fixed Income Update'
    if any(x in n for x in ['weekly commentary', 'stock recommendation']):
        return 'Weekly Commentary'
    if any(x in n for x in ['earnings', 'financial result', 'earnings note', 'earnings flash']):
        return 'Earnings & Results'
    if any(x in n for x in ['cpi', 'gdp', 'mpc', 'monetary policy']):
        return 'Macro Report'
    if any(x in n for x in ['economic update', 'foreign trade', 'capital importation', 'ghana']):
        return 'Macro Report'
    if any(x in n for x in ['bond', 'treasury bill', 'fgn savings']):
        return 'Fixed Income Offer'
    if any(x in n for x in ['rights issue', 'public offer', 'commercial paper', 'dual investment']):
        return 'Capital Market Offer'
    if any(x in n for x in ['model equity portfolio', 'arm research']):
        return 'Research Report'
    if any(x in n for x in ['nugget', 'corporate action']):
        return 'Market Intelligence'
    if any(x in n for x in ['strategy', 'outlook', 'nsr', 'webinar']):
        return 'Strategy & Research'
    return 'Other'


df['campaign_type'] = df['campaign'].apply(classify_campaign)

# Headline dataset summary.
summary_tbl = pd.DataFrame({
    'Metric': ['Total records', 'Raw variables', 'Date range', 'Observation period'],
    'Value': [
        f"{len(df):,}",
        df_raw.shape[1],
        f"{df['date_sent'].min().date()} to {df['date_sent'].max().date()}",
        '24 months / 104 weeks',
    ],
})
print(summary_tbl.to_string(index=False))
Metric              Value
Total records       2,102
Raw variables       21
Date range          2024-05-14 to 2026-05-14
Observation period  24 months / 104 weeks
# --- Data quality issues ---
# Assumes tibble, knitr and kableExtra are installed, and that df is the
# cleaned campaign data frame in the R session.
library(tibble)
library(knitr)
library(kableExtra)

dq_tbl <- tibble(
  Issue = c("WhatsApp channel metrics - all zeros",
            "Email delivery rate - near-constant"),
  Finding = c(
    paste0("wa_total_sends all zero: ", all(df$wa_total_sends == 0),
           " | wa_deliveries all zero: ", all(df$wa_deliveries == 0)),
    paste0("Mean: ", round(mean(df$delivery_rate, na.rm = TRUE), 4),
           "% | SD: ", round(sd(df$delivery_rate, na.rm = TRUE), 6), "%")
  ),
  Resolution = c(
    "WhatsApp channel was not operational during May 2024–May 2026. All three WhatsApp columns excluded from analysis.",
    "Near-zero variance offers no explanatory power. Excluded as a predictor in all models."
  )
)

kable(dq_tbl, caption = "Table 6: Data Quality Issues Identified and Resolved") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = TRUE, font_size = 12) %>%
  column_spec(3, italic = TRUE)
Table 6: Data Quality Issues Identified and Resolved
Issue: WhatsApp channel metrics - all zeros
Finding: wa_total_sends all zero: TRUE | wa_deliveries all zero: TRUE
Resolution: WhatsApp channel was not operational during May 2024–May 2026. All three WhatsApp columns excluded from analysis.
Issue: Email delivery rate - near-constant
Finding: Mean: 98.7082% | SD: 6.820855%
Resolution: Near-zero variance offers no explanatory power. Excluded as a predictor in all models.
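A Python sketch of the same kind of screening is shown here for readers working in pandas. The helper name `flag_uninformative`, the coefficient-of-variation threshold of 0.01, and the demo values are all assumptions chosen for illustration; they are not the author's method and are not drawn from the ARM export.

```python
import pandas as pd


def flag_uninformative(frame, cv_threshold=0.01):
    """Return {column: reason} for numeric columns that are all zero or near-constant.

    Near-constant is defined here (an assumption) as a coefficient of
    variation, std / |mean|, below cv_threshold.
    """
    issues = {}
    for col in frame.select_dtypes("number").columns:
        s = frame[col].dropna()
        if (s == 0).all():
            issues[col] = "all zeros"
        elif s.mean() != 0 and s.std() / abs(s.mean()) < cv_threshold:
            issues[col] = "near-constant"
    return issues


# Made-up values for illustration only.
demo = pd.DataFrame({
    "wa_total_sends": [0, 0, 0, 0],
    "delivery_rate": [99.9, 99.8, 99.9, 99.9],
    "open_rate": [21.0, 34.5, 27.2, 18.9],
})
issues = flag_uninformative(demo)
print(issues)
```

The threshold is a judgment call: a stricter or looser cut-off changes which columns are dropped, so the chosen value should be reported alongside the result.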
Plain-language interpretation for management: The EDA confirms that ARM’s campaign data is clean: there are no missing values in any of the email metrics. Two structural issues were identified and resolved before analysis: (1) the WhatsApp distribution channel shows no activity during the 24-month period, representing an untapped communication opportunity; and (2) email delivery rates are high (mean 98.7%) and were judged too uniform to help explain performance, so the performance story sits on the engagement side: how many people open and click, not whether emails arrive. Click rate and unsubscribe rate contain some extreme high-value campaigns (shown as red dots), but these represent genuine high-performing or low-quality events, not data errors.
6. Data Visualisation (Technique 2)
6.1 Theory Recap
Data visualisation is the graphical representation of information to support pattern recognition and decision-making (Adi, 2026, Ch. 5). The grammar of graphics framework, formalised by Wilkinson and implemented in R’s ggplot2 (Wickham, 2016), decomposes every chart into composable layers: data, aesthetic mappings, geometric objects, statistical transformations, scales, and themes. Effective chart selection is driven by the data structure — time-series data calls for line charts, distributions call for histograms or box plots, relationships call for scatter plots, and comparisons across categories call for bar charts. Adi (2026, Ch. 5) emphasises that visualisation should tell a deliberate story, not merely display data: each chart must answer a specific business question.
6.2 Business Justification
ARM management receives quarterly briefings on campaign performance. The five plots below form a coherent narrative: how has engagement evolved over 24 months, which content types deliver the best return, and where are there patterns that justify editorial reallocation? These visualisations are directly usable in management presentations without further processing.
# Assumes dplyr, ggplot2 and patchwork are installed, and that df is the
# cleaned campaign data frame in the R session.
library(dplyr)
library(ggplot2)
library(patchwork)

# --- Monthly trend ---
monthly_trend <- df %>%
  group_by(year_month) %>%
  summarise(avg_open  = mean(open_rate, na.rm = TRUE),
            avg_click = mean(click_rate, na.rm = TRUE),
            .groups = "drop")

# Rescale click rate onto the open-rate axis; the secondary axis reverses the scaling.
scale_factor <- max(monthly_trend$avg_open, na.rm = TRUE) /
  max(monthly_trend$avg_click, na.rm = TRUE)

p1 <- ggplot(monthly_trend, aes(x = year_month)) +
  geom_line(aes(y = avg_open, colour = "Open Rate"), linewidth = 1.2) +
  geom_line(aes(y = avg_click * scale_factor, colour = "Click Rate (scaled)"),
            linewidth = 1.1, linetype = "dashed") +
  geom_point(aes(y = avg_open), colour = "#2C6FAC", size = 1.5) +
  scale_y_continuous(
    name = "Avg Open Rate (%)",
    sec.axis = sec_axis(~ . / scale_factor, name = "Avg Click Rate (%)")
  ) +
  scale_colour_manual(values = c("Open Rate" = "#2C6FAC",
                                 "Click Rate (scaled)" = "#E8741A")) +
  labs(title = "Plot 1: Monthly Email Engagement Trend",
       subtitle = "May 2024 – May 2026 | Dual axis: open rate (left), click rate (right)",
       x = "Month", colour = "Metric") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1))

# --- Avg click rate by campaign type ---
type_stats <- df %>%
  group_by(campaign_type) %>%
  summarise(avg_click = mean(click_rate, na.rm = TRUE),
            n = n(), .groups = "drop") %>%
  filter(n >= 10)

p2 <- ggplot(type_stats,
             aes(x = reorder(campaign_type, avg_click),
                 y = avg_click, fill = avg_click)) +
  geom_col() +
  geom_text(aes(label = paste0(round(avg_click, 2), "%")),
            hjust = -0.1, size = 3) +
  coord_flip() +
  scale_fill_gradient(low = "#AED6F1", high = "#1B4F72") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Plot 2: Average Click Rate by Campaign Type",
       subtitle = "Campaign types with ≥10 campaigns",
       x = NULL, y = "Average Click Rate (%)") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none")

# --- Distribution of open rate ---
mean_open <- mean(df$open_rate, na.rm = TRUE)

p3 <- ggplot(df, aes(x = open_rate)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 45, fill = "#2C6FAC", alpha = 0.65, colour = "white") +
  geom_density(colour = "#1B4F72", linewidth = 1.1) +
  geom_vline(xintercept = mean_open,
             colour = "red", linetype = "dashed", linewidth = 0.9) +
  annotate("text", x = mean_open + 3, y = 0.065,
           label = paste0("Mean: ", round(mean_open, 1), "%"),
           colour = "red", size = 3.5) +
  labs(title = "Plot 3: Distribution of Email Open Rate",
       subtitle = "All 2,102 campaigns",
       x = "Open Rate (%)", y = "Density") +
  theme_minimal(base_size = 11)

# --- Scatter: open vs click rate ---
top_types <- df %>%
  count(campaign_type) %>%
  filter(n >= 50) %>%
  pull(campaign_type)

p4 <- df %>%
  filter(campaign_type %in% top_types) %>%
  ggplot(aes(x = open_rate, y = click_rate, colour = campaign_type)) +
  geom_point(alpha = 0.35, size = 1.0) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.9) +
  labs(title = "Plot 4: Open Rate vs Click Rate by Campaign Type",
       subtitle = "Types with ≥50 campaigns | Regression lines overlaid",
       x = "Open Rate (%)", y = "Click Rate (%)", colour = "Type") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "bottom",
        legend.text = element_text(size = 8)) +
  guides(colour = guide_legend(nrow = 2,
                               override.aes = list(size = 3, alpha = 1)))

# --- Campaign volume ---
p5 <- df %>%
  count(campaign_type) %>%
  ggplot(aes(x = reorder(campaign_type, n), y = n, fill = campaign_type)) +
  geom_col() +
  geom_text(aes(label = n), hjust = -0.1, size = 3) +
  coord_flip() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.12))) +
  labs(title = "Plot 5: Campaign Volume by Type",
       subtitle = "Total sends, May 2024 – May 2026",
       x = NULL, y = "Number of Campaigns") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none")

# --- Assemble dashboard (patchwork layout) ---
(p1 / (p2 + p3) / (p4 + p5)) +
  plot_annotation(
    title = "Figure 2: ARM Investment Managers - Email Campaign Performance Dashboard",
    subtitle = "May 2024 – May 2026 | 2,102 campaigns across 13 content categories",
    theme = theme(plot.title = element_text(size = 14, face = "bold"),
                  plot.subtitle = element_text(size = 11))
  )
import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy.stats import gaussian_kde

# Monthly averages for the trend plot (df is the cleaned campaign DataFrame).
monthly = (df.groupby('year_month')
             .agg(avg_open=('open_rate', 'mean'), avg_click=('click_rate', 'mean'))
             .reset_index())
monthly['ym_str'] = monthly['year_month'].astype(str)

# Average click rate per campaign type, restricted to types with at least 10 campaigns.
type_stats = (df.groupby('campaign_type')
                .agg(avg_click=('click_rate', 'mean'), n=('click_rate', 'count'))
                .reset_index()
                .query('n >= 10')
                .sort_values('avg_click'))

fig = plt.figure(figsize=(14, 18))
gs = gridspec.GridSpec(3, 2, figure=fig, hspace=0.45, wspace=0.35)

# Plot 1: monthly trend (open rate on the left axis, click rate on the right)
ax1 = fig.add_subplot(gs[0, :])
ax1r = ax1.twinx()
ax1.plot(monthly['ym_str'], monthly['avg_open'], color='#2C6FAC', linewidth=2,
         label='Open Rate', marker='o', markersize=3)
ax1r.plot(monthly['ym_str'], monthly['avg_click'], color='#E8741A', linewidth=2,
          linestyle='--', label='Click Rate')
ax1.set_ylabel('Avg Open Rate (%)', color='#2C6FAC')
ax1r.set_ylabel('Avg Click Rate (%)', color='#E8741A')
ax1.set_title('Plot 1: Monthly Email Engagement Trend', fontweight='bold', fontsize=12)
ax1.tick_params(axis='x', rotation=45)
lines1, labs1 = ax1.get_legend_handles_labels()
lines2, labs2 = ax1r.get_legend_handles_labels()
ax1.legend(lines1 + lines2, labs1 + labs2, loc='upper right')

# Plot 2: click rate by type
ax2 = fig.add_subplot(gs[1, 0])
colors_bar = plt.cm.Blues(np.linspace(0.35, 0.85, len(type_stats)))
ax2.barh(type_stats['campaign_type'], type_stats['avg_click'], color=colors_bar)
for i, val in enumerate(type_stats['avg_click']):
    ax2.text(val + 0.005, i, f"{val:.2f}%", va='center', fontsize=8)
ax2.set_title('Plot 2: Avg Click Rate by Campaign Type', fontweight='bold', fontsize=11)
ax2.set_xlabel('Avg Click Rate (%)')

# Plot 3: distribution of open rate with a kernel density overlay
ax3 = fig.add_subplot(gs[1, 1])
open_data = df['open_rate'].dropna()
ax3.hist(open_data, bins=45, density=True, color='#2C6FAC', alpha=0.65, edgecolor='white')
kde = gaussian_kde(open_data)
x_range = np.linspace(open_data.min(), open_data.max(), 300)
ax3.plot(x_range, kde(x_range), color='#1B4F72', linewidth=2)
ax3.axvline(open_data.mean(), color='red', linestyle='--', linewidth=1.5,
            label=f"Mean: {open_data.mean():.1f}%")
ax3.set_title('Plot 3: Distribution of Open Rate', fontweight='bold', fontsize=11)
ax3.set_xlabel('Open Rate (%)')
ax3.set_ylabel('Density')
ax3.legend()

# Plot 4: scatter of open vs click rate, one fitted line per high-volume type
ax4 = fig.add_subplot(gs[2, 0])
top_types_py = (df['campaign_type'].value_counts()
                  .loc[lambda x: x >= 50].index.tolist())
palette = sns.color_palette("tab10", len(top_types_py))
for i, ct in enumerate(top_types_py):
    sub = df[df['campaign_type'] == ct][['open_rate', 'click_rate']].dropna()
    ax4.scatter(sub['open_rate'], sub['click_rate'], alpha=0.3, s=8,
                color=palette[i], label=ct)
    if len(sub) > 2:
        m, b = np.polyfit(sub['open_rate'], sub['click_rate'], 1)
        xs = np.array([sub['open_rate'].min(), sub['open_rate'].max()])
        ax4.plot(xs, m * xs + b, color=palette[i], linewidth=1.2)
ax4.set_title('Plot 4: Open Rate vs Click Rate', fontweight='bold', fontsize=11)
ax4.set_xlabel('Open Rate (%)')
ax4.set_ylabel('Click Rate (%)')
ax4.legend(fontsize=7)

# Plot 5: campaign volume by type
ax5 = fig.add_subplot(gs[2, 1])
cnt = df['campaign_type'].value_counts().sort_values()
ax5.barh(cnt.index, cnt.values, color=sns.color_palette("tab10", len(cnt)))
for i, v in enumerate(cnt.values):
    ax5.text(v + 3, i, str(v), va='center', fontsize=8)
ax5.set_title('Plot 5: Campaign Count by Type', fontweight='bold', fontsize=11)
ax5.set_xlabel('Number of Campaigns')

plt.suptitle("Figure 2 (Python): ARM Campaign Performance Dashboard\nMay 2024–May 2026",
             fontsize=13, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()
Plain-language interpretation for management: The five charts together tell one story. Plot 1 shows how open and click rates have moved month by month; this is the trend line management should track. Plot 2 is the most actionable: Earnings & Results and Weekly Commentary campaigns achieve the highest click rates, while News Summaries and Daily Market Updates, which we send most frequently (Plot 5), sit at the lower end. Plot 3 confirms our open rate distribution is bell-shaped around the 27% mean, which is healthy. Plot 4 shows that for each of the high-volume content types (50 or more campaigns), campaigns with higher open rates also tend to have higher click rates, which is consistent with subject-line quality being one of the most valuable levers we can pull.
7. Hypothesis Testing (Technique 3)
7.1 Theory Recap
Hypothesis testing provides a formal framework for determining whether observed differences in data are statistically significant or attributable to random variation (Adi, 2026, Ch. 6). A null hypothesis (H₀) posits no effect; an alternative hypothesis (H₁) posits a meaningful difference. The p-value measures the probability of observing results at least as extreme as the data, assuming H₀ is true. When p < α (typically 0.05), H₀ is rejected. Adi (2026, Ch. 6) covers parametric tests (t-test, ANOVA) and non-parametric alternatives. The Kruskal-Wallis test, used here, is the non-parametric equivalent of one-way ANOVA — it tests whether samples originate from the same distribution without requiring normality. Effect size (eta-squared, η²) complements p-values by quantifying practical significance: η² > 0.06 indicates a medium effect; η² > 0.14 indicates a large effect (Adi, 2026, Ch. 6).
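As a minimal sketch of how a Kruskal-Wallis test is run in practice, the example below applies `scipy.stats.kruskal` to three synthetic, right-skewed samples. The group names, distribution parameters, and sample sizes are assumptions invented for the illustration; they are not ARM data and do not reproduce the results reported in this study.

```python
import numpy as np
from scipy.stats import kruskal

# Synthetic click rates (%) for three hypothetical campaign types.
# Gamma draws are right-skewed, which is the situation where a rank-based
# test is preferred over one-way ANOVA.
rng = np.random.default_rng(42)
news_summary = rng.gamma(shape=2.0, scale=0.3, size=200)
equities_snapshot = rng.gamma(shape=2.0, scale=0.4, size=200)
earnings_results = rng.gamma(shape=2.0, scale=0.9, size=200)

# H statistic and p-value for H0: all groups come from the same distribution.
h_stat, p_value = kruskal(news_summary, equities_snapshot, earnings_results)

alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"H = {h_stat:.4f}, p = {p_value:.3g}, decision: {decision}")
```

A significant result only says that at least one group differs; identifying which pairs differ requires a post-hoc procedure with a multiple-comparison correction.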
7.2 Business Justification
Two hypotheses are tested. These directly address the operational question of whether ARM should differentiate its editorial strategy by campaign type. If engagement does not differ significantly across content types, a uniform strategy is justified. If it does differ significantly, differential investment is warranted.
7.3 Hypothesis 1 — Open Rate Differences Across Campaign Types
H₀: Median email open rates are equal across all campaign types
H₁: At least one campaign type has a significantly different median open rate
Significance level: α = 0.05
Test            H         p-value   eta-sq  Magnitude  Decision
Kruskal-Wallis  110.5716  4.61e-18  0.0472  Small      REJECT H0
7.4 Hypothesis 2 — Click Rate Differences Across Campaign Types
H₀: Median email click rates are equal across all campaign types
H₁: At least one campaign type has a significantly different median click rate
Significance level: α = 0.05
Test            H         p-value   eta-sq  Magnitude  Decision
Kruskal-Wallis  386.0668  3.36e-75  0.1791  Large      REJECT H0
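The reported effect sizes can be checked by hand. The sketch below assumes the commonly used estimator η² = (H − k + 1) / (n − k), with n = 2,102 campaigns and k = 13 campaign types (the number of content categories stated in the dashboard subtitle). Both the formula and the value of k are assumptions about how the figures were produced; under them, the arithmetic matches the reported values to four decimal places.

```python
def eta_squared_from_h(h_stat, k, n):
    """Eta-squared estimate for Kruskal-Wallis: (H - k + 1) / (n - k)."""
    return (h_stat - k + 1) / (n - k)


def magnitude(eta_sq):
    """Label effect size using the thresholds quoted in the theory recap."""
    if eta_sq > 0.14:
        return "Large"
    if eta_sq > 0.06:
        return "Medium"
    return "Small"


N_CAMPAIGNS = 2102   # reported sample size
N_TYPES = 13         # assumed number of campaign-type groups

reported_h = {"open rate": 110.5716, "click rate": 386.0668}
results = {}
for metric, h in reported_h.items():
    eta_sq = eta_squared_from_h(h, N_TYPES, N_CAMPAIGNS)
    results[metric] = (round(eta_sq, 4), magnitude(eta_sq))
    print(f"{metric}: eta-sq = {eta_sq:.4f} ({magnitude(eta_sq)})")
```

Read this way, campaign type accounts for a far larger share of the rank variation in click rate than in open rate.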
Plain-language interpretation for management: Both tests return p-values far below 0.001. If campaign type had no effect on engagement, differences as large as those observed would arise by chance less than 0.1% of the time. In plain terms: the data shows that what we send matters. The effect is large for click rate (η² ≈ 0.18) and small for open rate (η² ≈ 0.05), so content type separates campaigns far more on clicks than on opens. Earnings & Results and Weekly Commentary campaigns achieve higher open and click rates than News Summaries and Daily Market Updates. This pattern is stable enough to base editorial resource decisions on. The post-hoc Dunn test identifies exactly which pairs of campaign types differ most, giving us a prioritised list for investment.
8. Correlation Analysis (Technique 4)
8.1 Theory Recap
Correlation analysis measures the strength and direction of the linear relationship between two continuous variables, producing a coefficient ranging from −1 (perfect negative relationship) to +1 (perfect positive relationship), with 0 indicating no linear association (Adi, 2026, Ch. 8). Pearson’s r assumes bivariate normality and measures linear association; Spearman’s ρ measures monotonic association and is robust to outliers and non-normality. Adi (2026, Ch. 8) emphasises the critical distinction between correlation and causation: a high correlation indicates that two variables move together but does not identify which causes which, or whether a third variable drives both. Partial correlation can control for confounders. A correlation matrix and heatmap provide an efficient overview of all pairwise relationships simultaneously.
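The difference between Pearson's r and Spearman's ρ is easiest to see on a relationship that is monotonic but not linear. The sketch below uses synthetic values (an exponential curve chosen purely for illustration, not ARM data) and the `pearsonr` and `spearmanr` functions from `scipy.stats`.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic, strictly increasing but curved relationship (illustrative only).
open_rate = np.linspace(5, 60, 50)
click_rate = 0.01 * np.exp(open_rate / 10)

r, r_pvalue = pearsonr(open_rate, click_rate)        # linear association
rho, rho_pvalue = spearmanr(open_rate, click_rate)   # monotonic (rank) association

print(f"Pearson r    = {r:.3f}")
print(f"Spearman rho = {rho:.3f}")
```

Because the ranks of the two series agree perfectly, Spearman's ρ reaches its maximum while Pearson's r stays below it; a large gap between the two is a signal that the relationship is curved or affected by extreme values.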
8.2 Business Justification
Before building the regression model, understanding which variables co-move with click rate helps identify the highest-leverage operational interventions. If open rate and click rate are strongly correlated, subject-line investment pays double dividends. If bounce rate is correlated with list size, a proactive list-cleaning protocol becomes urgent. Correlation analysis surfaces these relationships without yet attributing causation.
Plain-language interpretation for management: Three relationships stand out from the correlation analysis. First, open rate and click rate move strongly together — campaigns that attract more readers also drive more action. This is the most important finding: improving how we write subject lines (which drives opens) is the single most impactful thing we can do to increase clicks across all content types. Second, the correlation between click rate and clicks per open confirms that well-crafted subject lines attract a more engaged audience — the first impression shapes the entire funnel. Third, the relationship between bounce rate and list size is a warning signal: as our subscriber list has grown, bounce rates have crept upward, suggesting we need periodic list cleaning to maintain engagement quality.
9. Linear Regression (Technique 5)
9.1 Theory Recap
Ordinary Least Squares (OLS) linear regression estimates the linear relationship between a continuous outcome variable and one or more predictors by finding the coefficients that minimise the sum of squared residuals (Adi, 2026, Ch. 9). The key output is a coefficient for each predictor: holding all other variables constant, a one-unit increase in predictor X is associated with a β-unit change in the outcome. This partial effect interpretation is what makes regression more powerful than correlation for business decisions. Model diagnostics — residuals versus fitted values, Q-Q plots, leverage statistics — assess whether OLS assumptions (linearity, independence, homoscedasticity, normality of residuals) are met. The Variance Inflation Factor (VIF) detects multicollinearity; VIF > 10 signals a problematic level. Model fit is assessed using R² (proportion of variance explained) and adjusted R² (penalised for additional predictors) (Adi, 2026, Ch. 9).
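The mechanics described above (least-squares coefficients, R², adjusted R², and VIF) can be sketched with NumPy alone. Everything in this example is an assumption for illustration: the data are simulated, the "true" coefficients are invented, and the helper names `ols_fit` and `vif` are hypothetical. It is not the model fitted in this study.

```python
import numpy as np


def ols_fit(X, y):
    """Fit OLS with an intercept; return (coefficients incl. intercept, R^2)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return beta, 1.0 - ss_res / ss_tot


def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 regresses column j on the other columns."""
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        _, r2_j = ols_fit(others, X[:, j])
        factors.append(1.0 / (1.0 - r2_j))
    return np.array(factors)


# Simulated campaign metrics with known (invented) coefficients.
rng = np.random.default_rng(0)
n = 500
open_rate = rng.normal(27, 6, n)
bounce_rate = rng.normal(1.0, 0.3, n)
click_rate = 0.2 + 0.05 * open_rate - 0.1 * bounce_rate + rng.normal(0, 0.5, n)

X = np.column_stack([open_rate, bounce_rate])
beta, r2 = ols_fit(X, click_rate)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - X.shape[1] - 1)
vifs = vif(X)

print("intercept, open_rate, bounce_rate coefficients:", np.round(beta, 3))
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
print("VIF:", np.round(vifs, 2))
```

In applied work a statistics library would normally be used so that standard errors and diagnostics come with the fit; the hand-rolled version is shown only to make each quantity in the recap concrete.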
9.2 Business Justification
Regression translates the correlations identified in Section 8 into quantitative, actionable predictions. The model answers: “If we improve our open rate on Earnings & Results campaigns by 5 percentage points, how much will click rate increase, holding everything else constant?” This return-on-investment framing is directly applicable to editorial budget decisions.
Plain-language interpretation for management: Three findings from the regression are directly actionable. First, for every 1 percentage-point improvement in open rate, click rate rises by approximately 0.044 percentage points — all else equal. This means better subject lines pay dividends not just in opens, but in the clicks that follow. Second, Earnings & Results campaigns generate approximately 0.36 more click-rate percentage points than a comparable News Summary, even after controlling for open rate — confirming the inherent value of deep-research content. Third, a higher bounce rate independently suppresses click rate — dirty lists hurt engagement in ways that go beyond just failed deliveries. The model explains approximately 21% of variance in click rate; the remaining 79% reflects factors not captured in this dataset, such as subject-line quality, personalisation, and market conditions on the day of send — all areas for future data collection.
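The management question posed in the business justification can be answered with simple arithmetic on the reported coefficients. The sketch below uses only the two figures quoted above (0.044 and 0.36); the function and variable names are illustrative, and the calculation assumes the linear relationship holds over the range considered.

```python
OPEN_RATE_COEF = 0.044       # reported: click-rate pp per 1 pp of open rate
EARNINGS_VS_NEWS_PP = 0.36   # reported: Earnings & Results premium over News Summary


def click_rate_change(open_rate_gain_pp, coef=OPEN_RATE_COEF):
    """Predicted change in click rate (pp) for a given open-rate gain (pp)."""
    return coef * open_rate_gain_pp


# A 5 percentage-point open-rate improvement, all else equal.
gain_from_5pp = click_rate_change(5)

# Open-rate gain a News Summary would need to match the Earnings & Results premium.
equivalent_open_gain = EARNINGS_VS_NEWS_PP / OPEN_RATE_COEF

print(f"Predicted click-rate change for +5 pp open rate: {gain_from_5pp:.2f} pp")
print(f"Open-rate gain equivalent to the content premium: {equivalent_open_gain:.1f} pp")
```

Under these figures, a 5-point open-rate gain is worth about 0.22 points of click rate, while the content-type premium corresponds to roughly 8 points of open rate. Because the model is associational, these are expected differences between campaigns rather than guaranteed effects of an intervention.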
10. Integrated Findings and Recommendation
The five analytical techniques collectively support one principal recommendation: ARM Investment Managers should adopt a tiered, data-driven editorial strategy that differentiates resource allocation by campaign type and prioritises subject-line optimisation across all sends.
The EDA (Section 5) established that the email delivery infrastructure is reliable, with a mean delivery rate of approximately 98.7% across 2,102 campaigns. It also surfaced two structural insights: the WhatsApp channel is entirely dormant (representing an untapped distribution opportunity), and delivery rate was judged too uniform to serve as a predictor. The strategic variable is entirely on the engagement side.
Data visualisation (Section 6) revealed a clear hierarchy of engagement: Earnings & Results and Weekly Commentary campaigns consistently outperform the high-frequency daily formats on both open and click metrics. This differential persists across the full 24-month observation window. Hypothesis testing (Section 7) confirmed at p-values far below 0.001 that these differences are statistically significant, with a small effect size for open rate (η² = 0.047) and a large effect size for click rate (η² = 0.179). The evidence standard required to justify editorial reallocation has been met, most clearly for click rate.
Correlation analysis (Section 8) identified open rate as the metric most strongly associated with click rate. Interventions that improve opens, such as better subject lines, send-time personalisation, and list segmentation, are therefore the most promising route to downstream gains in clicks. The regression model (Section 9) quantified these relationships simultaneously, controlling for content type, list health, and day of week. The adjusted R² of 0.2143 indicates the model captures meaningful but partial variance in click rate; unmeasured factors such as subject-line quality and market conditions likely account for much of the unexplained remainder.
Primary actionable recommendation: Implement a two-tier editorial strategy. Tier 1 (News Summaries, Equities Snapshots, Daily Market Updates) — automate production through templates, schedule systematically, and run A/B subject-line tests to recover engagement without increasing analyst hours. Tier 2 (Earnings & Results, Weekly Commentary, Macro Reports) — invest in depth, personalise for institutional versus retail subscriber segments, and use the regression-predicted click rate as a KPI. Activate the WhatsApp channel for Tier 2 content as a supplementary high-engagement distribution mechanism.
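One way the recommended A/B subject-line tests could be evaluated is a two-proportion z-test on open rates. The sketch below is a minimal illustration under stated assumptions: the counts are invented for demonstration and are not ARM data, the function name is hypothetical, and the test presumes subscribers were randomly assigned to the two subject lines within the same send.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(opens_a, sends_a, opens_b, sends_b):
    """Two-sided z-test for a difference in open rates between variants A and B."""
    p_a = opens_a / sends_a
    p_b = opens_b / sends_b
    # Pooled proportion under the null hypothesis of equal open rates.
    pooled = (opens_a + opens_b) / (sends_a + sends_b)
    se = sqrt(pooled * (1 - pooled) * (1 / sends_a + 1 / sends_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value


# Hypothetical counts for illustration only (not ARM data).
diff, z, p_value = two_proportion_z_test(
    opens_a=270, sends_a=1000,   # control subject line
    opens_b=310, sends_b=1000,   # test subject line
)
print(f"Open-rate difference: {diff:.3f}, z = {z:.2f}, p = {p_value:.3f}")
```

Because assignment is randomised, a significant difference here supports a causal reading of the subject-line effect, which the observational regression cannot provide.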
11. Limitations and Further Work
Observational data — no causal identification: All relationships identified are associational. The regression model cannot rule out confounders (e.g., broader market volatility may simultaneously drive higher campaign urgency and higher investor engagement). A randomised A/B test would enable causal claims.
Campaign-level aggregation: All metrics are at the campaign level, not the individual subscriber level. Subscriber-level data would enable survival analysis of unsubscribe behaviour and RFM segmentation of the investor base.
No business outcome linkage: Click rate is a proxy engagement metric. Linking click behaviour to meeting requests, product enquiries, or AUM inflows would allow ROI calculation per content category.
WhatsApp channel inactive: All WhatsApp metrics are zero. Activating this channel and tracking it would enable multi-channel attribution analysis.
Regression residuals and heteroscedasticity: With 2,100+ observations of a bounded rate outcome, some heteroscedasticity is likely. Heteroscedasticity-robust standard errors (HC3) or a beta regression model, which is better suited to rate outcomes bounded between 0 and 1, would be appropriate extensions (Adi, 2026, Ch. 9).
Rule-based campaign classification: The campaign_type variable was engineered through text pattern matching; approximately 62 campaigns fell into the “Other” category. A supervised text classifier trained on campaign names would improve categorisation accuracy (Adi, 2026, Ch. 27).
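To illustrate the HC3 correction mentioned in the heteroscedasticity limitation above, the following Python sketch computes classical and HC3 standard errors for the slope of a one-predictor regression. It is a simplified illustration, not the model from Section 9: the data are synthetic, the function name is hypothetical, and the closed-form leverage used here applies only to simple regression with an intercept. For the full multi-predictor model, a tested library routine would be the safer choice.

```python
from math import sqrt


def simple_ols_hc3(x, y):
    """Slope of y ~ x with classical and HC3 standard errors."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

    # Classical standard error: assumes constant residual variance.
    sigma2 = sum(e ** 2 for e in residuals) / (n - 2)
    se_classical = sqrt(sigma2 / sxx)

    # HC3: each squared residual is inflated by 1 / (1 - h_i)^2,
    # where h_i is the leverage of observation i.
    meat = 0.0
    for xi, e in zip(x, residuals):
        leverage = 1 / n + (xi - x_bar) ** 2 / sxx
        meat += (xi - x_bar) ** 2 * e ** 2 / (1 - leverage) ** 2
    se_hc3 = sqrt(meat) / sxx

    return slope, se_classical, se_hc3


# Synthetic data (illustration only): noise grows with the predictor,
# which mimics heteroscedastic residuals.
open_rate = list(range(10, 41))
click_rate = [
    0.5 + 0.044 * xi + (0.01 * xi if i % 2 == 0 else -0.01 * xi)
    for i, xi in enumerate(open_rate)
]

slope, se_classical, se_hc3 = simple_ols_hc3(open_rate, click_rate)
print(f"slope = {slope:.4f}")
print(f"classical SE = {se_classical:.5f}, HC3 SE = {se_hc3:.5f}")
```

When residual variance changes with the predictor, the two standard errors will generally differ, and inference based on the HC3 value is the more defensible of the two.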
References
Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/
Aregbesola, O. (2026). Email campaign performance metrics dataset [Dataset]. Collected from ARM Investment Managers Research Division, Lagos, Nigeria. Data available on request from the author.
Fox, J., & Weisberg, S. (2019). An R companion to applied regression (3rd ed.). Sage. https://www.john-fox.ca/Companion/
Kassambara, A. (2025). rstatix: Pipe-friendly framework for basic statistical tests (R package version 0.7.3). https://doi.org/10.32614/CRAN.package.rstatix
Komsta, L., & Novomestky, F. (2022). moments: Moments, cumulants, skewness, kurtosis and related tests (R package version 0.14.1). https://doi.org/10.32614/CRAN.package.moments
McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a
Pedersen, T. L. (2025). patchwork: The composer of plots (R package version 1.3.2). https://doi.org/10.32614/CRAN.package.patchwork
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/
Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.
Wei, T., & Simko, V. (2024). corrplot: Visualization of a correlation matrix (R package version 0.95). https://github.com/taiyun/corrplot
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Zhu, H. (2024). kableExtra: Construct complex table with ‘kable’ and pipe syntax (R package version 1.4.0). https://doi.org/10.32614/CRAN.package.kableExtra
Appendix: AI Usage Statement
Claude (Anthropic) was used to assist with code structure, R and Python syntax conventions, the Quarto panel-tabset formatting, and initial scaffolding of the document template. All analytical decisions — including the selection of the five techniques, the specification of both hypothesis tests, the regression model design (choice of predictors, reference categories, and diagnostic checks), the interpretation of all statistical outputs, the business framing of findings, and all recommendations — were made independently by the author. Every line of code was reviewed and understood before submission, and the author is prepared to explain and defend all analytical choices and outputs in the viva voce examination.
AI tools used: Claude (claude.ai) for code assistance and document templating
Independent judgements: Technique selection, hypothesis formulation, variable specification, business interpretation, and all written commentary