Digital Marketing Agency Scenario: A mid-sized digital marketing agency, “DataDriven Marketing Solutions,” manages online advertising campaigns for 50+ e-commerce clients. The agency is testing whether adding display creative elements (enhanced visuals, animations, call-to-action buttons) to standard text ads increases campaign performance for their clients.
Primary Stakeholders:
- Account Managers: Need to recommend optimal ad strategies to clients
- Client Business Owners: Deciding whether to allocate budget to premium display ads (cost +$2-5 per ad)
- Media Buyers: Executing campaigns and optimizing spend

Secondary Stakeholders:
- Creative Team: Designing display elements
- Data Analytics Team: Monitoring campaign performance
- Finance Team: ROI analysis and budget allocation
Current Situation: DataDriven Marketing Solutions runs both standard text ads and enhanced display ads for clients, but lacks rigorous evidence on whether the additional cost of display ads (+$2-5 per ad) generates sufficient return on investment through increased revenue or click-through rates.
Problem: By Q2 2024, the agency must determine whether display ads generate statistically significant and practically meaningful improvements in campaign metrics (revenue per day, click-through rate) to justify recommending them to clients and to inform pricing strategy for premium ad packages.
Specific Measurables:
- Minimum revenue increase to justify display ad cost: +$100/day per client
- Target click-through rate (CTR) improvement: +0.5 percentage points
- Timeline: Analysis based on 100 days of A/B test data
- Success metric: Statistical significance (p < 0.05) AND practical significance (effect size d ≥ 0.5)
Variables from Week 7 Lab:
From the marketing dataset:
- display (binary): 0 = standard ad, 1 = display ad
- revenue (continuous): Daily revenue generated (in dollars)
- clicks (count): Number of ad clicks
- impressions (count): Number of times ad was shown
- CTR (derived): Click-through rate = clicks / impressions

Analyses Used:
1. Two-sample t-test for revenue difference
2. Proportion test for CTR difference
3. Effect size calculation (Cohen’s d)
4. Power analysis for sample size determination
5. Bootstrapping for confidence intervals

Assumptions:
- Independence: Each day’s performance is independent (no carryover effects)
- Random assignment: Days are randomly assigned to display vs. non-display (questionable - see critique)
- Normal distribution: Revenue differences approximately normal (needs verification)
- Equal variance: Display and non-display groups have similar variance (testable)
- No confounders: No seasonal effects, competitor activity, or external events (strong assumption - see ethical concerns)
Success Criteria (Past Tense):
By the end of this analysis, the following will have been achieved:
Revenue Impact Quantified: The average daily revenue difference between display and non-display ads will have been determined with 95% confidence intervals, and practical significance (effect size) will have been assessed.
CTR Impact Identified: Whether display ads increase click-through rate by a meaningful margin (≥0.5 percentage points) will have been established with statistical evidence.
Cost-Benefit Analysis Completed: The minimum revenue increase needed to justify display ad costs ($100/day) will have been compared to the observed effect, enabling a clear recommendation.
Sample Size Validated: Whether the collected data (100 days) provides sufficient statistical power (≥80%) to detect meaningful effects will have been verified.
Recommendation Delivered: A data-driven recommendation (adopt display ads, reject them, or continue testing) will have been provided to account managers with clear rationale and confidence levels.
The Week 7 lab conducted:
1. Bootstrapped sampling distribution for revenue difference
2. Cohen’s d effect size calculation
3. Direct simulation test with p-value
4. Sample size calculation using pwrss package
5. Normal test of equal proportions for CTR

Key findings from original lab:
- Observed revenue difference: ~$93.60
- P-value: ~0.35 (not statistically significant at α = 0.05)
- Effect size: Not explicitly calculated in full
Problem: The lab never verifies normality, equal variance, or independence assumptions before conducting t-tests and normal approximations.
Why it matters: Violations can lead to incorrect p-values, confidence intervals, and conclusions. For business decisions involving budget allocation, this is unacceptable.
Improved Analysis 1: Comprehensive Assumption Testing
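# Load packages used throughout this analysis
library(tidyverse) # readr, dplyr, tidyr, ggplot2
library(knitr)     # kable()
library(effsize)   # cohen.d()
library(pwr)       # pwr.t2n.test()
library(scales)    # dollar()
# Load marketing data
url_ <- "https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/marketing/marketing.csv"
marketing <- read_delim(url_, delim = ",", show_col_types = FALSE)
# Create CTR variable
marketing <- marketing |>
  mutate(CTR = clicks / impressions)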
# Separate groups
revenue_display <- marketing |> filter(display == 1) |> pull(revenue)
revenue_no_display <- marketing |> filter(display == 0) |> pull(revenue)
cat("=== ASSUMPTION TESTS ===\n\n")
## === ASSUMPTION TESTS ===
# 1. Normality Test (Shapiro-Wilk)
cat("1. NORMALITY TEST (Shapiro-Wilk)\n")
## 1. NORMALITY TEST (Shapiro-Wilk)
cat(" Display group p-value:",
shapiro.test(revenue_display)$p.value, "\n")
## Display group p-value: 0.5180526
cat(" Non-display group p-value:",
shapiro.test(revenue_no_display)$p.value, "\n")
## Non-display group p-value: 0.4131173
cat(" Interpretation: p > 0.05 suggests normality assumption is met\n\n")
## Interpretation: p > 0.05 suggests normality assumption is met
# 2. Equal Variance Test (F-test; Levene's test from the car package is a more robust alternative)
cat("2. EQUAL VARIANCE TEST (F-test)\n")
## 2. EQUAL VARIANCE TEST (F-test)
var_test <- var.test(revenue_display, revenue_no_display)
cat(" F-statistic:", var_test$statistic, "\n")
## F-statistic: 1.214676
cat(" P-value:", var_test$p.value, "\n")
## P-value: 0.6759531
cat(" Interpretation: p > 0.05 suggests equal variance assumption is met\n\n")
## Interpretation: p > 0.05 suggests equal variance assumption is met
# 3. Visual Diagnostic: Q-Q Plots
par(mfrow = c(1, 2))
qqnorm(revenue_display, main = "Q-Q Plot: Display Ads")
qqline(revenue_display, col = "red")
qqnorm(revenue_no_display, main = "Q-Q Plot: Non-Display Ads")
qqline(revenue_no_display, col = "red")
par(mfrow = c(1, 1))
cat("\n3. VISUAL ASSESSMENT\n")
##
## 3. VISUAL ASSESSMENT
cat(" Q-Q plots should show points following the red line if normal\n")
## Q-Q plots should show points following the red line if normal
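Findings: Both Shapiro-Wilk p-values (0.518 for display, 0.413 for non-display) and the F-test p-value (0.676) exceed 0.05, so the tests give no evidence against normality or equal variance. With only 20 observations per group, these tests have limited power to detect violations, so the Q-Q plots should be read alongside them.

If assumptions are violated:
- Normality violated → Use non-parametric tests (Mann-Whitney U / Wilcoxon rank-sum)
- Equal variance violated → Use Welch’s t-test (unequal variances)
- Both violated → Bootstrap or permutation tests are more appropriate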
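The fallbacks listed above can be sketched in base R. This is a minimal illustration on simulated, deliberately skewed data, not a re-analysis of the marketing dataset: the vectors `rev_display` and `rev_no_display` and the helper `perm_test()` are hypothetical names introduced here.

```r
# Simulated stand-in data (hypothetical; NOT the lab's marketing.csv)
set.seed(42)
rev_display    <- rlnorm(20, meanlog = 6.0, sdlog = 0.5) # right-skewed revenue
rev_no_display <- rlnorm(20, meanlog = 5.8, sdlog = 0.3)

# Welch's t-test: does not assume equal variances
welch <- t.test(rev_display, rev_no_display, var.equal = FALSE)

# Wilcoxon rank-sum (Mann-Whitney U): does not assume normality
wilcox <- wilcox.test(rev_display, rev_no_display)

# Permutation test on the difference in means: reshuffles group labels
perm_test <- function(x, y, n_perm = 5000) {
  observed <- mean(x) - mean(y)
  pooled <- c(x, y)
  perm_diffs <- replicate(n_perm, {
    idx <- sample(length(pooled), length(x))
    mean(pooled[idx]) - mean(pooled[-idx])
  })
  # Two-sided p-value; the +1 terms keep p from being exactly 0
  (sum(abs(perm_diffs) >= abs(observed)) + 1) / (n_perm + 1)
}
p_perm <- perm_test(rev_display, rev_no_display)

cat("Welch t-test p-value:    ", welch$p.value, "\n")
cat("Wilcoxon rank-sum p-value:", wilcox$p.value, "\n")
cat("Permutation test p-value: ", p_perm, "\n")
```

If the three p-values broadly agree, the conclusion does not hinge on the parametric assumptions. If they diverge, the assumption-light results deserve more weight.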
Problem: The lab treats each day as independent, but marketing campaigns have temporal patterns (day of week effects, seasonality, trend over time). Revenue on consecutive days might be correlated.
Why it matters: Temporal dependence violates the independence assumption, inflating the Type I error rate. Real campaigns have:
- Weekend vs. weekday effects
- Learning/optimization over time
- External events (holidays, competitor campaigns)
Improved Analysis 2: Time Series Considerations
# Check for temporal patterns
cat("=== TEMPORAL DEPENDENCE CHECK ===\n\n")
## === TEMPORAL DEPENDENCE CHECK ===
# Add time index
marketing <- marketing |>
mutate(day_index = row_number())
# Test for autocorrelation in residuals
lm_revenue <- lm(revenue ~ display, data = marketing)
acf(residuals(lm_revenue), main = "Autocorrelation of Residuals")
# Test for trend over time
cat("Correlation between revenue and time:\n")
## Correlation between revenue and time:
cat(" Display group:",
cor.test(marketing |> filter(display == 1) |> pull(day_index),
marketing |> filter(display == 1) |> pull(revenue))$estimate, "\n")
## Display group: 0.5625481
cat(" Non-display group:",
cor.test(marketing |> filter(display == 0) |> pull(day_index),
marketing |> filter(display == 0) |> pull(revenue))$estimate, "\n\n")
## Non-display group: 0.5639309
# Visualize revenue over time
ggplot(marketing, aes(x = day_index, y = revenue, color = factor(display))) +
geom_line(alpha = 0.6) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = TRUE) +
scale_color_manual(values = c("0" = "#e74c3c", "1" = "#27ae60"),
labels = c("Non-Display", "Display")) +
labs(title = "Revenue Over Time by Ad Type",
subtitle = "Checking for temporal trends and patterns",
x = "Day Index", y = "Revenue ($)", color = "Ad Type") +
theme_minimal()
cat("If significant trends exist, use time series methods or mixed models\n")
## If significant trends exist, use time series methods or mixed models
Recommendation: If autocorrelation is detected, use:
- Mixed-effects model with time random effect
- Time series regression (ARIMA with intervention variable)
- Block randomization analysis (compare display vs. non-display within weekly blocks)
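One way to act on these recommendations without extra packages is to adjust for time directly in a regression. The sketch below uses simulated data with a built-in upward trend and compares a naive model with a trend-adjusted model and a weekly-block model. The data frame `sim`, the +$50 display effect, and the trend slope are all assumptions made for illustration.

```r
# Simulated campaign with a linear upward trend (hypothetical data)
set.seed(1)
n_days <- 42
sim <- data.frame(day_index = seq_len(n_days),
                  display = rep(c(0, 1), length.out = n_days))
sim$week <- factor(ceiling(sim$day_index / 7))
sim$revenue <- 200 + 3 * sim$day_index + 50 * sim$display +
  rnorm(n_days, sd = 40)

# Naive comparison ignores time entirely
naive <- lm(revenue ~ display, data = sim)
# Adjust for a linear time trend
trend_adj <- lm(revenue ~ display + day_index, data = sim)
# Compare arms within weekly blocks (week as a fixed effect)
blocked <- lm(revenue ~ display + week, data = sim)

effects <- sapply(
  list(naive = naive, trend_adjusted = trend_adj, weekly_blocks = blocked),
  function(m) coef(m)[["display"]]
)
print(round(effects, 1))

# Ljung-Box test: a formal complement to the ACF plot of residuals
lb <- Box.test(residuals(trend_adj), lag = 7, type = "Ljung-Box")
cat("Ljung-Box p-value (lag 7):", lb$p.value, "\n")
```

A fixed effect for week is a simpler stand-in for the mixed-effects model named above. It removes week-to-week level shifts at the cost of a few degrees of freedom.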
Problem: The lab tests multiple hypotheses (revenue AND CTR) without adjusting for multiple comparisons. This inflates family-wise error rate.
Why it matters: Testing 2 hypotheses at α = 0.05 each gives family-wise error rate of ~0.10 (10% chance of at least one false positive).
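The ~0.10 figure follows from a short calculation, assuming the two tests are independent. That is an assumption: revenue and CTR are probably related, in which case the true rate would typically fall between 0.05 and this value.

$$
\text{FWER} = 1 - (1 - \alpha)^m = 1 - (0.95)^2 = 0.0975
$$

The Bonferroni correction tests each of the $m = 2$ hypotheses at $\alpha / m = 0.05 / 2 = 0.025$, which bounds the family-wise rate at 0.05 regardless of dependence.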
Improved Analysis 3: Multiple Testing Correction
cat("=== MULTIPLE TESTING CORRECTION ===\n\n")
## === MULTIPLE TESTING CORRECTION ===
# Calculate both p-values
# Revenue test
t_test_revenue <- t.test(revenue ~ display, data = marketing)
p_revenue <- t_test_revenue$p.value
# CTR test
ctr_display <- marketing |> filter(display == 1) |> pull(CTR)
ctr_no_display <- marketing |> filter(display == 0) |> pull(CTR)
t_test_ctr <- t.test(ctr_display, ctr_no_display)
p_ctr <- t_test_ctr$p.value
# Bonferroni correction (most conservative)
p_values <- c(revenue = p_revenue, CTR = p_ctr)
adjusted_p <- p.adjust(p_values, method = "bonferroni")
# Holm-Bonferroni (less conservative, more powerful)
adjusted_p_holm <- p.adjust(p_values, method = "holm")
# Results table
results_df <- data.frame(
Test = c("Revenue Difference", "CTR Difference"),
Raw_P_Value = p_values,
Bonferroni_Adjusted = adjusted_p,
Holm_Adjusted = adjusted_p_holm,
Significant_at_0.05 = adjusted_p_holm < 0.05
)
kable(results_df,
col.names = c("Test", "Raw p-value", "Bonferroni", "Holm", "Significant?"),
digits = 4,
caption = "Multiple Testing Correction Results")
| | Test | Raw p-value | Bonferroni | Holm | Significant? |
|---|---|---|---|---|---|
| revenue | Revenue Difference | 0.0485 | 0.0969 | 0.0485 | TRUE |
| CTR | CTR Difference | 0.0000 | 0.0000 | 0.0000 | TRUE |
cat("\nInterpretation:\n")
##
## Interpretation:
cat("- Use adjusted p-values when testing multiple hypotheses\n")
## - Use adjusted p-values when testing multiple hypotheses
cat("- Holm method is less conservative than Bonferroni\n")
## - Holm method is less conservative than Bonferroni
cat("- Only declare significance if adjusted p < 0.05\n")
## - Only declare significance if adjusted p < 0.05
Recommendation: Always adjust for multiple comparisons when testing multiple outcomes. Holm-Bonferroni strikes a good balance between power and false positive control.
Problem: The original lab reports a p-value but does not adequately emphasize practical significance or provide a clear decision framework.
cat("=== PRACTICAL SIGNIFICANCE ANALYSIS ===\n\n")
## === PRACTICAL SIGNIFICANCE ANALYSIS ===
# Calculate effect sizes with confidence intervals
revenue_effect <- cohen.d(revenue_display, revenue_no_display,
hedges.correction = TRUE)
ctr_effect <- cohen.d(ctr_display, ctr_no_display,
hedges.correction = TRUE)
cat("Revenue Effect Size:\n")
## Revenue Effect Size:
print(revenue_effect)
##
## Hedges's g
##
## g estimate: 0.6321447 (medium)
## 95 percent confidence interval:
## lower upper
## -0.01078621 1.27507570
cat("\n")
cat("CTR Effect Size:\n")
## CTR Effect Size:
print(ctr_effect)
##
## Hedges's g
##
## g estimate: 6.833318 (large)
## 95 percent confidence interval:
## lower upper
## 5.192707 8.473929
cat("\n")
# Minimum practically meaningful difference (business threshold)
min_revenue_diff <- 100 # $100/day to justify cost
observed_revenue_diff <- mean(revenue_display) - mean(revenue_no_display)
# Confidence interval for the difference.
# t.test(revenue ~ display) reports group 0 minus group 1 (non-display minus
# display), so negate and reverse the interval to express it as display minus
# non-display, matching observed_revenue_diff.
revenue_ci <- -rev(t_test_revenue$conf.int)
cat("=== BUSINESS DECISION FRAMEWORK ===\n\n")
## === BUSINESS DECISION FRAMEWORK ===
cat("Minimum revenue increase needed:", dollar(min_revenue_diff), "/day\n")
## Minimum revenue increase needed: $100 /day
cat("Observed revenue increase:", dollar(observed_revenue_diff), "/day\n")
## Observed revenue increase: $93.59 /day
cat("95% CI for difference (display - non-display):", dollar(revenue_ci[1]), "to",
dollar(revenue_ci[2]), "\n\n")
## 95% CI for difference (display - non-display): $0.67 to $186.51
if(revenue_ci[1] > min_revenue_diff) {
cat("✓ RECOMMENDATION: ADOPT display ads\n")
cat(" Lower bound of CI exceeds minimum threshold\n")
} else if(revenue_ci[2] < 0) {
cat("✗ RECOMMENDATION: REJECT display ads\n")
cat(" Display ads likely decrease revenue\n")
} else {
cat("⚠ RECOMMENDATION: CONTINUE TESTING\n")
cat(" Evidence inconclusive - CI does not clear the $100/day threshold\n")
}
## ⚠ RECOMMENDATION: CONTINUE TESTING
## Evidence inconclusive - CI does not clear the $100/day threshold
# Effect size interpretation
cat("\n=== EFFECT SIZE INTERPRETATION ===\n")
##
## === EFFECT SIZE INTERPRETATION ===
cat("Hedges' g for revenue:", round(revenue_effect$estimate, 3), "\n")
## Hedges' g for revenue: 0.632
cat(" Interpretation: ")
## Interpretation:
if(abs(revenue_effect$estimate) < 0.2) {
cat("Negligible effect\n")
} else if(abs(revenue_effect$estimate) < 0.5) {
cat("Small effect\n")
} else if(abs(revenue_effect$estimate) < 0.8) {
cat("Medium effect\n")
} else {
cat("Large effect\n")
}
## Medium effect
Key Improvement: Combines statistical significance with practical significance, providing a clear business recommendation based on where the confidence interval falls relative to the minimum practically meaningful effect ($100/day).
Problem: The original lab calculates the required sample size but does not assess achieved power with the current data or conduct a sensitivity analysis.
cat("=== POST-HOC POWER ANALYSIS ===\n\n")
## === POST-HOC POWER ANALYSIS ===
# Achieved power with current sample
n1 <- sum(marketing$display == 1)
n2 <- sum(marketing$display == 0)
observed_d <- revenue_effect$estimate
# Calculate achieved power
power_achieved <- pwr.t2n.test(n1 = n1, n2 = n2,
d = observed_d,
sig.level = 0.05,
alternative = "two.sided")
cat("Current sample sizes: n1 =", n1, ", n2 =", n2, "\n")
## Current sample sizes: n1 = 20 , n2 = 20
cat("Observed effect size (d):", round(observed_d, 3), "\n")
## Observed effect size (d): 0.632
cat("Achieved statistical power:", round(power_achieved$power, 3), "\n\n")
## Achieved statistical power: 0.495
if(power_achieved$power < 0.80) {
cat("⚠ WARNING: Power < 0.80. Risk of Type II error is high.\n")
cat(" May need more data to detect this effect reliably.\n\n")
}
## ⚠ WARNING: Power < 0.80. Risk of Type II error is high.
## May need more data to detect this effect reliably.
# Sensitivity analysis: What effect size can we detect?
power_curve <- pwr.t2n.test(n1 = n1, n2 = n2,
power = 0.80,
sig.level = 0.05,
alternative = "two.sided")
cat("=== SENSITIVITY ANALYSIS ===\n\n")
## === SENSITIVITY ANALYSIS ===
cat("With current sample sizes and 80% power:\n")
## With current sample sizes and 80% power:
cat(" Minimum detectable effect size (d):", round(power_curve$d, 3), "\n")
## Minimum detectable effect size (d): 0.909
cat(" In practical terms: Can detect differences of\n")
## In practical terms: Can detect differences of
cat(" ", round(power_curve$d * sd(marketing$revenue)), "dollars or more\n\n")
## 137 dollars or more
# Power curve visualization
d_range <- seq(0, 1, 0.05)
power_values <- sapply(d_range, function(d) {
pwr.t2n.test(n1 = n1, n2 = n2, d = d, sig.level = 0.05,
alternative = "two.sided")$power
})
ggplot(data.frame(effect_size = d_range, power = power_values),
aes(x = effect_size, y = power)) +
geom_line(color = "#3498db", linewidth = 1.2) + # linewidth replaces size for lines in ggplot2 >= 3.4
geom_hline(yintercept = 0.80, linetype = "dashed", color = "red") +
geom_vline(xintercept = observed_d, linetype = "dashed", color = "green") +
annotate("text", x = 0.5, y = 0.85,
label = "80% Power Threshold", color = "red") +
annotate("text", x = observed_d + 0.1, y = 0.5,
label = paste0("Observed d = ", round(observed_d, 2)),
color = "green") +
labs(title = "Statistical Power Curve",
subtitle = paste0("Current n1 = ", n1, ", n2 = ", n2),
x = "Effect Size (Cohen's d)",
y = "Statistical Power") +
theme_minimal()
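Key Improvement: Assesses whether the current study has adequate power to detect the observed effect, and shows the minimum detectable effect size given sample size constraints. Because power computed from the observed effect is a direct function of the p-value, the sensitivity analysis is the more informative of the two results.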
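A natural follow-up is how many days per group a confirmatory study would need. The sketch below uses base R's `power.t.test()` and takes the observed standardized effect (0.632) as the planning value. That choice is an assumption: the observed effect is itself noisy, so a more conservative planning value of d = 0.5 is shown for comparison.

```r
# Required n per group for 80% power, two-sided alpha = 0.05.
# With sd = 1, delta is expressed in standard-deviation units.
n_observed <- power.t.test(delta = 0.632, sd = 1,
                           sig.level = 0.05, power = 0.80)$n
n_conservative <- power.t.test(delta = 0.5, sd = 1,
                               sig.level = 0.05, power = 0.80)$n

cat("Days per group at d = 0.632:", ceiling(n_observed), "\n")
cat("Days per group at d = 0.5:  ", ceiling(n_conservative), "\n")
```

Planning around the smaller effect costs more days but protects against an optimistic pilot estimate.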
Beyond statistical validity, this A/B test raises important ethical questions about fairness, transparency, and potential harms. Drawing from Week 5 lessons on ethics and epistemology, we examine three critical concerns.
Epistemic Problem: The lab assumes days are randomly assigned to display vs. non-display conditions, but does not verify randomization. If assignment was based on convenience, time of month, or experimenter choice, hidden confounders could bias results.
Potential Biases:
- Temporal confounding: If display ads are tested during high-traffic periods (e.g., beginning of month when people have disposable income), the effect attributed to display ads might actually be due to timing
- Selection on observables: Clients with larger budgets might preferentially receive display ads
- Awareness effect: If account managers know which clients have display ads, they might provide extra attention or optimization
Clients receiving non-display ads: If display ads are systematically given to “better” time periods or clients, standard ad clients are disadvantaged—they don’t just miss out on display features, they’re also relegated to worse experimental conditions.
Business impact: If confounding is present, the agency might:
- Overestimate display ad effectiveness → overcharge clients
- Underestimate effectiveness → miss revenue opportunities
Principle of Justice (Rawls): Random assignment ensures no group is systematically advantaged or disadvantaged. Non-random assignment violates fairness.
Transparency Principle: Clients should be informed if they’re in an experiment. Were they told some received display ads and others didn’t? Did they consent?
cat("=== RANDOMIZATION CHECK ===\n\n")
## === RANDOMIZATION CHECK ===
# Check if assignment is balanced over time
marketing <- marketing |>
mutate(
week = ceiling(day_index / 7),
time_period = case_when(
day_index <= 33 ~ "Early",
day_index <= 66 ~ "Middle",
TRUE ~ "Late"
)
)
# Balance by time period
balance_table <- marketing |>
group_by(time_period, display) |>
summarise(n = n(), .groups = 'drop') |>
pivot_wider(names_from = display, values_from = n,
names_prefix = "display_")
kable(balance_table,
col.names = c("Time Period", "Non-Display Days", "Display Days"),
caption = "Temporal Balance Check")
| Time Period | Non-Display Days | Display Days |
|---|---|---|
| Early | 20 | 13 |
| Middle | NA | 7 |
# Chi-square test for independence
balance_test <- chisq.test(table(marketing$time_period, marketing$display))
cat("\nChi-square test for temporal balance:\n")
##
## Chi-square test for temporal balance:
cat(" p-value:", balance_test$p.value, "\n")
## p-value: 0.01253369
if(balance_test$p.value < 0.05) {
cat(" ⚠ WARNING: Assignment not balanced across time periods\n")
cat(" This suggests potential confounding\n")
} else {
cat(" ✓ Assignment appears balanced across time\n")
}
## ⚠ WARNING: Assignment not balanced across time periods
## This suggests potential confounding
Recommendations:
1. Retrospectively check for confounders (done above)
2. Future experiments: Use proper randomization (coin flip, random number generator)
3. Stratified randomization: Balance across known confounders (day of week, time of month)
4. Report randomization method in all analyses
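Recommendations 2 and 3 can be combined in a few lines of base R. The sketch below randomizes ad type within each week so that neither arm can cluster in one part of the calendar. The six-week horizon, the seed, and the names `assign_week()` and `schedule` are illustrative assumptions.

```r
set.seed(2024) # record the seed so the assignment is reproducible
n_weeks <- 6

assign_week <- function(week) {
  # 7 days cannot split evenly, so randomly choose which arm gets the extra day
  n_display <- sample(c(3, 4), 1)
  data.frame(
    week = week,
    day_of_week = 1:7,
    # Shuffle the arm labels within the week
    display = sample(rep(c(1, 0), times = c(n_display, 7 - n_display)))
  )
}

schedule <- do.call(rbind, lapply(seq_len(n_weeks), assign_week))

# Balance check: every week should have 3 or 4 days in each arm
print(table(week = schedule$week, display = schedule$display))
```

This design balances the arms over calendar time but not across days of the week. Balancing both would need a crossover or Latin-square style schedule.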
Epistemic Problem: The lab tests two outcomes (revenue and CTR) without pre-registration. This opens the door to:
- HARKing: Hypothesizing After Results are Known
- P-hacking: Testing multiple outcomes until one shows significance
- Selective reporting: Only reporting the “significant” result

Example scenario:
1. Agency runs test, finds display ads don’t increase revenue (p = 0.35)
2. Checks CTR instead and finds marginal significance (p = 0.048)
3. Reports to clients: “Display ads significantly improve engagement!” (true but misleading)
4. Omits revenue findings because “revenue wasn’t the primary outcome” (post-hoc justification)
Clients: Make budget decisions based on misleading “significant” results. Might overpay for display ads that don’t actually improve bottom-line revenue.
Competitors: Agencies that conduct rigorous pre-registered studies are disadvantaged if cherry-picking is industry norm.
Public trust: Erodes confidence in A/B testing and data science generally if “significant results” are achieved through methodological flexibility.
Scientific Integrity: Results should be reported regardless of statistical significance. Negative results are information too.
Transparency: Pre-registration of hypotheses and analysis plan prevents post-hoc storytelling.
cat("=== PRE-REGISTRATION TEMPLATE ===\n\n")
## === PRE-REGISTRATION TEMPLATE ===
cat("Before collecting data, specify:\n\n")
## Before collecting data, specify:
cat("1. PRIMARY OUTCOME: Revenue per day\n")
## 1. PRIMARY OUTCOME: Revenue per day
cat(" - Hypothesis: Display ads increase revenue by ≥$100/day\n")
## - Hypothesis: Display ads increase revenue by ≥$100/day
cat(" - Statistical test: Two-sample t-test (or Welch if variances unequal)\n")
## - Statistical test: Two-sample t-test (or Welch if variances unequal)
cat(" - Significance threshold: α = 0.05 (Bonferroni-adjusted if multiple tests)\n")
## - Significance threshold: α = 0.05 (Bonferroni-adjusted if multiple tests)
cat(" - Minimum sample size: n = 34 per group (based on power analysis)\n\n")
## - Minimum sample size: n = 34 per group (based on power analysis)
cat("2. SECONDARY OUTCOME: Click-through rate\n")
## 2. SECONDARY OUTCOME: Click-through rate
cat(" - Hypothesis: Display ads increase CTR by ≥0.5 percentage points\n")
## - Hypothesis: Display ads increase CTR by ≥0.5 percentage points
cat(" - Statistical test: Two-proportion z-test\n")
## - Statistical test: Two-proportion z-test
cat(" - Significance threshold: α = 0.025 (Bonferroni adjustment for 2 tests)\n\n")
## - Significance threshold: α = 0.025 (Bonferroni adjustment for 2 tests)
cat("3. ANALYSIS PLAN:\n")
## 3. ANALYSIS PLAN:
cat(" - Check assumptions (normality, equal variance)\n")
## - Check assumptions (normality, equal variance)
cat(" - If violated: Use non-parametric tests\n")
## - If violated: Use non-parametric tests
cat(" - Report effect sizes (Cohen's d) regardless of p-value\n")
## - Report effect sizes (Cohen's d) regardless of p-value
cat(" - Report confidence intervals\n")
## - Report confidence intervals
cat(" - Adjust for multiple comparisons\n\n")
## - Adjust for multiple comparisons
cat("4. STOPPING RULE:\n")
## 4. STOPPING RULE:
cat(" - Collect data for exactly 100 days (50 display, 50 non-display)\n")
## - Collect data for exactly 100 days (50 display, 50 non-display)
cat(" - NO interim analyses (to prevent p-value peeking)\n")
## - NO interim analyses (to prevent p-value peeking)
cat(" - Calculate p-value only once at end\n\n")
## - Calculate p-value only once at end
cat("By pre-registering, we commit to this plan BEFORE seeing results.\n")
## By pre-registering, we commit to this plan BEFORE seeing results.
Recommendations:
1. Pre-register analysis plans on Open Science Framework or AsPredicted.org
2. Report all outcomes tested, not just significant ones
3. Distinguish pre-planned vs. exploratory analyses in reports
4. Replicate significant findings in a new sample before making business recommendations
Epistemic Problem: The lab emphasizes Type I error (false positive) but underestimates risk of Type II error (false negative). With achieved power potentially <80%, there’s high risk of missing a real effect.
Business scenario:
- True effect: Display ads increase revenue by $150/day
- Study result: p = 0.08 (not significant at α = 0.05)
- Decision: Reject display ads, continue with non-display
- Harm: Agency and clients miss out on $150/day × 365 days × 50 clients ≈ $2.7M annual opportunity cost
Clients: Lose potential revenue because study lacked power to detect moderate effect
Small businesses: Large agencies can afford longer experiments to achieve adequate power; small agencies with fewer clients have structural disadvantage
Agency: Loses competitive edge if competitors detect effect and capitalize while agency concludes “no effect”
Harm Minimization: Both Type I and Type II errors cause harm. Balancing them requires considering asymmetric costs.
Economic Justice: If display ads truly help small businesses compete, failing to detect this effect due to underpowered study perpetuates disadvantage.
cat("=== COST-BENEFIT DECISION MATRIX ===\n\n")
## === COST-BENEFIT DECISION MATRIX ===
# Define costs
display_ad_cost <- 3 # $3 per display ad
avg_ads_per_day <- 100
daily_cost_increase <- display_ad_cost * avg_ads_per_day
cat("Cost of switching to display ads:", dollar(daily_cost_increase), "/day per client\n\n")
## Cost of switching to display ads: $300 /day per client
# Decision matrix
decisions <- data.frame(
Decision = c("Adopt Display", "Adopt Display",
"Reject Display", "Reject Display"),
True_State = c("Display works (+$150/day)", "Display doesn't work (no effect)",
"Display works (+$150/day)", "Display doesn't work (no effect)"),
Outcome = c("Correct - Net gain $150 - $300 = -$150/day (LOSS!)",
"Type I Error - Lose $300/day on unnecessary display",
"Type II Error - Lose $150/day opportunity",
"Correct - No change"),
Annual_Impact = c(dollar(-150 * 365),
dollar(-300 * 365),
dollar(-150 * 365),
"$0")
)
kable(decisions,
caption = "Decision Matrix: Economic Consequences of Errors")
| Decision | True_State | Outcome | Annual_Impact |
|---|---|---|---|
| Adopt Display | Display works (+$150/day) | Correct - Net gain $150 - $300 = -$150/day (LOSS!) | -$54,750 |
| Adopt Display | Display doesn’t work (no effect) | Type I Error - Lose $300/day on unnecessary display | -$109,500 |
| Reject Display | Display works (+$150/day) | Type II Error - Lose $150/day opportunity | -$54,750 |
| Reject Display | Display doesn’t work (no effect) | Correct - No change | $0 |
cat("\n\n=== KEY INSIGHT ===\n")
##
##
## === KEY INSIGHT ===
cat("If display ad cost ($300/day) exceeds benefit ($150/day):\n")
## If display ad cost ($300/day) exceeds benefit ($150/day):
cat(" → Type I error (false positive) is MORE costly than Type II error\n")
## → Type I error (false positive) is MORE costly than Type II error
cat(" → Should require stronger evidence (lower α) before adopting\n\n")
## → Should require stronger evidence (lower α) before adopting
cat("If benefit ($150/day) exceeds cost ($50/day):\n")
## If benefit ($150/day) exceeds cost ($50/day):
cat(" → Type II error (false negative) is MORE costly than Type I error\n")
## → Type II error (false negative) is MORE costly than Type I error
cat(" → Should tolerate higher α or prioritize power\n\n")
## → Should tolerate higher α or prioritize power
cat("Decision depends on cost structure, not arbitrary α = 0.05 convention!\n")
## Decision depends on cost structure, not arbitrary α = 0.05 convention!
Recommendations:
1. Asymmetric testing: If Type II error is costlier, use α = 0.10 instead of 0.05 to increase power
2. Bayesian decision theory: Incorporate cost of errors directly into decision threshold
3. Sequential testing: Use methods that allow early stopping for benefit OR harm (e.g., Bayesian adaptive designs)
4. Pilot + Confirmation: Run small pilot (n=20) to estimate effect size, then power main study appropriately
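Recommendation 2 can be made concrete with a small expected-value calculation. The sketch below is an assumption-laden illustration, not a full Bayesian analysis. `p_works` is a hypothetical input: whatever probability the agency assigns to display ads delivering the +$150/day benefit. The two cost levels ($300 and $50 per day) are the ones used in the decision matrix and key insight above.

```r
# Expected daily net gain from adopting display ads:
# pay `cost` for certain, earn `benefit` only if the ads truly work
expected_gain <- function(p_works, benefit, cost) {
  p_works * benefit - cost
}

# Adopting has positive expected gain only when p_works > cost / benefit
adoption_threshold <- function(benefit, cost) {
  cost / benefit
}

adoption_threshold(benefit = 150, cost = 300) # 2: above 1, so never worth adopting
adoption_threshold(benefit = 150, cost = 50)  # 1/3: adopt once p_works exceeds ~0.33
expected_gain(p_works = 0.5, benefit = 150, cost = 50) # 25 dollars/day
```

Framed this way, the evidence bar comes from the cost-to-benefit ratio rather than from a fixed α.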
Issue: Were clients told they’re in an experiment? Did they consent to potentially receiving inferior ads?
Harm: Clients might feel deceived if they later learn they were unknowingly in control group.
Mitigation:
- Disclose A/B testing in client contracts
- Offer post-experiment compensation if one group significantly outperformed

Issue: Results based on one agency’s clients may not generalize to:
- Different industries
- Different geographic regions
- Different budget levels
Harm: Overgeneralization could lead other agencies to adopt/reject display ads inappropriately.
Mitigation:
- Clearly state limitations in reports
- Encourage replication across contexts
- Report client characteristics (industry distribution, budget ranges)

Critical fixes:
1. ✓ Test assumptions before parametric tests
2. ✓ Address temporal dependence
3. ✓ Adjust for multiple comparisons
4. ✓ Report practical significance with confidence intervals
5. ✓ Conduct post-hoc power analysis
Result: More robust conclusions that account for real-world complexity.
Critical practices:
1. ✓ Verify randomization and check for confounders
2. ✓ Pre-register analysis plans
3. ✓ Report all outcomes (not just significant ones)
4. ✓ Consider asymmetric costs of Type I vs. Type II errors
5. ✓ Disclose limitations and context-dependence
Result: Transparent, reproducible research that builds trust with clients and stakeholders.
Based on improved analyses:
If p < 0.05 (adjusted) AND effect size d ≥ 0.5 AND CI lower bound > $100: → Adopt display ads with high confidence
If p > 0.10 OR effect size d < 0.2: → Reject display ads and stick with standard ads
Otherwise (marginal evidence): → Continue testing with larger sample, then re-evaluate
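The three rules above can be written as a single function. This is a minimal sketch: the function name `recommend()` and its argument names are hypothetical, and `ci_lower` is assumed to be the lower bound of the 95% CI for display minus non-display revenue. The example call uses the Holm-adjusted p-value (0.0485) and effect size (0.632) reported earlier, with a CI lower bound of about $0.67/day.

```r
recommend <- function(p_adj, effect_size, ci_lower, min_diff = 100) {
  if (p_adj < 0.05 && effect_size >= 0.5 && ci_lower > min_diff) {
    "adopt"            # significant, meaningful, and CI clears the threshold
  } else if (p_adj > 0.10 || effect_size < 0.2) {
    "reject"           # weak evidence or negligible effect
  } else {
    "continue testing" # marginal evidence
  }
}

recommend(p_adj = 0.0485, effect_size = 0.632, ci_lower = 0.67)
# → "continue testing"
```

The result is "continue testing" because the lower CI bound falls far short of the $100/day threshold, even though the first two conditions are met.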
Always report:
- Both statistical AND practical significance
- Confidence intervals
- Effect sizes
- Power achieved
- Limitations and assumptions

Epistemological:
- Statistics alone can’t answer “Should we adopt this?” without business context
- P-values measure evidence strength, not decision utility
- All analyses rest on assumptions, so verify them or report violations

Ethical:
- Fairness requires proper randomization
- Transparency requires pre-registration and full reporting
- Harm minimization requires balancing Type I and Type II errors based on real costs
- Clients deserve informed consent and honest uncertainty quantification

Practical:
- Perfect studies are impossible; good studies acknowledge imperfections
- “More data” isn’t always the answer; better design often is
- Statistical significance ≠ practical significance ≠ business decision
This critique demonstrates that rigorous data science requires both technical proficiency and ethical reflection.