Goal 1: Business Scenario

Context

Digital Marketing Agency Scenario: A mid-sized digital marketing agency, “DataDriven Marketing Solutions,” manages online advertising campaigns for 50+ e-commerce clients. The agency is testing whether adding display creative elements (enhanced visuals, animations, call-to-action buttons) to standard text ads increases campaign performance for their clients.

Customer or Audience

Primary Stakeholders:
- Account Managers: Need to recommend optimal ad strategies to clients
- Client Business Owners: Deciding whether to allocate budget to premium display ads (cost +$2-5 per ad)
- Media Buyers: Executing campaigns and optimizing spend

Secondary Stakeholders:
- Creative Team: Designing display elements
- Data Analytics Team: Monitoring campaign performance
- Finance Team: ROI analysis and budget allocation

Problem Statement (SMART)

Current Situation: DataDriven Marketing Solutions runs both standard text ads and enhanced display ads for clients, but lacks rigorous evidence on whether the additional cost of display ads (+$2-5 per ad) generates sufficient return on investment through increased revenue or click-through rates.

Problem: By Q2 2024, the agency must determine whether display ads generate statistically significant and practically meaningful improvements in campaign metrics (revenue per day, click-through rate) to justify recommending them to clients and to inform pricing strategy for premium ad packages.

Specific Measurables:
- Minimum revenue increase to justify display ad cost: +$100/day per client
- Target click-through rate (CTR) improvement: +0.5 percentage points
- Timeline: Analysis based on 100 days of A/B test data
- Success metric: Statistical significance (p < 0.05) AND practical significance (effect size d ≥ 0.5)

Scope

Variables from Week 7 Lab:

From the marketing dataset:
- display (binary): 0 = standard ad, 1 = display ad
- revenue (continuous): Daily revenue generated (in dollars)
- clicks (count): Number of ad clicks
- impressions (count): Number of times ad was shown
- CTR (derived): Click-through rate = clicks / impressions

Analyses Used:
1. Two-sample t-test for revenue difference
2. Proportion test for CTR difference
3. Effect size calculation (Cohen’s d)
4. Power analysis for sample size determination
5. Bootstrapping for confidence intervals

Assumptions:
- Independence: Each day’s performance is independent (no carryover effects)
- Random assignment: Days are randomly assigned to display vs. non-display (questionable - see critique)
- Normal distribution: Revenue differences approximately normal (needs verification)
- Equal variance: Display and non-display groups have similar variance (testable)
- No confounders: No seasonal effects, competitor activity, or external events (strong assumption - see ethical concerns)
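The proportion test listed under Analyses Used can be run on pooled click and impression totals. The following is a minimal sketch only: the totals are hypothetical placeholders, not values from the marketing dataset, and `prop.test()` is used here as one reasonable way to carry out a normal-approximation test of equal proportions.

```r
# Hypothetical pooled totals per arm; NOT values from the marketing dataset
clicks      <- c(display = 1200, standard = 900)
impressions <- c(display = 40000, standard = 40000)

# CTR per arm = clicks / impressions
ctr <- clicks / impressions

# Two-sample test of equal proportions (normal approximation, no continuity correction)
ctr_test <- prop.test(x = clicks, n = impressions, correct = FALSE)

# Difference in CTR (display minus standard) in percentage points, with its 95% CI
diff_pp <- unname(ctr["display"] - ctr["standard"]) * 100
ci_pp   <- as.numeric(ctr_test$conf.int) * 100

cat("CTR difference:", round(diff_pp, 2), "percentage points\n")
cat("95% CI:", round(ci_pp[1], 2), "to", round(ci_pp[2], 2), "percentage points\n")
cat("p-value:", signif(ctr_test$p.value, 3), "\n")
```

Expressing the difference in percentage points makes it directly comparable with the +0.5 percentage point target in the problem statement.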

Objective

Success Criteria (Past Tense):

By the end of this analysis, the following will have been achieved:

1. Revenue Impact Quantified: The average daily revenue difference between display and non-display ads will have been determined with 95% confidence intervals, and practical significance (effect size) will have been assessed.
2. CTR Impact Identified: Whether display ads increase click-through rate by a meaningful margin (≥0.5 percentage points) will have been established with statistical evidence.
3. Cost-Benefit Analysis Completed: The minimum revenue increase needed to justify display ad costs ($100/day) will have been compared to the observed effect, enabling a clear recommendation.
4. Sample Size Validated: Whether the collected data (100 days) provides sufficient statistical power (≥80%) to detect meaningful effects will have been verified.
5. Recommendation Delivered: A data-driven recommendation (adopt display ads, reject them, or continue testing) will have been provided to account managers with clear rationale and confidence levels.

Goal 2: Model Critique and Improved Analyses

Overview of Original Week 7 Analysis

The Week 7 lab conducted:
1. Bootstrapped sampling distribution for revenue difference
2. Cohen’s d effect size calculation
3. Direct simulation test with p-value
4. Sample size calculation using pwrss package
5. Normal test of equal proportions for CTR

Key findings from original lab:
- Observed revenue difference: ~$93.60
- P-value: ~0.35 (not statistically significant at α = 0.05)
- Effect size: Not explicitly calculated in full

Critical Issues Identified

Issue 1: Lack of Assumption Checking

Problem: The lab never verifies normality, equal variance, or independence assumptions before conducting t-tests and normal approximations.

Why it matters: Violations can lead to incorrect p-values, confidence intervals, and conclusions. For business decisions involving budget allocation, this is unacceptable.

Improved Analysis 1: Comprehensive Assumption Testing

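# Load required packages
library(tidyverse)  # read_delim, dplyr verbs, pivot_wider, ggplot2
library(knitr)      # kable
library(effsize)    # cohen.d
library(pwr)        # pwr.t2n.test
library(scales)     # dollar

# Load marketing data
url_ <- "https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/marketing/marketing.csv"
marketing <- read_delim(url_, delim = ",", show_col_types = FALSE)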

# Create CTR variable
marketing <- marketing |>
  mutate(CTR = clicks / impressions)

# Separate groups
revenue_display <- marketing |> filter(display == 1) |> pull(revenue)
revenue_no_display <- marketing |> filter(display == 0) |> pull(revenue)

cat("=== ASSUMPTION TESTS ===\n\n")

## === ASSUMPTION TESTS ===

# 1. Normality Test (Shapiro-Wilk)
cat("1. NORMALITY TEST (Shapiro-Wilk)\n")

## 1. NORMALITY TEST (Shapiro-Wilk)

cat("   Display group p-value:", 
    shapiro.test(revenue_display)$p.value, "\n")

##    Display group p-value: 0.5180526

cat("   Non-display group p-value:", 
    shapiro.test(revenue_no_display)$p.value, "\n")

##    Non-display group p-value: 0.4131173

cat("   Interpretation: p > 0.05 gives no evidence against normality\n\n")

##    Interpretation: p > 0.05 gives no evidence against normality

# 2. Equal Variance Test (F-test via var.test; Levene's test from the car package is a more robust alternative)
cat("2. EQUAL VARIANCE TEST (F-test)\n")

## 2. EQUAL VARIANCE TEST (F-test)

var_test <- var.test(revenue_display, revenue_no_display)
cat("   F-statistic:", var_test$statistic, "\n")

##    F-statistic: 1.214676

cat("   P-value:", var_test$p.value, "\n")

##    P-value: 0.6759531

cat("   Interpretation: p > 0.05 gives no evidence against equal variances\n\n")

##    Interpretation: p > 0.05 gives no evidence against equal variances

# 3. Visual Diagnostic: Q-Q Plots
par(mfrow = c(1, 2))
qqnorm(revenue_display, main = "Q-Q Plot: Display Ads")
qqline(revenue_display, col = "red")
qqnorm(revenue_no_display, main = "Q-Q Plot: Non-Display Ads")
qqline(revenue_no_display, col = "red")

par(mfrow = c(1, 1))

cat("\n3. VISUAL ASSESSMENT\n")

## 
## 3. VISUAL ASSESSMENT

cat("   Q-Q plots should show points following the red line if normal\n")

##    Q-Q plots should show points following the red line if normal

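Findings: Both Shapiro-Wilk p-values (0.52 and 0.41) and the F-test p-value (0.68) exceed 0.05, so this sample shows no evidence against normality or equal variance. With only 20 observations per group these tests have limited power, so the Q-Q plots should be read alongside them.

If assumptions are violated:
- Normality violated → Use non-parametric tests (Mann-Whitney U / Wilcoxon rank-sum)
- Equal variance violated → Use Welch’s t-test (unequal variances)
- Both violated → Bootstrap or permutation tests are more appropriate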
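The fallback tests listed above can be sketched on simulated data. This is an illustrative example only: `rev_display` and `rev_standard` are hypothetical, deliberately skewed samples, not the lab's revenue columns, and the permutation test shown is one simple variant based on the difference in means.

```r
set.seed(42)
# Hypothetical daily revenue for two arms (simulated: skewed, unequal spread)
rev_display  <- rlnorm(20, meanlog = 6.0, sdlog = 0.5)
rev_standard <- rlnorm(20, meanlog = 5.8, sdlog = 0.3)

# Welch's t-test: does not assume equal variances
welch <- t.test(rev_display, rev_standard, var.equal = FALSE)

# Wilcoxon rank-sum (Mann-Whitney U): does not assume normality
wilcox <- wilcox.test(rev_display, rev_standard)

# Permutation test on the difference in means: reshuffle the group labels
obs_diff <- mean(rev_display) - mean(rev_standard)
pooled   <- c(rev_display, rev_standard)
n1       <- length(rev_display)
perm_diffs <- replicate(5000, {
  idx <- sample(length(pooled), n1)          # random relabelling
  mean(pooled[idx]) - mean(pooled[-idx])
})
p_perm <- mean(abs(perm_diffs) >= abs(obs_diff))  # two-sided p-value

cat("Welch p-value:      ", signif(welch$p.value, 3), "\n")
cat("Wilcoxon p-value:   ", signif(wilcox$p.value, 3), "\n")
cat("Permutation p-value:", signif(p_perm, 3), "\n")
```

If the three p-values point in the same direction, the conclusion is unlikely to hinge on the parametric assumptions.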

Issue 2: Temporal Dependence Not Addressed

Problem: The lab treats each day as independent, but marketing campaigns have temporal patterns (day-of-week effects, seasonality, trend over time). Revenue on consecutive days might be correlated.

Why it matters: Correlated observations violate the independence assumption and typically inflate the Type I error rate. Real campaigns have:
- Weekend vs. weekday effects
- Learning/optimization over time
- External events (holidays, competitor campaigns)

Improved Analysis 2: Time Series Considerations

# Check for temporal patterns
cat("=== TEMPORAL DEPENDENCE CHECK ===\n\n")

## === TEMPORAL DEPENDENCE CHECK ===

# Add time index
marketing <- marketing |>
  mutate(day_index = row_number())

# Test for autocorrelation in residuals
lm_revenue <- lm(revenue ~ display, data = marketing)
acf(residuals(lm_revenue), main = "Autocorrelation of Residuals")

# Test for trend over time
cat("Correlation between revenue and time:\n")

## Correlation between revenue and time:

cat("  Display group:", 
    cor.test(marketing |> filter(display == 1) |> pull(day_index),
             marketing |> filter(display == 1) |> pull(revenue))$estimate, "\n")

##   Display group: 0.5625481

cat("  Non-display group:", 
    cor.test(marketing |> filter(display == 0) |> pull(day_index),
             marketing |> filter(display == 0) |> pull(revenue))$estimate, "\n\n")

##   Non-display group: 0.5639309

# Visualize revenue over time
ggplot(marketing, aes(x = day_index, y = revenue, color = factor(display))) +
  geom_line(alpha = 0.6) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE) +
  scale_color_manual(values = c("0" = "#e74c3c", "1" = "#27ae60"),
                     labels = c("Non-Display", "Display")) +
  labs(title = "Revenue Over Time by Ad Type",
       subtitle = "Checking for temporal trends and patterns",
       x = "Day Index", y = "Revenue ($)", color = "Ad Type") +
  theme_minimal()

cat("If significant trends exist, use time series methods or mixed models\n")

## If significant trends exist, use time series methods or mixed models

Recommendation: If autocorrelation is detected, use:
- Mixed-effects model with time random effect
- Time series regression (ARIMA with intervention variable)
- Block randomization analysis (compare display vs. non-display within weekly blocks)
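To illustrate why a time trend matters, the sketch below simulates 40 days with an upward revenue trend and no true display effect, with display ads assigned to the later half of the period. The design and all numbers are hypothetical. It compares a naive model with one that adjusts for time, and runs a Ljung-Box test (`Box.test()` from base R) on the naive residuals.

```r
set.seed(1)
n_days    <- 40
day_index <- seq_len(n_days)
# Hypothetical design: display ads run only on the later half of the period
display   <- as.integer(day_index > 20)
# Simulated revenue with an upward trend and NO true display effect
revenue   <- 500 + 5 * day_index + rnorm(n_days, sd = 40)
sim <- data.frame(day_index, display, revenue)

# Naive model ignores time; adjusted model includes a linear time trend
naive    <- lm(revenue ~ display, data = sim)
adjusted <- lm(revenue ~ display + day_index, data = sim)

# Ljung-Box test for autocorrelation in the naive residuals (lags 1 to 7)
lb_naive <- Box.test(residuals(naive), lag = 7, type = "Ljung-Box")

cat("Naive display estimate:   ", round(coef(naive)[["display"]], 1), "\n")
cat("Adjusted display estimate:", round(coef(adjusted)[["display"]], 1), "\n")
cat("Ljung-Box p-value (naive residuals):", signif(lb_naive$p.value, 3), "\n")
```

In this setup the naive estimate absorbs the trend, which is the pattern to look for when both groups show a similar positive correlation between revenue and day index.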

Issue 3: Multiple Comparisons Problem

Problem: The lab tests multiple hypotheses (revenue AND CTR) without adjusting for multiple comparisons. This inflates the family-wise error rate.

Why it matters: Testing 2 independent hypotheses at α = 0.05 each gives a family-wise error rate of 1 - 0.95² ≈ 0.0975, i.e. roughly a 10% chance of at least one false positive.
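The family-wise error rate arithmetic can be checked directly. A small sketch, assuming the two tests are independent (the Bonferroni bound holds without that assumption):

```r
alpha <- 0.05
m     <- 2   # number of hypotheses tested (revenue and CTR)

# Family-wise error rate if the tests are independent
fwer_independent <- 1 - (1 - alpha)^m

# Upper bound that holds under any dependence structure
fwer_bound <- min(1, m * alpha)

# Per-test threshold that keeps the family-wise error rate at or below alpha
alpha_bonferroni <- alpha / m

cat("FWER (independent tests):", fwer_independent, "\n")
cat("FWER upper bound:        ", fwer_bound, "\n")
cat("Bonferroni per-test α:   ", alpha_bonferroni, "\n")
```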

Improved Analysis 3: Multiple Testing Correction

cat("=== MULTIPLE TESTING CORRECTION ===\n\n")

## === MULTIPLE TESTING CORRECTION ===

# Calculate both p-values
# Revenue test
t_test_revenue <- t.test(revenue ~ display, data = marketing)
p_revenue <- t_test_revenue$p.value

# CTR test
ctr_display <- marketing |> filter(display == 1) |> pull(CTR)
ctr_no_display <- marketing |> filter(display == 0) |> pull(CTR)
t_test_ctr <- t.test(ctr_display, ctr_no_display)
p_ctr <- t_test_ctr$p.value

# Bonferroni correction (most conservative)
p_values <- c(revenue = p_revenue, CTR = p_ctr)
adjusted_p <- p.adjust(p_values, method = "bonferroni")

# Holm-Bonferroni (less conservative, more powerful)
adjusted_p_holm <- p.adjust(p_values, method = "holm")

# Results table
results_df <- data.frame(
  Test = c("Revenue Difference", "CTR Difference"),
  Raw_P_Value = p_values,
  Bonferroni_Adjusted = adjusted_p,
  Holm_Adjusted = adjusted_p_holm,
  Significant_at_0.05 = adjusted_p_holm < 0.05
)

kable(results_df, 
      col.names = c("Test", "Raw p-value", "Bonferroni", "Holm", "Significant?"),
      digits = 4,
      caption = "Multiple Testing Correction Results")

Multiple Testing Correction Results
	Test	Raw p-value	Bonferroni	Holm	Significant?
revenue	Revenue Difference	0.0485	0.0969	0.0485	TRUE
CTR	CTR Difference	0.0000	0.0000	0.0000	TRUE

cat("\nInterpretation:\n")

## 
## Interpretation:

cat("- Use adjusted p-values when testing multiple hypotheses\n")

## - Use adjusted p-values when testing multiple hypotheses

cat("- Holm method is less conservative than Bonferroni\n")

## - Holm method is less conservative than Bonferroni

cat("- Only declare significance if adjusted p < 0.05\n")

## - Only declare significance if adjusted p < 0.05

Recommendation: Always adjust for multiple comparisons when testing multiple outcomes. Holm-Bonferroni strikes a good balance between power and false positive control.
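To make the Holm procedure less of a black box, the sketch below reproduces `p.adjust(..., method = "holm")` by hand. The revenue p-value is the one reported in the table above; the CTR value is a hypothetical stand-in, since the table only shows it rounded to 0.0000.

```r
# 0.0485 is the reported revenue p-value; 1e-6 is a HYPOTHETICAL CTR p-value
p_raw <- c(revenue = 0.0485, CTR = 1e-6)

m   <- length(p_raw)
ord <- order(p_raw)   # indices from smallest to largest p-value

# Step-down multipliers m, m-1, ..., 1, then enforce monotonicity and cap at 1
holm_sorted <- pmin(1, cummax((m - seq_len(m) + 1) * p_raw[ord]))
holm_manual <- holm_sorted[order(ord)]   # restore the original order

holm_builtin <- p.adjust(p_raw, method = "holm")
bonferroni   <- p.adjust(p_raw, method = "bonferroni")

print(rbind(raw = p_raw,
            holm_manual = unname(holm_manual),
            holm_builtin = unname(holm_builtin),
            bonferroni = unname(bonferroni)))
```

Holm multiplies only the smallest p-value by the full number of tests, which is why the larger p-value is left unchanged here while Bonferroni doubles it.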

Additional Improved Analyses

Improved Analysis 4: Practical Significance with Confidence Intervals

Problem: The original lab reports a p-value but doesn’t adequately emphasize practical significance or provide a clear decision framework.

cat("=== PRACTICAL SIGNIFICANCE ANALYSIS ===\n\n")

## === PRACTICAL SIGNIFICANCE ANALYSIS ===

# Calculate effect sizes with confidence intervals
revenue_effect <- cohen.d(revenue_display, revenue_no_display, 
                          hedges.correction = TRUE)

ctr_effect <- cohen.d(ctr_display, ctr_no_display,
                      hedges.correction = TRUE)

cat("Revenue Effect Size:\n")

## Revenue Effect Size:

print(revenue_effect)

## 
## Hedges's g
## 
## g estimate: 0.6321447 (medium)
## 95 percent confidence interval:
##       lower       upper 
## -0.01078621  1.27507570

cat("\n")

cat("CTR Effect Size:\n")

## CTR Effect Size:

print(ctr_effect)

## 
## Hedges's g
## 
## g estimate: 6.833318 (large)
## 95 percent confidence interval:
##    lower    upper 
## 5.192707 8.473929

cat("\n")

# Minimum detectable difference for business relevance
min_revenue_diff <- 100  # $100/day to justify cost
observed_revenue_diff <- mean(revenue_display) - mean(revenue_no_display)

# Calculate confidence interval for the difference (display minus non-display).
# t_test_revenue used the formula interface (revenue ~ display), which reports
# group 0 minus group 1, so the test is re-run with display as the first group.
revenue_ci <- t.test(revenue_display, revenue_no_display)$conf.int

cat("=== BUSINESS DECISION FRAMEWORK ===\n\n")

## === BUSINESS DECISION FRAMEWORK ===

cat("Minimum revenue increase needed:", dollar(min_revenue_diff), "/day\n")

## Minimum revenue increase needed: $100 /day

cat("Observed revenue increase:", dollar(observed_revenue_diff), "/day\n")

## Observed revenue increase: $93.59 /day

cat("95% CI for difference:", dollar(revenue_ci[1]), "to", 
    dollar(revenue_ci[2]), "\n\n")

## 95% CI for difference: $0.67 to $186.51

if(revenue_ci[1] > min_revenue_diff) {
  cat("✓ RECOMMENDATION: ADOPT display ads\n")
  cat("  Lower bound of CI exceeds minimum threshold\n")
} else if(revenue_ci[2] < 0) {
  cat("✗ RECOMMENDATION: REJECT display ads\n")
  cat("  Display ads likely decrease revenue\n")
} else {
  cat("⚠ RECOMMENDATION: CONTINUE TESTING\n")
  cat("  Evidence inconclusive - CI does not lie entirely above the minimum threshold\n")
}

## ⚠ RECOMMENDATION: CONTINUE TESTING
##   Evidence inconclusive - CI does not lie entirely above the minimum threshold

# Effect size interpretation
cat("\n=== EFFECT SIZE INTERPRETATION ===\n")

## 
## === EFFECT SIZE INTERPRETATION ===

cat("Cohen's d for revenue:", round(revenue_effect$estimate, 3), "\n")

## Cohen's d for revenue: 0.632

cat("  Interpretation: ")

##   Interpretation:

if(abs(revenue_effect$estimate) < 0.2) {
  cat("Negligible effect\n")
} else if(abs(revenue_effect$estimate) < 0.5) {
  cat("Small effect\n")
} else if(abs(revenue_effect$estimate) < 0.8) {
  cat("Medium effect\n")
} else {
  cat("Large effect\n")
}

## Medium effect

Key Improvement: Combines statistical significance with practical significance, providing a clear business recommendation based on where the confidence interval falls relative to the minimum business-relevant effect ($100/day).
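Because the decision rule depends on the sign of the interval, it is worth confirming which direction `t.test()` reports. The sketch below uses simulated data (hypothetical values, not the marketing dataset) to show that the formula interface returns group 0 minus group 1, while the two-vector interface returns first argument minus second.

```r
set.seed(7)
# Hypothetical data: 20 standard-ad days (display = 0), 20 display-ad days (display = 1)
sim <- data.frame(
  display = rep(c(0, 1), each = 20),
  revenue = c(rnorm(20, mean = 1000, sd = 150), rnorm(20, mean = 1100, sd = 150))
)

# Formula interface: interval is for mean(display == 0) minus mean(display == 1)
ci_formula <- t.test(revenue ~ display, data = sim)$conf.int

# Two-vector interface: interval is for first argument minus second argument
ci_display_minus_standard <- t.test(sim$revenue[sim$display == 1],
                                    sim$revenue[sim$display == 0])$conf.int

# Negating and reversing the formula-based interval gives the same result
flipped <- -rev(as.numeric(ci_formula))

cat("Formula interface CI:      ", round(as.numeric(ci_formula), 2), "\n")
cat("Display minus standard CI: ", round(as.numeric(ci_display_minus_standard), 2), "\n")
cat("Flipped formula CI:        ", round(flipped, 2), "\n")
```

Any threshold comparison such as "lower bound > $100" should be applied to the display-minus-standard interval.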

Improved Analysis 5: Power Analysis and Sensitivity

Problem: The original lab calculates the required sample size but doesn’t assess achieved power with the current data or conduct a sensitivity analysis.

cat("=== POST-HOC POWER ANALYSIS ===\n\n")

## === POST-HOC POWER ANALYSIS ===

# Achieved power with current sample
n1 <- sum(marketing$display == 1)
n2 <- sum(marketing$display == 0)
observed_d <- revenue_effect$estimate

# Calculate achieved power
power_achieved <- pwr.t2n.test(n1 = n1, n2 = n2, 
                               d = observed_d, 
                               sig.level = 0.05,
                               alternative = "two.sided")

cat("Current sample sizes: n1 =", n1, ", n2 =", n2, "\n")

## Current sample sizes: n1 = 20 , n2 = 20

cat("Observed effect size (d):", round(observed_d, 3), "\n")

## Observed effect size (d): 0.632

cat("Achieved statistical power:", round(power_achieved$power, 3), "\n\n")

## Achieved statistical power: 0.495

if(power_achieved$power < 0.80) {
  cat("⚠ WARNING: Power < 0.80. Risk of Type II error is high.\n")
  cat("  May need more data to detect this effect reliably.\n\n")
}

## ⚠ WARNING: Power < 0.80. Risk of Type II error is high.
##   May need more data to detect this effect reliably.

# Sensitivity analysis: What effect size can we detect?
power_curve <- pwr.t2n.test(n1 = n1, n2 = n2,
                            power = 0.80,
                            sig.level = 0.05,
                            alternative = "two.sided")

cat("=== SENSITIVITY ANALYSIS ===\n\n")

## === SENSITIVITY ANALYSIS ===

cat("With current sample sizes and 80% power:\n")

## With current sample sizes and 80% power:

cat("  Minimum detectable effect size (d):", round(power_curve$d, 3), "\n")

##   Minimum detectable effect size (d): 0.909

cat("  In practical terms: Can detect differences of\n")

##   In practical terms: Can detect differences of

cat("    ", round(power_curve$d * sd(marketing$revenue)), "dollars or more\n\n")

##      137 dollars or more

# Power curve visualization
d_range <- seq(0, 1, 0.05)
power_values <- sapply(d_range, function(d) {
  pwr.t2n.test(n1 = n1, n2 = n2, d = d, sig.level = 0.05,
               alternative = "two.sided")$power
})

ggplot(data.frame(effect_size = d_range, power = power_values),
       aes(x = effect_size, y = power)) +
  geom_line(color = "#3498db", linewidth = 1.2) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_hline(yintercept = 0.80, linetype = "dashed", color = "red") +
  geom_vline(xintercept = observed_d, linetype = "dashed", color = "green") +
  annotate("text", x = 0.5, y = 0.85, 
           label = "80% Power Threshold", color = "red") +
  annotate("text", x = observed_d + 0.1, y = 0.5,
           label = paste0("Observed d = ", round(observed_d, 2)),
           color = "green") +
  labs(title = "Statistical Power Curve",
       subtitle = paste0("Current n1 = ", n1, ", n2 = ", n2),
       x = "Effect Size (Cohen's d)",
       y = "Statistical Power") +
  theme_minimal()

Key Improvement: Assesses whether current study has adequate power to detect observed effect, and shows minimum detectable effect size given sample size constraints.
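The same power questions can be cross-checked with base R's `power.t.test()`, which needs no extra packages. This sketch takes the effect size (0.632) and group size (20) reported above as inputs and additionally asks how many days per group would be needed for 80% power; treating the observed effect as the true effect is an assumption.

```r
d_observed <- 0.632   # standardized effect reported above
n_current  <- 20      # days per group in the current sample

# Power of a two-sided, two-sample t-test at the current sample size
power_now <- power.t.test(n = n_current, delta = d_observed, sd = 1,
                          sig.level = 0.05, type = "two.sample",
                          alternative = "two.sided")$power

# Days per group needed to reach 80% power for the same effect size
n_needed <- power.t.test(power = 0.80, delta = d_observed, sd = 1,
                         sig.level = 0.05, type = "two.sample",
                         alternative = "two.sided")$n

cat("Power at n =", n_current, "per group:", round(power_now, 3), "\n")
cat("Days per group for 80% power:", ceiling(n_needed), "\n")
```

Setting `sd = 1` makes `delta` a standardized effect size, so the inputs match the scale used by `pwr.t2n.test()`.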

Goal 3: Ethical and Epistemological Concerns

Overview

Beyond statistical validity, this A/B test raises important ethical questions about fairness, transparency, and potential harms. Drawing from Week 5 lessons on ethics and epistemology, we examine three critical concerns.

Concern 1: Selection Bias in “Random” Assignment

The Issue

Epistemic Problem: The lab assumes days are randomly assigned to display vs. non-display conditions, but does not verify randomization. If assignment was based on convenience, time of month, or experimenter choice, hidden confounders could bias results.

Potential Biases:
- Temporal confounding: If display ads are tested during high-traffic periods (e.g., beginning of month when people have disposable income), the effect attributed to display ads might actually be due to timing
- Selection on observables: Clients with larger budgets might preferentially receive display ads
- Awareness effect: If account managers know which clients have display ads, they might provide extra attention or optimization

Who Is Affected?

Clients receiving non-display ads: If display ads are systematically given to “better” time periods or clients, standard ad clients are disadvantaged—they don’t just miss out on display features, they’re also relegated to worse experimental conditions.

Business impact: If confounding is present, the agency might:
- Overestimate display ad effectiveness → overcharge clients
- Underestimate effectiveness → miss revenue opportunities

Ethical Framework

Principle of Justice (Rawls): Random assignment ensures no group is systematically advantaged or disadvantaged. Non-random assignment violates fairness.

Transparency Principle: Clients should be informed if they’re in an experiment. Were they told some received display ads and others didn’t? Did they consent?

Mitigation Strategies

cat("=== RANDOMIZATION CHECK ===\n\n")

## === RANDOMIZATION CHECK ===

# Check if assignment is balanced over time
marketing <- marketing |>
  mutate(
    week = ceiling(day_index / 7),
    time_period = case_when(
      day_index <= 33 ~ "Early",
      day_index <= 66 ~ "Middle",
      TRUE ~ "Late"
    )
  )

# Balance by time period
balance_table <- marketing |>
  group_by(time_period, display) |>
  summarise(n = n(), .groups = 'drop') |>
  pivot_wider(names_from = display, values_from = n,
              names_prefix = "display_")

kable(balance_table,
      col.names = c("Time Period", "Non-Display Days", "Display Days"),
      caption = "Temporal Balance Check")

Temporal Balance Check
Time Period	Non-Display Days	Display Days
Early	20	13
Middle		7

# Chi-square test for independence
balance_test <- chisq.test(table(marketing$time_period, marketing$display))
cat("\nChi-square test for temporal balance:\n")

## 
## Chi-square test for temporal balance:

cat("  p-value:", balance_test$p.value, "\n")

##   p-value: 0.01253369

if(balance_test$p.value < 0.05) {
  cat("  ⚠ WARNING: Assignment not balanced across time periods\n")
  cat("  This suggests potential confounding\n")
} else {
  cat("  ✓ Assignment appears balanced across time\n")
}

##   ⚠ WARNING: Assignment not balanced across time periods
##   This suggests potential confounding

Recommendations:
1. Retrospectively check for confounders (done above)
2. Future experiments: Use proper randomization (coin flip, random number generator)
3. Stratified randomization: Balance across known confounders (day of week, time of month)
4. Report randomization method in all analyses
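Stratified randomization by day of week could be set up as follows. This is a sketch under assumptions: a hypothetical six-week test, with half of the weeks assigned to display ads within each weekday so that every weekday is equally represented in both arms.

```r
set.seed(2024)
n_weeks   <- 6   # hypothetical test length
weekdays_ <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

schedule <- data.frame(
  week    = rep(seq_len(n_weeks), each = 7),
  weekday = factor(rep(weekdays_, times = n_weeks), levels = weekdays_)
)

# Within each weekday stratum, randomly assign half of the weeks to display (1)
schedule$display <- ave(
  numeric(nrow(schedule)), schedule$weekday,
  FUN = function(x) sample(rep(c(0, 1), length.out = length(x)))
)

# Balance check: every weekday should have the same count in each arm
balance <- table(schedule$weekday, schedule$display)
print(balance)
```

Recording the seed alongside the schedule makes the assignment reproducible and auditable, which supports reporting the randomization method.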

Concern 2: Cherry-Picking Metrics and HARKing

The Issue

Epistemic Problem: The lab tests two outcomes (revenue and CTR) without pre-registration. This opens the door to:
- HARKing: Hypothesizing After Results are Known
- P-hacking: Testing multiple outcomes until one shows significance
- Selective reporting: Only reporting the “significant” result

Example scenario:
1. Agency runs test, finds display ads don’t increase revenue (p = 0.35)
2. Checks CTR instead and finds marginal significance (p = 0.048)
3. Reports to clients: “Display ads significantly improve engagement!” (true but misleading)
4. Omits revenue findings because “revenue wasn’t the primary outcome” (post-hoc justification)

Who Is Affected?

Clients: Make budget decisions based on misleading “significant” results. Might overpay for display ads that don’t actually improve bottom-line revenue.

Competitors: Agencies that conduct rigorous pre-registered studies are disadvantaged if cherry-picking is industry norm.

Public trust: Erodes confidence in A/B testing and data science generally if “significant results” are achieved through methodological flexibility.

Ethical Framework

Scientific Integrity: Results should be reported regardless of statistical significance. Negative results are information too.

Transparency: Pre-registration of hypotheses and analysis plan prevents post-hoc storytelling.

Mitigation Strategies

cat("=== PRE-REGISTRATION TEMPLATE ===\n\n")

## === PRE-REGISTRATION TEMPLATE ===

cat("Before collecting data, specify:\n\n")

## Before collecting data, specify:

cat("1. PRIMARY OUTCOME: Revenue per day\n")

## 1. PRIMARY OUTCOME: Revenue per day

cat("   - Hypothesis: Display ads increase revenue by ≥$100/day\n")

##    - Hypothesis: Display ads increase revenue by ≥$100/day

cat("   - Statistical test: Two-sample t-test (or Welch if variances unequal)\n")

##    - Statistical test: Two-sample t-test (or Welch if variances unequal)

cat("   - Significance threshold: α = 0.05 (Bonferroni-adjusted if multiple tests)\n")

##    - Significance threshold: α = 0.05 (Bonferroni-adjusted if multiple tests)

cat("   - Minimum sample size: n = 34 per group (based on power analysis)\n\n")

##    - Minimum sample size: n = 34 per group (based on power analysis)

cat("2. SECONDARY OUTCOME: Click-through rate\n")

## 2. SECONDARY OUTCOME: Click-through rate

cat("   - Hypothesis: Display ads increase CTR by ≥0.5 percentage points\n")

##    - Hypothesis: Display ads increase CTR by ≥0.5 percentage points

cat("   - Statistical test: Two-proportion z-test\n")

##    - Statistical test: Two-proportion z-test

cat("   - Significance threshold: α = 0.025 (Bonferroni adjustment for 2 tests)\n\n")

##    - Significance threshold: α = 0.025 (Bonferroni adjustment for 2 tests)

cat("3. ANALYSIS PLAN:\n")

## 3. ANALYSIS PLAN:

cat("   - Check assumptions (normality, equal variance)\n")

##    - Check assumptions (normality, equal variance)

cat("   - If violated: Use non-parametric tests\n")

##    - If violated: Use non-parametric tests

cat("   - Report effect sizes (Cohen's d) regardless of p-value\n")

##    - Report effect sizes (Cohen's d) regardless of p-value

cat("   - Report confidence intervals\n")

##    - Report confidence intervals

cat("   - Adjust for multiple comparisons\n\n")

##    - Adjust for multiple comparisons

cat("4. STOPPING RULE:\n")

## 4. STOPPING RULE:

cat("   - Collect data for exactly 100 days (50 display, 50 non-display)\n")

##    - Collect data for exactly 100 days (50 display, 50 non-display)

cat("   - NO interim analyses (to prevent p-value peeking)\n")

##    - NO interim analyses (to prevent p-value peeking)

cat("   - Calculate p-value only once at end\n\n")

##    - Calculate p-value only once at end

cat("By pre-registering, we commit to this plan BEFORE seeing results.\n")

## By pre-registering, we commit to this plan BEFORE seeing results.

Recommendations:
1. Pre-register analysis plans on Open Science Framework or AsPredicted.org
2. Report all outcomes tested, not just significant ones
3. Distinguish pre-planned vs. exploratory analyses in reports
4. Replicate significant findings in a new sample before making business recommendations

Concern 3: Economic Harm from Type II Errors

The Issue

Epistemic Problem: The lab emphasizes Type I error (false positive) but underestimates risk of Type II error (false negative). With achieved power potentially <80%, there’s high risk of missing a real effect.

Business scenario:
- True effect: Display ads increase revenue by $150/day
- Study result: p = 0.08 (not significant at α = 0.05)
- Decision: Reject display ads, continue with non-display
- Harm: Agency and clients miss out on $150/day × 365 days × 50 clients = $2,737,500 (about $2.7M) in annual opportunity cost
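The opportunity-cost figure in this scenario follows from simple multiplication. A quick check, using the scenario's own assumed inputs:

```r
daily_effect  <- 150   # assumed true revenue lift per client, in $/day
days_per_year <- 365
n_clients     <- 50

annual_per_client <- daily_effect * days_per_year
annual_total      <- annual_per_client * n_clients

cat("Annual opportunity cost per client: $",
    format(annual_per_client, big.mark = ","), "\n", sep = "")
cat("Annual opportunity cost, all clients: $",
    format(annual_total, big.mark = ","), "\n", sep = "")
```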

Who Is Affected?

Clients: Lose potential revenue because study lacked power to detect moderate effect

Small businesses: Large agencies can afford longer experiments to achieve adequate power; small agencies with fewer clients have structural disadvantage

Agency: Loses competitive edge if competitors detect effect and capitalize while agency concludes “no effect”

Ethical Framework

Harm Minimization: Both Type I and Type II errors cause harm. Balancing them requires considering asymmetric costs.

Economic Justice: If display ads truly help small businesses compete, failing to detect this effect due to underpowered study perpetuates disadvantage.

Cost-Benefit Analysis

cat("=== COST-BENEFIT DECISION MATRIX ===\n\n")

## === COST-BENEFIT DECISION MATRIX ===

# Define costs
display_ad_cost <- 3  # $3 per display ad
avg_ads_per_day <- 100
daily_cost_increase <- display_ad_cost * avg_ads_per_day

cat("Cost of switching to display ads:", dollar(daily_cost_increase), "/day per client\n\n")

## Cost of switching to display ads: $300 /day per client

# Decision matrix
decisions <- data.frame(
  Decision = c("Adopt Display", "Adopt Display", 
               "Reject Display", "Reject Display"),
  True_State = c("Display works (+$150/day)", "Display doesn't work (no effect)",
                 "Display works (+$150/day)", "Display doesn't work (no effect)"),
  Outcome = c("Correct - Net gain $150 - $300 = -$150/day (LOSS!)", 
              "Type I Error - Lose $300/day on unnecessary display",
              "Type II Error - Lose $150/day opportunity",
              "Correct - No change"),
  Annual_Impact = c(dollar(-150 * 365), 
                   dollar(-300 * 365),
                   dollar(-150 * 365),
                   "$0")
)

kable(decisions,
      caption = "Decision Matrix: Economic Consequences of Errors")

Decision Matrix: Economic Consequences of Errors
Decision	True_State	Outcome	Annual_Impact
Adopt Display	Display works (+$150/day)	Correct - Net gain $150 - $300 = -$150/day (LOSS!)	-$54,750
Adopt Display	Display doesn’t work (no effect)	Type I Error - Lose $300/day on unnecessary display	-$109,500
Reject Display	Display works (+$150/day)	Type II Error - Lose $150/day opportunity	-$54,750
Reject Display	Display doesn’t work (no effect)	Correct - No change	$0

cat("\n\n=== KEY INSIGHT ===\n")

## 
## 
## === KEY INSIGHT ===

cat("If display ad cost ($300/day) exceeds benefit ($150/day):\n")

## If display ad cost ($300/day) exceeds benefit ($150/day):

cat("  → Type I error (false positive) is MORE costly than Type II error\n")

##   → Type I error (false positive) is MORE costly than Type II error

cat("  → Should require stronger evidence (lower α) before adopting\n\n")

##   → Should require stronger evidence (lower α) before adopting

cat("If benefit ($150/day) exceeds cost ($50/day):\n")

## If benefit ($150/day) exceeds cost ($50/day):

cat("  → Type II error (false negative) is MORE costly than Type I error\n")

##   → Type II error (false negative) is MORE costly than Type I error

cat("  → Should tolerate higher α or prioritize power\n\n")

##   → Should tolerate higher α or prioritize power

cat("Decision depends on cost structure, not arbitrary α = 0.05 convention!\n")

## Decision depends on cost structure, not arbitrary α = 0.05 convention!

Recommendations:
1. Asymmetric testing: If Type II error is costlier, use α = 0.10 instead of 0.05 to increase power
2. Bayesian decision theory: Incorporate cost of errors directly into decision threshold
3. Sequential testing: Use methods that allow early stopping for benefit OR harm (e.g., Bayesian adaptive designs)
4. Pilot + Confirmation: Run small pilot (n=20) to estimate effect size, then power main study appropriately
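The trade-off behind asymmetric testing can be made concrete by computing power at several α levels and weighting each error by its cost. The sketch below is illustrative only: the error costs, the 50% prior probability that display ads work, and the effect size are all hypothetical assumptions, loosely based on the scenario in which the benefit ($150/day) exceeds the cost ($50/day).

```r
d_assumed   <- 0.632   # assumed true standardized effect
n_per_group <- 20
alphas      <- c(0.01, 0.05, 0.10)

# Power of a two-sided, two-sample t-test at each significance level
power_by_alpha <- sapply(alphas, function(a) {
  power.t.test(n = n_per_group, delta = d_assumed, sd = 1, sig.level = a)$power
})

# HYPOTHETICAL costs in $/day and prior probability that display ads work
cost_type1   <- 50    # paying for display ads that do nothing
cost_type2   <- 100   # net benefit forgone ($150 lift minus $50 cost)
prior_effect <- 0.5

# Expected daily loss = P(no effect) * α * cost_I + P(effect) * (1 - power) * cost_II
expected_loss <- (1 - prior_effect) * alphas * cost_type1 +
  prior_effect * (1 - power_by_alpha) * cost_type2

print(data.frame(alpha = alphas,
                 power = round(power_by_alpha, 3),
                 expected_loss = round(expected_loss, 2)))
```

Under a different cost structure the ranking of α levels can reverse, which is the point of tying the threshold to costs rather than to convention.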

Additional Ethical Considerations

5. External Validity

Issue: Results based on one agency’s clients may not generalize to:
- Different industries
- Different geographic regions
- Different budget levels

Harm: Overgeneralization could lead other agencies to adopt/reject display ads inappropriately.

Mitigation:
- Clearly state limitations in reports
- Encourage replication across contexts
- Report client characteristics (industry distribution, budget ranges)

Summary and Recommendations

Statistical Improvements

Critical fixes:
1. ✓ Test assumptions before parametric tests
2. ✓ Address temporal dependence
3. ✓ Adjust for multiple comparisons
4. ✓ Report practical significance with confidence intervals
5. ✓ Conduct post-hoc power analysis

Result: More robust conclusions that account for real-world complexity.

Ethical Safeguards

Critical practices:
1. ✓ Verify randomization and check for confounders
2. ✓ Pre-register analysis plans
3. ✓ Report all outcomes (not just significant ones)
4. ✓ Consider asymmetric costs of Type I vs. Type II errors
5. ✓ Disclose limitations and context-dependence

Result: Transparent, reproducible research that builds trust with clients and stakeholders.

Final Recommendation for Business Scenario

Based on improved analyses:

If p < 0.05 (adjusted) AND effect size d ≥ 0.5 AND CI lower bound > $100: → Adopt display ads with high confidence

If p > 0.10 OR effect size d < 0.2: → Reject display ads and stick with standard ads

Otherwise (marginal evidence): → Continue testing with larger sample, then re-evaluate
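The three-way rule above can be written as a small helper. `recommend()` is a hypothetical function introduced here for illustration; the example call uses the Holm-adjusted revenue p-value (0.0485) and effect size (0.632) reported earlier, and assumes a lower confidence bound of $0.67 for the revenue difference expressed as display minus non-display.

```r
# Hypothetical helper implementing the decision rule stated above
recommend <- function(p_adj, d, ci_lower, min_diff = 100) {
  if (p_adj < 0.05 && d >= 0.5 && ci_lower > min_diff) {
    "Adopt display ads"
  } else if (p_adj > 0.10 || d < 0.2) {
    "Reject display ads"
  } else {
    "Continue testing"
  }
}

# Values from this analysis (CI bound oriented as display minus non-display)
result <- recommend(p_adj = 0.0485, d = 0.632, ci_lower = 0.67)
cat("Recommendation:", result, "\n")
```

With these inputs the first two conditions for adoption hold, but the lower bound falls well short of $100/day, so the rule returns "Continue testing".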

Always report:
- Both statistical AND practical significance
- Confidence intervals
- Effect sizes
- Power achieved
- Limitations and assumptions

Broader Lessons

Epistemological:
- Statistics alone can’t answer “Should we adopt this?” without business context
- P-values measure evidence strength, not decision utility
- All analyses rest on assumptions, so verify them or report violations

Ethical:
- Fairness requires proper randomization
- Transparency requires pre-registration and full reporting
- Harm minimization requires balancing Type I and Type II errors based on real costs
- Clients deserve informed consent and honest uncertainty quantification

Practical:
- Perfect studies are impossible; good studies acknowledge imperfections
- “More data” isn’t always the answer; better design often is
- Statistical significance ≠ practical significance ≠ business decision

This critique demonstrates that rigorous data science requires both technical proficiency and ethical reflection.

Week 14 Data Dive: Model Critique of Week 7 (Hypothesis Testing)

Pratik Mane, Anuska Acharya, Mohid Qadeer

2026-04-21
