A compact A/B testing notebook — EDA + Core Tests

This notebook contains a concise, practical workflow I used to evaluate a marketing A/B test. It focuses on the minimal, statistically sound steps needed to decide whether a campaign moved the primary KPI (conversion rate). I include brief notes and in-place interpretations to make the analysis easy to follow.

Assumptions:

- Unit of analysis is the user. If your raw data has multiple events per user, aggregate first (a sketch follows below).
- Expected columns: user_id and test_group with exactly two group labels (here "ad" and "psa").
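If the raw log is event-level (several rows per user), a minimal aggregation sketch along these lines collapses it first; the events frame and its values below are hypothetical, for illustration only.

library(dplyr)

# Hypothetical event-level log: one row per ad exposure
events <- data.frame(
  user_id    = c(1L, 1L, 2L),
  test_group = c("ad", "ad", "psa"),
  converted  = c(FALSE, TRUE, FALSE)
)

user_level <- events %>%
  group_by(user_id) %>%
  summarise(
    test_group = first(test_group),            # assignment is constant within a user
    converted  = any(converted, na.rm = TRUE), # a user converts if any event converts
    .groups = "drop"
  )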


1) Load data

# Packages used throughout the notebook
library(dplyr)
library(ggplot2)
library(pwr)

DATA_PATH <- "./marketing_AB.csv"
df_orgnl <- readr::read_csv(DATA_PATH, show_col_types = FALSE)

# preview the data (a tibble prints its first 10 rows)
print(df_orgnl)
## # A tibble: 588,101 × 7
##     ...1 `user id` `test group` converted `total ads` `most ads day`
##    <dbl>     <dbl> <chr>        <lgl>           <dbl> <chr>         
##  1     0   1069124 ad           FALSE             130 Monday        
##  2     1   1119715 ad           FALSE              93 Tuesday       
##  3     2   1144181 ad           FALSE              21 Tuesday       
##  4     3   1435133 ad           FALSE             355 Tuesday       
##  5     4   1015700 ad           FALSE             276 Friday        
##  6     5   1137664 ad           FALSE             734 Saturday      
##  7     6   1116205 ad           FALSE             264 Wednesday     
##  8     7   1496843 ad           FALSE              17 Sunday        
##  9     8   1448851 ad           FALSE              21 Tuesday       
## 10     9   1446284 ad           FALSE             142 Monday        
## # ℹ 588,091 more rows
## # ℹ 1 more variable: `most ads hour` <dbl>

2) Preprocess & sanity checks

Check structure, missing values, and column names; the data is collapsed to a single row per user in section 3.

df <- df_orgnl
summary(df)
##       ...1           user id         test group        converted      
##  Min.   :     0   Min.   : 900000   Length:588101      Mode :logical  
##  1st Qu.:147025   1st Qu.:1143190   Class :character   FALSE:573258   
##  Median :294050   Median :1313725   Mode  :character   TRUE :14843    
##  Mean   :294050   Mean   :1310692                                     
##  3rd Qu.:441075   3rd Qu.:1484088                                     
##  Max.   :588100   Max.   :1654483                                     
##    total ads       most ads day       most ads hour  
##  Min.   :   1.00   Length:588101      Min.   : 0.00  
##  1st Qu.:   4.00   Class :character   1st Qu.:11.00  
##  Median :  13.00   Mode  :character   Median :14.00  
##  Mean   :  24.82                      Mean   :14.47  
##  3rd Qu.:  27.00                      3rd Qu.:18.00  
##  Max.   :2065.00                      Max.   :23.00
colSums(is.na(df))
##          ...1       user id    test group     converted     total ads 
##             0             0             0             0             0 
##  most ads day most ads hour 
##             0             0
# Drop the index-like column if present (readr names the unnamed CSV index "...1")
if ("...1" %in% names(df)) df <- dplyr::select(df, -`...1`)

# Rename columns for easier access
rename_map <- c(
  `user id` = "user_id",
  `test group` = "test_group",
  `total ads` = "total_ads",
  `most ads day` = "most_ads_day",
  `most ads hour` = "most_ads_hour"
)
for (old in names(rename_map)) {
  new <- rename_map[[old]]
  if (old %in% names(df)) names(df)[names(df) == old] <- new
}
# Return the indices of values outside the Tukey fences (Q1/Q3 -/+ whisker * IQR)
filter_outliers <- function(x, whisker = 1.5) {
  q1 <- stats::quantile(x, 0.25, na.rm = TRUE)
  q3 <- stats::quantile(x, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  lower <- q1 - whisker * iqr
  upper <- q3 + whisker * iqr
  which(x < lower | x > upper)
}

if ("total_ads" %in% names(df) && any(!is.na(df$total_ads))) {
  idx <- filter_outliers(df$total_ads)
  total_outliers <- length(idx)
  total_records <- nrow(df)
  pct_outliers <- total_outliers * 100 / total_records

  cat(sprintf(
    "\nOutlier Summary for total_ads:\n---\nTotal outliers: %d\nTotal records: %d\nPercentage of outliers: %.2f%%\n\n",
    total_outliers, total_records, pct_outliers
  ))

  # Compute whiskers for reference lines
  q1 <- quantile(df$total_ads, 0.25, na.rm = TRUE)
  q3 <- quantile(df$total_ads, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  lower <- q1 - 1.5 * iqr
  upper <- q3 + 1.5 * iqr

  # create a plotting dataframe with an explicit outlier flag and a constant x value
  df_plot <- df %>%
    mutate(
      is_outlier = !is.na(total_ads) & (total_ads < lower | total_ads > upper),
      grp = factor("all"),
      point_color = ifelse(is_outlier, "red", "black")  # concrete color values to avoid scale issues
    )

  # Create visualization and force it to render
  p <- ggplot(df_plot, aes(x = grp, y = total_ads)) +
    geom_boxplot(fill = "#9ecae1", color = "#2171b5",
                 outlier.colour = "red", outlier.shape = 16, width = 0.25) +
    # use identity scale by mapping to a concrete color column
    geom_jitter(aes(color = point_color),
                width = 0.15, alpha = 0.7, show.legend = FALSE, size = 1.5) +
    scale_color_identity() +
    geom_hline(yintercept = lower, linetype = "dashed", color = "red") +
    geom_hline(yintercept = upper, linetype = "dashed", color = "red") +
    labs(
      title = "Outlier Detection — total_ads",
      subtitle = sprintf("Detected %d outliers (%.2f%% of records)", total_outliers, pct_outliers),
      y = "Total Ads",
      x = NULL
    ) +
    theme_minimal(base_size = 13) +
    theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

  print(p)
} else {
  cat("No 'total_ads' column found or all values are NA - skipping outlier plot.\n")
}
## 
## Outlier Summary for total_ads:
## ---
## Total outliers: 52057
## Total records: 588101
## Percentage of outliers: 8.85%

3) Conversion summary and simple plots

# Build user-level frame
user_df <- df %>%
  group_by(user_id) %>%
  summarise(
    test_group = dplyr::first(test_group),
    converted = as.integer(max(converted, na.rm = TRUE))
  ) %>%
  ungroup()



cat("\nTotal users:", nrow(user_df), "\n")
## 
## Total users: 588101
user_df %>% group_by(test_group) %>% summarise(n_users = n(), conversions = sum(converted, na.rm = TRUE), .groups = "drop")
## # A tibble: 2 × 3
##   test_group n_users conversions
##   <chr>        <int>       <int>
## 1 ad          564577       14423
## 2 psa          23524         420
cat("\nMissing values by column:\n")
## 
## Missing values by column:
sapply(user_df, function(x) sum(is.na(x)))
##    user_id test_group  converted 
##          0          0          0
if (nrow(user_df) > 0) {
  conv_by_group <- user_df %>%
    group_by(test_group) %>%
    summarise(conv_rate = mean(converted, na.rm = TRUE),
              sum = sum(converted, na.rm = TRUE),
              count = n(), .groups = "drop")
  print(conv_by_group)

  # Name the plot explicitly (avoid `c`, which shadows base::c)
  p_conv <- ggplot(conv_by_group, aes(x = test_group, y = conv_rate)) +
    geom_col(fill = "#3182bd") +
    labs(title = "Conversion rate by test_group", x = "Group", y = "Conversion rate") +
    ylim(0, max(conv_by_group$conv_rate, na.rm = TRUE) * 1.5) +
    theme_minimal()
  print(p_conv)
}
## # A tibble: 2 × 4
##   test_group conv_rate   sum  count
##   <chr>          <dbl> <int>  <int>
## 1 ad            0.0255 14423 564577
## 2 psa           0.0179   420  23524

4) Main test — two-proportion z-test (conversion rate)

Purpose: test whether conversion rate differs between groups.

Hypotheses and Definitions

Definitions:

- pA — true conversion rate (probability of conversion) of the first group; pB — true conversion rate of the second. Note: the code below orders the groups alphabetically, so here A = "ad" and B = "psa"; the explicitly labeled control-vs-treatment comparison is repeated in section 6.
- H₀ (Null Hypothesis): there is no difference between the two groups; any observed difference is due to random chance.
- H₁ (Alternative Hypothesis): there is a difference between the two groups, meaning the campaign had a real effect.

Formally:

\[ H_0 : p_A = p_B \qquad H_1 : p_A \neq p_B \]

Interpretation:

- If p-value < 0.05, reject H₀ → the campaign had a statistically significant effect.
- If p-value ≥ 0.05, fail to reject H₀ → no sufficient evidence of campaign impact.
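The code below implements the standard two-proportion z-test: a pooled standard error under H₀ for the test statistic, and an unpooled standard error for the 95% confidence interval:

\[ \hat{p} = \frac{x_A + x_B}{n_A + n_B}, \qquad z = \frac{\hat{p}_B - \hat{p}_A}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}} \]

\[ (\hat{p}_B - \hat{p}_A) \pm 1.96\,\sqrt{\frac{\hat{p}_A(1 - \hat{p}_A)}{n_A} + \frac{\hat{p}_B(1 - \hat{p}_B)}{n_B}} \]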

if (nrow(user_df) > 0) {
  grouped <- user_df %>% group_by(test_group) %>% summarise(successes = sum(converted, na.rm = TRUE), n = n(), .groups = "drop")
  print(grouped)

  # Ensure exactly two groups
  if (nrow(grouped) == 2) {
    # Order by test_group for stable indexing
    grouped <- grouped %>% arrange(test_group)
    successes <- grouped$successes
    nobs <- grouped$n

    pA <- successes[1] / nobs[1]
    pB <- successes[2] / nobs[2]
    diff <- pB - pA

    # Pooled standard error under H0: pA == pB
    p_pool <- sum(successes) / sum(nobs)
    se_pooled <- sqrt(p_pool * (1 - p_pool) * (1 / nobs[1] + 1 / nobs[2]))
    z_stat <- if (se_pooled > 0) diff / se_pooled else NA_real_
    p_value <- if (is.na(z_stat)) NA_real_ else 2 * (1 - pnorm(abs(z_stat)))

    # Unpooled CI for difference
    se_unpooled <- sqrt(pA * (1 - pA) / nobs[1] + pB * (1 - pB) / nobs[2])
    ci_low <- diff - 1.96 * se_unpooled
    ci_high <- diff + 1.96 * se_unpooled

    cat(sprintf("p_A = %.4f, p_B = %.4f\n", pA, pB))
    cat(sprintf("Difference (B - A) = %.4f\n", diff))
    cat(sprintf("z = %.3f, p = %.4g\n", z_stat, p_value))
    cat(sprintf("95%% CI for difference = [%.4f, %.4f]\n", ci_low, ci_high))

    if (!is.na(p_value) && p_value < 0.05) {
      cat("\nConclusion: Significant difference at alpha = 0.05.\n")
    } else {
      cat("\nConclusion: No significant difference at alpha = 0.05.\n")
    }
  }
}
## # A tibble: 2 × 3
##   test_group successes      n
##   <chr>          <int>  <int>
## 1 ad             14423 564577
## 2 psa              420  23524
## p_A = 0.0255, p_B = 0.0179
## Difference (B - A) = -0.0077
## z = -7.370, p = 1.705e-13
## 95% CI for difference = [-0.0094, -0.0060]
## 
## Conclusion: Significant difference at alpha = 0.05.
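As a quick cross-check (not part of the original workflow), base R's prop.test without continuity correction is equivalent to this pooled z-test: its X-squared statistic equals z², here about 7.37² ≈ 54.3. A minimal sketch using the counts printed above:

# Two-sample test of equal proportions; correct = FALSE matches the pooled z-test
# (X-squared = z^2). Counts copied from the grouped summary above (ad, psa).
prop.test(x = c(14423, 420), n = c(564577, 23524), correct = FALSE)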

5) Effect Size — Measuring Business Impact

What we are doing:
After finding a statistically significant difference, we calculate how big that difference is in practical terms.
This step measures the absolute difference and relative lift between the treatment (B) and control (A) groups.

Definitions:

- Absolute Difference (diff) — the direct difference in conversion rates between groups (pB − pA).
- Relative Lift (%) — the percentage change of group B relative to group A, showing how much the campaign moved conversions relative to the baseline.
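In symbols, matching the code below:

\[ \text{lift}_{\%} = \frac{p_B - p_A}{p_A} \times 100 \]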

Why it matters:
Statistical significance alone doesn’t tell us if the improvement is meaningful for the business.
Effect size helps translate the result into actionable terms — e.g., a 2% absolute lift might be statistically small but could still drive major revenue impact at scale.

if (exists("diff") && exists("pA")) {
  abs_diff <- diff
  pct_lift <- if (!is.null(pA) && !is.na(pA) && pA != 0) (diff / pA) * 100 else NA_real_
  cat(sprintf("Absolute difference: %.4f\n", abs_diff))
  cat(sprintf("Relative lift (B vs A): %s\n", ifelse(is.na(pct_lift), "NA", sprintf("%.2f%%", pct_lift))))
}
## Absolute difference: -0.0077
## Relative lift (B vs A): -30.11%

6) Conversion Rate Comparison (Control vs Treatment)

This section measures whether the experimental advertisement led to a different conversion rate compared to the control (PSA).

Steps performed below:

1. Group the dataset by test_group and calculate total successes (number of conversions) and total n (number of observations per group).
2. Compute conversion rates for each group.
3. Run a two-proportion z-test to check whether the difference in conversion rates is statistically significant.

The printed output includes: - conversion rates for the control (psa) and treatment (ad) groups, and
- the z-statistic and p-value from the test.

Interpretation: If the p-value is less than 0.05, we reject the null hypothesis and conclude that the treatment (ad) produced a statistically significant difference in conversion rate compared to the control.

control_label <- "psa"
treat_label <- "ad"

if (nrow(user_df) > 0) {
  grouped2 <- user_df %>% group_by(test_group) %>% summarise(successes = sum(converted, na.rm = TRUE), n = n(), .groups = "drop")
  print(grouped2)

  if (all(c(control_label, treat_label) %in% grouped2$test_group)) {
    s_control <- grouped2$successes[grouped2$test_group == control_label]
    n_control <- grouped2$n[grouped2$test_group == control_label]
    s_treat <- grouped2$successes[grouped2$test_group == treat_label]
    n_treat <- grouped2$n[grouped2$test_group == treat_label]

    p_control <- s_control / n_control
    p_treat <- s_treat / n_treat

    # z-test via pooled SE
    p_pool <- (s_control + s_treat) / (n_control + n_treat)
    se <- sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treat))
    z_stat <- if (se > 0) (p_treat - p_control) / se else NA_real_
    p_value <- if (is.na(z_stat)) NA_real_ else 2 * (1 - pnorm(abs(z_stat)))

    cat(sprintf("control (%s) rate = %.4f, treat (%s) rate = %.4f\n", control_label, p_control, treat_label, p_treat))
    cat(sprintf("z = %.3f, p = %.4g\n", z_stat, p_value))
  } else {
    cat("Expected labels 'psa' and 'ad' not found; available groups are:\n")
    print(unique(grouped2$test_group))
  }
}
## # A tibble: 2 × 3
##   test_group successes      n
##   <chr>          <int>  <int>
## 1 ad             14423 564577
## 2 psa              420  23524
## control (psa) rate = 0.0179, treat (ad) rate = 0.0255
## z = 7.370, p = 1.705e-13

7) Power Analysis — Estimating Required Sample Size

What we are doing:
This step calculates how many users are needed in each group (A and B) to reliably detect a minimum meaningful change in conversion rate — called the Minimum Detectable Effect (MDE).
Here, we assume an MDE of 0.01 (i.e., a 1% absolute difference in conversion rate).

Purpose and significance:
Power analysis helps determine whether the experiment had a large enough sample to detect a real effect if it exists.
- p₀ = baseline conversion rate (Group A).
- p₁ = expected conversion rate if the campaign improves conversions by the MDE.
- Power (1 - β) = 0.8 → means an 80% chance of detecting the effect if it’s truly there.
- α (significance level) = 0.05 → 5% chance of a false positive.

If the required sample size is larger than what we collected, the test may be underpowered, meaning a non-significant result might be due to too little data rather than no effect.
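Under the hood, pwr.2p.test uses Cohen's h as its effect size and, for a balanced two-sided design, solves the standard relation:

\[ h = 2\left(\arcsin\sqrt{p_1} - \arcsin\sqrt{p_0}\right), \qquad n_{\text{per group}} = \left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{h}\right)^2 \]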

p0 <- p_control
mde <- 0.01
p1  <- p0 + mde

# Basic sanity checks
if (!is.finite(p0) || p0 <= 0 || p0 >= 1) {
  cat("Cannot run power analysis: baseline p0 must be between 0 and 1.\n")
} else if (!is.finite(p1) || p1 <= 0 || p1 >= 1) {
  cat("Cannot run power analysis: p0 + mde (p1) must be between 0 and 1.\n")
} else {
  cohens_h <- function(p1, p0) {
    2 * (asin(sqrt(p1)) - asin(sqrt(p0)))
  }
  h <- abs(cohens_h(p1, p0))

  # Use tryCatch to handle any unexpected errors from pwr.2p.test
  res <- tryCatch(
    pwr.2p.test(h = h, sig.level = 0.05, power = 0.80),
    error = function(e) e,
    warning = function(w) w
  )

  if (inherits(res, "error") || inherits(res, "warning") || is.null(res$n)) {
    cat("Could not compute required sample size: please check inputs (p0, mde) and installed 'pwr' package.\n")
    # diagnostics
    cat(sprintf("p0=%.6f, p1=%.6f, h=%.6g\n", p0, p1, h))
  } else {
    n_per_group <- ceiling(res$n)
    cat(sprintf("Required n per group to detect MDE=%.3f at 80%% power: %d\n", mde, n_per_group))
  }
}
## Required n per group to detect MDE=0.010 at 80% power: 3464

Interpretation of Power Analysis Result

The analysis indicates that 3,464 users per group are required to detect a 1% absolute change (MDE = 0.01) in conversion rate with 80% statistical power at a 5% significance level (α = 0.05).

Meaning:

- With 3,464 users in each group (control and treatment), there is an 80% chance of detecting a true 1% difference in conversion rate if it actually exists.
- With fewer than 3,464 users per group, the experiment may be underpowered, meaning a real effect could go undetected (higher risk of a Type II error).
- With more than 3,464 users per group, the test is sufficiently powered or even conservative, increasing confidence in detecting meaningful effects.

In short:
> The campaign experiment needs at least ≈6,928 total users (3,464 per group) to reliably detect a 1% improvement in conversion rate.
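One caveat: the groups here are far from balanced (564,577 ad vs 23,524 psa users), so the balanced-n figure is only a guide. A sketch of the achieved power for the actual group sizes, using pwr's unequal-sample-size variant and reusing p0 from the chunk above:

# Cohen's h for a 1% absolute lift over the psa baseline p0
h_act <- 2 * (asin(sqrt(p0 + 0.01)) - asin(sqrt(p0)))

# Achieved power given the actual, unequal group sizes
pwr::pwr.2p2n.test(h = h_act, n1 = 564577, n2 = 23524, sig.level = 0.05)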

Conclusion

The A/B test compared the performance of the new advertisement (ad) against the control public service announcement (psa) on user conversion rate.

  • Conversion rate results:

    • Control (psa): 1.79%
    • Treatment (ad): 2.55%
    • Absolute difference (ad − psa): +0.77 percentage points
    • Relative lift: ≈ +43% (equivalently, the psa baseline sits ≈30% below the ad group, which is the B-vs-A figure printed in section 5)
  • Statistical test:
    A two-proportion z-test returned z = 7.37, p = 1.7 × 10⁻¹³, with a 95% CI of [0.0060, 0.0094] for the ad − psa difference (section 4 reports the same interval with the sign flipped because it computed psa − ad), indicating a statistically significant difference at α = 0.05.
    Since the difference is positive, the campaign increased the conversion rate relative to the control.

  • Practical impact:
    The effect is statistically significant and favorable: the treatment ad outperformed the control by roughly 0.77 percentage points, a ≈43% relative increase in conversions.

  • Power analysis:
    The experiment requires approximately 3,464 users per group to detect a 1% absolute change with 80% power at α = 0.05. Both groups (564,577 and 23,524 users) exceed this requirement, making the result reliable.

Overall:
The advertisement (ad group) performed significantly better than the control (psa group) in terms of conversions.
Recommendation — roll out the ad creative, and monitor post-launch performance to confirm the lift holds at scale.