A compact A/B testing notebook — EDA + Core Tests
This notebook contains a concise, practical workflow I used to
evaluate a marketing A/B test. It focuses on the minimal, statistically
sound steps needed to decide whether a campaign moved the primary KPI
(conversion rate). I include brief notes and in-place interpretations to
make the analysis easy to follow.
Assumptions:
- Unit of analysis is the user. If your raw data has multiple events per user, aggregate first; a sketch follows below.
- Expected columns: user_id, test_group, and a logical converted flag. In this dataset the groups are labelled ad (treatment) and psa (control) rather than A and B.
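If you start from event-level logs instead of a user-level table, a minimal aggregation sketch (assuming a hypothetical events data frame with user_id, test_group, and a per-event logical converted flag; it is not part of marketing_AB.csv):
library(dplyr)
# events is hypothetical: one row per ad exposure
# events <- readr::read_csv("./events.csv", show_col_types = FALSE)
user_level <- events %>%
  group_by(user_id) %>%
  summarise(
    test_group = first(test_group),             # assignment is constant per user
    converted  = any(converted, na.rm = TRUE),  # a user converts if any event did
    .groups = "drop"
  )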
1) Load data
# Attach the packages used throughout (%>% and verbs from dplyr,
# plots from ggplot2, power analysis from pwr)
library(readr)
library(dplyr)
library(ggplot2)
library(pwr)

DATA_PATH <- "./marketing_AB.csv"
df_orgnl <- readr::read_csv(DATA_PATH, show_col_types = FALSE)
# display head
print(df_orgnl)
## # A tibble: 588,101 × 7
## ...1 `user id` `test group` converted `total ads` `most ads day`
## <dbl> <dbl> <chr> <lgl> <dbl> <chr>
## 1 0 1069124 ad FALSE 130 Monday
## 2 1 1119715 ad FALSE 93 Tuesday
## 3 2 1144181 ad FALSE 21 Tuesday
## 4 3 1435133 ad FALSE 355 Tuesday
## 5 4 1015700 ad FALSE 276 Friday
## 6 5 1137664 ad FALSE 734 Saturday
## 7 6 1116205 ad FALSE 264 Wednesday
## 8 7 1496843 ad FALSE 17 Sunday
## 9 8 1448851 ad FALSE 21 Tuesday
## 10 9 1446284 ad FALSE 142 Monday
## # ℹ 588,091 more rows
## # ℹ 1 more variable: `most ads hour` <dbl>
2) Preprocess & sanity checks
Sanity-check structure, missing values, and outliers; the collapse to a single row per user happens in section 3.
df <- df_orgnl
summary(df)
## ...1 user id test group converted
## Min. : 0 Min. : 900000 Length:588101 Mode :logical
## 1st Qu.:147025 1st Qu.:1143190 Class :character FALSE:573258
## Median :294050 Median :1313725 Mode :character TRUE :14843
## Mean :294050 Mean :1310692
## 3rd Qu.:441075 3rd Qu.:1484088
## Max. :588100 Max. :1654483
## total ads most ads day most ads hour
## Min. : 1.00 Length:588101 Min. : 0.00
## 1st Qu.: 4.00 Class :character 1st Qu.:11.00
## Median : 13.00 Mode :character Median :14.00
## Mean : 24.82 Mean :14.47
## 3rd Qu.: 27.00 3rd Qu.:18.00
## Max. :2065.00 Max. :23.00
colSums(is.na(df))
## ...1 user id test group converted total ads
## 0 0 0 0 0
## most ads day most ads hour
## 0 0
# Drop the index-like column if present. readr names it `...1`;
# a pandas export would call it "Unnamed: 0", so check for both.
idx_cols <- intersect(c("...1", "Unnamed: 0"), names(df))
if (length(idx_cols) > 0) df <- dplyr::select(df, -dplyr::all_of(idx_cols))
# Rename columns for easier access
rename_map <- c(
`user id` = "user_id",
`test group` = "test_group",
`total ads` = "total_ads",
`most ads day` = "most_ads_day",
`most ads hour` = "most_ads_hour"
)
for (old in names(rename_map)) {
new <- rename_map[[old]]
if (old %in% names(df)) names(df)[names(df) == old] <- new
}
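For reference, on dplyr >= 1.0 the same renaming can be done in one call; a sketch (any_of() silently skips names that are absent, so it is safe to re-run):
lookup <- c(
  user_id       = "user id",
  test_group    = "test group",
  total_ads     = "total ads",
  most_ads_day  = "most ads day",
  most_ads_hour = "most ads hour"
)
df <- dplyr::rename(df, dplyr::any_of(lookup))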
filter_outliers <- function(x, whisker = 1.5) {
q1 <- stats::quantile(x, 0.25, na.rm = TRUE)
q3 <- stats::quantile(x, 0.75, na.rm = TRUE)
iqr <- q3 - q1
lower <- q1 - whisker * iqr
upper <- q3 + whisker * iqr
which(x < lower | x > upper)
}
if ("total_ads" %in% names(df) && any(!is.na(df$total_ads))) {
idx <- filter_outliers(df$total_ads)
total_outliers <- length(idx)
total_records <- nrow(df)
pct_outliers <- total_outliers * 100 / total_records
cat(sprintf(
"\nOutlier Summary for total_ads:\n---\nTotal outliers: %d\nTotal records: %d\nPercentage of outliers: %.2f%%\n\n",
total_outliers, total_records, pct_outliers
))
# Compute whiskers for reference lines
q1 <- quantile(df$total_ads, 0.25, na.rm = TRUE)
q3 <- quantile(df$total_ads, 0.75, na.rm = TRUE)
iqr <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
# create a plotting dataframe with an explicit outlier flag and a constant x value
df_plot <- df %>%
mutate(
is_outlier = ifelse(!is.na(total_ads) & (total_ads < lower | total_ads > upper), TRUE, FALSE),
grp = factor("all"),
point_color = ifelse(is_outlier, "red", "black") # concrete color values to avoid scale issues
)
# Create visualization and force it to render
p <- ggplot(df_plot, aes(x = grp, y = total_ads)) +
geom_boxplot(fill = "#9ecae1", color = "#2171b5",
outlier.colour = "red", outlier.shape = 16, width = 0.25) +
# use identity scale by mapping to a concrete color column
geom_jitter(aes(color = point_color),
width = 0.15, alpha = 0.7, show.legend = FALSE, size = 1.5) +
scale_color_identity() +
geom_hline(yintercept = lower, linetype = "dashed", color = "red") +
geom_hline(yintercept = upper, linetype = "dashed", color = "red") +
labs(
title = "Outlier Detection — total_ads",
subtitle = sprintf("Detected %d outliers (%.2f%% of records)", total_outliers, pct_outliers),
y = "Total Ads",
x = NULL
) +
theme_minimal(base_size = 13) +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
print(p)
} else {
cat("No 'total_ads' column found or all values are NA - skipping outlier plot.\n")
}
##
## Outlier Summary for total_ads:
## ---
## Total outliers: 52057
## Total records: 588101
## Percentage of outliers: 8.85%
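Note the chunk above only flags and plots outliers; nothing is removed. If a trimmed copy were wanted for a sensitivity check, a minimal sketch reusing filter_outliers():
idx <- filter_outliers(df$total_ads)
# Guard the empty case: df[-integer(0), ] returns zero rows, not all rows
df_trimmed <- if (length(idx) > 0) df[-idx, ] else df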

3) Conversion summary and simple plots
# Build user-level frame
user_df <- df %>%
group_by(user_id) %>%
summarise(
test_group = dplyr::first(test_group),
converted = as.integer(max(converted, na.rm = TRUE))
) %>%
ungroup()
cat("\nTotal users:", nrow(user_df), "\n")
##
## Total users: 588101
user_df %>% group_by(test_group) %>% summarise(n_users = n(), conversions = sum(converted, na.rm = TRUE), .groups = "drop")
## # A tibble: 2 × 3
## test_group n_users conversions
## <chr> <int> <int>
## 1 ad 564577 14423
## 2 psa 23524 420
cat("\nMissing values by column:\n")
##
## Missing values by column:
sapply(user_df, function(x) sum(is.na(x)))
## user_id test_group converted
## 0 0 0
if (nrow(user_df) > 0) {
conv_by_group <- user_df %>%
group_by(test_group) %>%
summarise(conv_rate = mean(converted, na.rm = TRUE),
sum = sum(converted, na.rm = TRUE),
count = n(), .groups = "drop")
print(conv_by_group)
# Avoid naming the plot `c`, which masks base::c()
p_conv <- ggplot(conv_by_group, aes(x = test_group, y = conv_rate)) +
geom_col(fill = "#3182bd") +
labs(title = "Conversion rate by test_group", x = "Group", y = "Conversion rate") +
ylim(0, max(conv_by_group$conv_rate, na.rm = TRUE) * 1.5) +
theme_minimal()
print(p_conv)
}
## # A tibble: 2 × 4
## test_group conv_rate sum count
## <chr> <dbl> <int> <int>
## 1 ad 0.0255 14423 564577
## 2 psa 0.0179 420 23524
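For reference, per-group uncertainty can be attached with exact (Clopper-Pearson) 95% intervals from base R's binom.test; a small sketch over the conv_by_group counts printed above:
conv_ci <- conv_by_group %>%
  rowwise() %>%
  mutate(
    ci_low  = binom.test(sum, count)$conf.int[1],  # `sum`/`count` are columns above
    ci_high = binom.test(sum, count)$conf.int[2]
  ) %>%
  ungroup()
print(conv_ci)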

4) Main test — two-proportion z-test (conversion rate)
Purpose: test whether the conversion rate differs between the two groups.
Hypotheses and definitions:
- pA — true conversion rate (probability of conversion) of the first group.
- pB — true conversion rate of the second group.
- H₀ (null hypothesis) — there is no difference between the two groups; any observed difference is due to random chance.
- H₁ (alternative hypothesis) — there is a difference between the two groups, meaning the campaign had a real effect.
Note: the code below orders the groups alphabetically, so in this dataset Group A = ad (the treatment) and Group B = psa (the control). Keep this in mind when reading the sign of the difference.
Formally: \[
H_0 : p_A = p_B
\] \[
H_1 : p_A \neq p_B
\]
Interpretation:
- If the p-value < 0.05, reject H₀ → the campaign had a statistically significant effect.
- If the p-value ≥ 0.05, fail to reject H₀ → insufficient evidence of campaign impact.
if (nrow(user_df) > 0) {
grouped <- user_df %>% group_by(test_group) %>% summarise(successes = sum(converted, na.rm = TRUE), n = n(), .groups = "drop")
print(grouped)
# Ensure exactly two groups
if (nrow(grouped) == 2) {
# Order by test_group for stable indexing
grouped <- grouped %>% arrange(test_group)
successes <- grouped$successes
nobs <- grouped$n
pA <- successes[1] / nobs[1]
pB <- successes[2] / nobs[2]
diff <- pB - pA
# Pooled standard error under H0: pA == pB
p_pool <- sum(successes) / sum(nobs)
se_pooled <- sqrt(p_pool * (1 - p_pool) * (1 / nobs[1] + 1 / nobs[2]))
z_stat <- if (se_pooled > 0) diff / se_pooled else NA_real_
p_value <- if (is.na(z_stat)) NA_real_ else 2 * (1 - pnorm(abs(z_stat)))
# Unpooled CI for difference
se_unpooled <- sqrt(pA * (1 - pA) / nobs[1] + pB * (1 - pB) / nobs[2])
ci_low <- diff - 1.96 * se_unpooled
ci_high <- diff + 1.96 * se_unpooled
cat(sprintf("p_A = %.4f, p_B = %.4f\n", pA, pB))
cat(sprintf("Difference (B - A) = %.4f\n", diff))
cat(sprintf("z = %.3f, p = %.4g\n", z_stat, p_value))
cat(sprintf("95%% CI for difference = [%.4f, %.4f]\n", ci_low, ci_high))
if (!is.na(p_value) && p_value < 0.05) {
cat("\nConclusion: Significant difference at alpha = 0.05.\n")
} else {
cat("\nConclusion: No significant difference at alpha = 0.05.\n")
}
}
}
## # A tibble: 2 × 3
## test_group successes n
## <chr> <int> <int>
## 1 ad 14423 564577
## 2 psa 420 23524
## p_A = 0.0255, p_B = 0.0179
## Difference (B - A) = -0.0077
## z = -7.370, p = 1.705e-13
## 95% CI for difference = [-0.0094, -0.0060]
##
## Conclusion: Significant difference at alpha = 0.05.
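As a quick cross-check of the manual computation, base R's prop.test runs the same two-proportion test; with the continuity correction disabled its X-squared statistic equals z² from above:
# Cross-check: stats::prop.test on the grouped counts
# (correct = FALSE matches the pooled z-test)
prop.test(x = grouped$successes, n = grouped$n, correct = FALSE)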
5) Effect Size — Measuring Business Impact
What we are doing:
After finding a statistically significant difference, we calculate how big that difference is in practical terms. This step measures the absolute difference and relative lift between the two groups, using the same alphabetical A/B ordering as section 4 (A = ad, B = psa).
Definitions:
- Absolute Difference (diff) — the direct difference in conversion rates between groups (pB − pA).
- Relative Lift (%) — the percentage change of Group B relative to Group A's baseline, computed as (pB − pA) / pA × 100. Because B is the psa control here, a negative lift means the control converted less than the ad group.
Why it matters:
Statistical significance alone doesn't tell us whether the improvement is meaningful for the business.
Effect size translates the result into actionable terms — e.g., an absolute difference of under one percentage point can look small yet still drive major revenue impact at this traffic scale.
if (exists("diff") && exists("pA")) {
abs_diff <- diff
pct_lift <- if (!is.null(pA) && !is.na(pA) && pA != 0) (diff / pA) * 100 else NA_real_
cat(sprintf("Absolute difference: %.4f\n", abs_diff))
cat(sprintf("Relative lift (B vs A): %s\n", ifelse(is.na(pct_lift), "NA", sprintf("%.2f%%", pct_lift))))
}
## Absolute difference: -0.0077
## Relative lift (B vs A): -30.11%
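Because the alphabetical ordering makes A the ad group, the −30.11% above reads as "psa converts about 30% less than ad". Re-expressed against the control (psa) baseline, which is often the more intuitive framing, the lift is roughly +43%; a one-line sketch using the section-4 variables:
# Lift of the ad group (pA) over the psa control baseline (pB)
lift_vs_control <- (pA - pB) / pB * 100
cat(sprintf("Relative lift (ad vs psa baseline): %.2f%%\n", lift_vs_control))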
names(user_df)
## [1] "user_id" "test_group" "converted"
6) Conversion Rate Comparison (Control vs Treatment)
This section measures whether the experimental advertisement led to a
different conversion rate compared to the control (PSA).
Steps performed below:
1. Group the dataset by test_group and calculate:
   - total successes (number of conversions),
   - total n (number of observations per group).
2. Compute conversion rates for each group.
3. Run a two-proportion z-test to check whether the difference between conversion rates is statistically significant.
The printed output includes:
- conversion rates for the control (psa) and treatment (ad) groups, and
- the z-statistic and p-value from the test.
Interpretation: If the p-value is less than 0.05, we
reject the null hypothesis and conclude that the treatment (ad) produced
a statistically significant difference in conversion rate compared to
the control.
control_label <- "psa"
treat_label <- "ad"
if (nrow(user_df) > 0) {
grouped2 <- user_df %>% group_by(test_group) %>% summarise(successes = sum(converted, na.rm = TRUE), n = n(), .groups = "drop")
print(grouped2)
if (all(c(control_label, treat_label) %in% grouped2$test_group)) {
s_control <- grouped2$successes[grouped2$test_group == control_label]
n_control <- grouped2$n[grouped2$test_group == control_label]
s_treat <- grouped2$successes[grouped2$test_group == treat_label]
n_treat <- grouped2$n[grouped2$test_group == treat_label]
p_control <- s_control / n_control
p_treat <- s_treat / n_treat
# z-test via pooled SE
p_pool <- (s_control + s_treat) / (n_control + n_treat)
se <- sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treat))
z_stat <- if (se > 0) (p_treat - p_control) / se else NA_real_
p_value <- if (is.na(z_stat)) NA_real_ else 2 * (1 - pnorm(abs(z_stat)))
cat(sprintf("control (%s) rate = %.4f, treat (%s) rate = %.4f\n", control_label, p_control, treat_label, p_treat))
cat(sprintf("z = %.3f, p = %.4g\n", z_stat, p_value))
} else {
cat("Expected labels 'psa' and 'ad' not found; available groups are:\n")
print(unique(grouped2$test_group))
}
}
## # A tibble: 2 × 3
## test_group successes n
## <chr> <int> <int>
## 1 ad 14423 564577
## 2 psa 420 23524
## control (psa) rate = 0.0179, treat (ad) rate = 0.0255
## z = 7.370, p = 1.705e-13
7) Power Analysis — Estimating Required Sample Size
What we are doing:
This step calculates how many users are needed in each group (A and B)
to reliably detect a minimum meaningful change in conversion rate —
called the Minimum Detectable Effect (MDE).
Here, we assume an MDE of 0.01 (i.e., a 1% absolute
difference in conversion rate).
Purpose and significance:
Power analysis helps determine whether the experiment had a
large enough sample to detect a real effect if it
exists.
- p₀ = baseline conversion rate (here, the control/psa rate, p_control).
- p₁ = expected conversion rate if the campaign improves conversions by the MDE (p₀ + 0.01).
- Power (1 − β) = 0.8 → an 80% chance of detecting the effect if it is truly there.
- α (significance level) = 0.05 → a 5% chance of a false positive.
If the required sample size is larger than
what we collected, the test may be
underpowered, meaning a non-significant result might be
due to too little data rather than no effect.
p0 <- p_control
mde <- 0.01
p1 <- p0 + mde
# Basic sanity checks
if (!is.finite(p0) || p0 <= 0 || p0 >= 1) {
cat("Cannot run power analysis: baseline p0 must be between 0 and 1.\n")
} else if (!is.finite(p1) || p1 <= 0 || p1 >= 1) {
cat("Cannot run power analysis: p0 + mde (p1) must be between 0 and 1.\n")
} else {
cohens_h <- function(p1, p0) {
2 * (asin(sqrt(p1)) - asin(sqrt(p0)))
}
h <- abs(cohens_h(p1, p0))
# Use tryCatch to handle any unexpected errors from pwr.2p.test
res <- tryCatch(
pwr.2p.test(h = h, sig.level = 0.05, power = 0.80),
error = function(e) e,
warning = function(w) w
)
if (inherits(res, "error") || inherits(res, "warning") || is.null(res$n)) {
cat("Could not compute required sample size: please check inputs (p0, mde) and installed 'pwr' package.\n")
# diagnostics
cat(sprintf("p0=%.6f, p1=%.6f, h=%.6g\n", p0, p1, h))
} else {
n_per_group <- ceiling(res$n)
cat(sprintf("Required n per group to detect MDE=%.3f at 80%% power: %d\n", mde, n_per_group))
}
}
## Required n per group to detect MDE=0.010 at 80% power: 3464
Interpretation of Power Analysis Result
The analysis indicates that 3,464 users per group are required to detect a 1% absolute change (MDE = 0.01) in conversion rate with 80% statistical power at a 5% significance level (α = 0.05).
Meaning:
- With 3,464 users in each group (control and treatment), there is an 80% chance of detecting a true 1% difference in conversion rate if it actually exists.
- With fewer than 3,464 users per group, the test may be underpowered, meaning a real effect could go undetected (higher risk of a Type II error).
- With more than 3,464 users per group, the test is sufficiently powered or even conservative, increasing confidence in detecting meaningful effects.
Caveat: pwr.2p.test assumes equal-sized arms. Both arms here clear the threshold (564,577 ad and 23,524 psa users), but with a split this unbalanced the smaller psa arm limits the effective power; see the sketch below.
In short:
> The campaign experiment needs at least ≈6,928 total users (3,464 per group) to reliably detect a 1% improvement in conversion rate.
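Since pwr.2p.test assumes balanced arms and this experiment is heavily unbalanced, the achieved power for the observed group sizes can be estimated with pwr::pwr.2p2n.test; a sketch reusing h from the chunk above:
# Achieved power at the assumed MDE given the actual (unequal) arm sizes
res_obs <- pwr::pwr.2p2n.test(h = h, n1 = 564577, n2 = 23524, sig.level = 0.05)
res_obs$power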
8) Conclusion
The A/B test compared the performance of the new advertisement
(ad) against the control public service announcement
(psa) on user conversion rate.
Conversion rate results:
- Control (psa): 1.79%
- Treatment (ad): 2.55%
- Absolute difference (ad − psa): +0.77 percentage points
- Relative lift: ≈ +43% for the ad over the psa baseline (equivalently, the psa group converted about 30% less than the ad group, the −30.11% figure reported above)
Statistical test:
A two-proportion z-test returned |z| = 7.37, p =
1.7 × 10⁻¹³, with a 95% CI of [−0.0094,
−0.0060] for the (psa − ad) difference, indicating a statistically significant difference at
α = 0.05.
Since the ad group's rate is higher, the campaign increased the
conversion rate relative to the control.
Practical impact:
The effect is statistically significant and favorable:
the treatment ad outperformed the control by roughly
0.77 percentage points, a ~43% relative
increase over the control's baseline conversion rate.
Power analysis:
The experiment would need approximately 3,464 users per
group to detect a 1% absolute change with 80% power at α =
0.05. Both arms exceed this requirement, so the result is unlikely to be
an artifact of an underpowered test.
Overall:
The advertisement (ad group) performed significantly better than the
control (psa group) in terms of conversions.
Recommendation — keep the ad creative in rotation, and test any
follow-up iteration against it as the new baseline.