Ch. 1 - Mini case study in A/B Testing

Introduction

[Video]

Goals of A/B testing

What does the “hypothesis” for an A/B testing experiment refer to?

  • The variable you are going to measure.
  • The problem you are interested in testing.
  • The variable you are going to differ between conditions.
  • [*] What you think will happen as a result of the experiment.

Preliminary data exploration

# Load tidyverse
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.4
## ✓ tibble  3.0.1     ✓ dplyr   0.8.5
## ✓ tidyr   1.1.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# Read in data
click_data <- read_csv("click_data.csv")
## Parsed with column specification:
## cols(
##   visit_date = col_date(format = ""),
##   clicked_adopt_today = col_double()
## )
click_data
## # A tibble: 3,650 x 2
##    visit_date clicked_adopt_today
##    <date>                   <dbl>
##  1 2017-01-01                   1
##  2 2017-01-02                   1
##  3 2017-01-03                   0
##  4 2017-01-04                   1
##  5 2017-01-05                   1
##  6 2017-01-06                   0
##  7 2017-01-07                   0
##  8 2017-01-08                   0
##  9 2017-01-09                   0
## 10 2017-01-10                   0
## # … with 3,640 more rows
# Find oldest and most recent date
min(click_data$visit_date)
## [1] "2017-01-01"
max(click_data$visit_date)
## [1] "2017-12-31"

Baseline conversion rates

[Video]

Current conversion rate day of week

# Read in the data
click_data <- read_csv("click_data.csv")
## Parsed with column specification:
## cols(
##   visit_date = col_date(format = ""),
##   clicked_adopt_today = col_double()
## )
# Load lubridate for wday(); tidyverse 1.3.0 does not attach it
library(lubridate)
# Calculate the mean conversion rate by day of the week
click_data %>%
  group_by(wday(visit_date)) %>%
  summarize(conversion_rate = mean(clicked_adopt_today))
## # A tibble: 7 x 2
##   `wday(visit_date)` conversion_rate
##                <dbl>           <dbl>
## 1                  1           0.3  
## 2                  2           0.277
## 3                  3           0.271
## 4                  4           0.298
## 5                  5           0.271
## 6                  6           0.267
## 7                  7           0.256
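
The grouping column above is numeric, which is easy to misread: under lubridate's default settings, wday() numbers days from 1 = Sunday to 7 = Saturday. As a minimal sketch, the same numbering can be reproduced in base R with readable labels attached. The `toy_clicks` data frame is made up for illustration (its first ten values follow the data preview shown earlier, the last four are invented), so the rates it produces are not the course results.

```r
# Toy data (hypothetical): one visit per day for the first two weeks of 2017
toy_clicks <- data.frame(
  visit_date = seq(as.Date("2017-01-01"), by = "day", length.out = 14),
  clicked_adopt_today = c(1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0)
)

# as.POSIXlt()$wday runs 0 (Sunday) to 6 (Saturday); adding 1 gives 1 = Sunday
toy_clicks$day_number <- as.POSIXlt(toy_clicks$visit_date)$wday + 1

# Attach readable labels so the summary does not depend on remembering the coding
day_labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
toy_clicks$day_name <- factor(day_labels[toy_clicks$day_number],
                              levels = day_labels)

# Mean conversion rate per labelled day of the week
toy_rates <- aggregate(clicked_adopt_today ~ day_name,
                       data = toy_clicks, FUN = mean)
toy_rates
```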

Current conversion rate week

# Read in the data
click_data <- read_csv("click_data.csv")
## Parsed with column specification:
## cols(
##   visit_date = col_date(format = ""),
##   clicked_adopt_today = col_double()
## )
# Calculate the mean conversion rate by week of the year
click_data %>%
  group_by(week(visit_date)) %>%
  summarize(conversion_rate = mean(clicked_adopt_today))
## # A tibble: 53 x 2
##    `week(visit_date)` conversion_rate
##                 <dbl>           <dbl>
##  1                  1           0.229
##  2                  2           0.243
##  3                  3           0.171
##  4                  4           0.129
##  5                  5           0.157
##  6                  6           0.186
##  7                  7           0.257
##  8                  8           0.171
##  9                  9           0.186
## 10                 10           0.2  
## # … with 43 more rows
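
The 53 rows follow from how lubridate's week() counts: the number of complete seven-day periods since January 1, plus one. A 365-day year therefore ends with a week 53 that contains a single day, so that week's rate rests on far less data than the others. Below is a base R sketch of the same counting rule; the helper name `week_of_year` is hypothetical.

```r
# Hypothetical helper mirroring the rule: complete 7-day periods since Jan 1, plus one
week_of_year <- function(dates) {
  day_of_year <- as.POSIXlt(dates)$yday + 1  # yday is 0-based
  (day_of_year - 1) %/% 7 + 1
}

# Every calendar day of 2017, the year covered by click_data
days_2017 <- seq(as.Date("2017-01-01"), as.Date("2017-12-31"), by = "day")
days_per_week <- table(week_of_year(days_2017))

# Weeks 1 to 52 hold seven days each; week 53 holds only December 31
tail(days_per_week, 3)
```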

Plotting conversion rate seasonality

# Load lubridate for week() and scales for percent
library(lubridate)
library(scales)

# Compute conversion rate by week of the year
click_data_sum <- click_data %>%
  group_by(week(visit_date)) %>%
  summarize(conversion_rate = mean(clicked_adopt_today))

# Build plot
ggplot(click_data_sum, aes(x = `week(visit_date)`,
                           y = conversion_rate)) +
  geom_point() +
  geom_line() +
  scale_y_continuous(limits = c(0, 1),
                     labels = percent)

Experimental design, power analysis

[Video]

Randomized vs. sequential

You’re designing a new experiment and you have two conditions. What’s the best method for comparing your two conditions?

  • Run control condition for a month and then test condition for a month.
  • Use old data as control condition and run test condition for a month.
  • [*] Run control and test conditions simultaneously for two months.
  • Run control condition for a month, wait a year, then run test condition for a month.
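
Running both conditions at the same time only controls for seasonality if visitors are assigned to conditions at random. Here is a minimal sketch of balanced random assignment in base R; the visitor count and the column names are assumptions for illustration, not part of the course data.

```r
set.seed(42)  # fixed seed so the assignment is reproducible

n_visitors <- 100  # hypothetical number of visitors in the experiment

# Shuffle an equal number of "control" and "test" labels across visitors
assignment <- data.frame(
  visitor_id = seq_len(n_visitors),
  condition = sample(rep(c("control", "test"), each = n_visitors / 2))
)

# Both conditions run over the same period with the same group size
table(assignment$condition)
```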

SSizeLogisticBin() documentation

Let’s take a moment to learn more about the SSizeLogisticBin() function from powerMediation that was introduced in the slides. Look at the documentation page for the SSizeLogisticBin() function by calling help(SSizeLogisticBin). The powerMediation package is pre-loaded for you. What is another way to phrase what p2 signifies?

  • The probability when X = 0 (the control condition).
  • [*] The probability when X = 1 (the test condition).
  • The proportion of the dataset where X = 1.
  • The Type I error rate.
  • The power for testing if the odds ratio is equal to one.

Power analysis August

# Load powerMediation
library(powerMediation)

# Compute and look at sample size for experiment in August
total_sample_size <- SSizeLogisticBin(p1 = 0.54,
                                      p2 = 0.64,
                                      B = 0.5,
                                      alpha = 0.05,
                                      power = 0.8)
total_sample_size
## [1] 758

Power analysis August 5 percentage point increase

# Load powerMediation
library(powerMediation)

# Compute and look at sample size for experiment in August with a 5 percentage point increase
total_sample_size <- SSizeLogisticBin(p1 = 0.54,
                                      p2 = 0.59,
                                      B = 0.5,
                                      alpha = 0.05,
                                      power = 0.8)
total_sample_size
## [1] 3085
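
Halving the expected lift from 10 to 5 percentage points roughly quadruples the required sample, because the sample size scales with the inverse square of the difference between p1 and p2. The sketch below writes out a closed-form sample-size formula for logistic regression with a binary predictor. It is an assumption that this matches what SSizeLogisticBin() computes internally, although it does reproduce both totals above. The helper name `sample_size_logistic_bin` is hypothetical.

```r
# Hypothetical helper: closed-form sample size for a binary predictor (assumed formula)
# p1 = event rate when X = 0, p2 = event rate when X = 1, B = proportion with X = 1
sample_size_logistic_bin <- function(p1, p2, B = 0.5, alpha = 0.05, power = 0.8) {
  z_alpha <- qnorm(1 - alpha / 2)   # two-sided critical value
  z_power <- qnorm(power)           # quantile for the desired power
  p_bar <- (1 - B) * p1 + B * p2    # overall event rate across both conditions

  numerator <- (z_alpha * sqrt(p_bar * (1 - p_bar) / B) +
                  z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2) * (1 - B) / B))^2
  denominator <- (p1 - p2)^2 * (1 - B)

  ceiling(numerator / denominator)  # round up to whole visitors
}

# The two August scenarios from the exercises above
n_10_points <- sample_size_logistic_bin(p1 = 0.54, p2 = 0.64)
n_5_points <- sample_size_logistic_bin(p1 = 0.54, p2 = 0.59)
c(n_10_points, n_5_points)
```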

Ch. 2 - Mini case study in A/B Testing Part 2

Analyzing results

[Video]

Plotting results

# Group and summarize data
experiment_data_clean_sum <- experiment_data_clean %>%
  group_by(visit_date, condition) %>%
  summarize(conversion_rate = mean(clicked_adopt_today))

# Make plot of conversion rates over time
ggplot(experiment_data_clean_sum,
       aes(x = visit_date,
           y = conversion_rate,
           color = condition,
           group = condition)) +
  geom_point() +
  geom_line()

glm() documentation

To analyze our results, we used the function glm() and set family to binomial. Take a look at the documentation using ?glm to see exactly what the family argument is.

  • An optional vector of ‘prior weights’ to be used in the fitting process.
  • A symbolic description of the model to be fitted.
  • [*] A description of the error distribution and link function to be used in the model.
  • A list of parameters for controlling the fitting process.

Practice with glm()

# Load package for cleaning model results
library(broom)

# View summary of results
experiment_data_clean %>%
  group_by(condition) %>%
  summarize(conversion_rate = mean(clicked_adopt_today))
## # A tibble: 2 x 2
##   condition conversion_rate
##   <fct>               <dbl>
## 1 control             0.166
## 2 test                0.386
# Run logistic regression
experiment_results <- glm(clicked_adopt_today ~ condition,
                          family = "binomial",
                          data = experiment_data_clean) %>%
  tidy()
experiment_results
## # A tibble: 2 x 5
##   term          estimate std.error statistic  p.value
##   <chr>            <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)      -1.61     0.160    -10.1  5.38e-24
## 2 conditiontest     1.15     0.201      5.72 1.04e- 8
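
The estimates are on the log-odds scale, so they are not directly comparable to the conversion rates in the summary table. As a quick check, the inverse logit (plogis() in base R) maps them back to probabilities, and exponentiating the condition coefficient gives an odds ratio. The numbers below are copied from the rounded output above, so the results agree with the summary table only approximately.

```r
# Coefficients copied from the tidy() output above (rounded)
intercept <- -1.61        # log-odds of clicking in the control condition
condition_test <- 1.15    # change in log-odds for the test condition

# Inverse logit converts log-odds back to probabilities
p_control <- plogis(intercept)
p_test <- plogis(intercept + condition_test)

# Odds of clicking in test relative to control
odds_ratio <- exp(condition_test)

round(c(control = p_control, test = p_test, odds_ratio = odds_ratio), 3)
```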

Designing follow-up experiments

[Video]

Follow-up experiment 1 design

You’re now designing your follow-up experiments. What’s the best path forward?

  • [*] Build one experiment where your test condition is a kitten in a hat. If the experiment works, run a second experiment with a kitten in a hat as the control and two kittens in hats as the test.
  • Build one experiment where your control is the cat in a hat and the test is two kittens in hats.
  • Build one experiment but with three conditions. Control is the cat in a hat, test one is a kitten in a hat, and test two is two kittens in hats.
  • Build one experiment but with three conditions. Control is the cat in a hat, test one is a kitten in a hat, and test two is two adult cats in hats.

Follow-up experiment 1 power analysis

# Load package for running power analysis
library(powerMediation)

# Run logistic regression power analysis
total_sample_size <- SSizeLogisticBin(p1 = 0.39,
                                      p2 = 0.59,
                                      B = 0.5,
                                      alpha = 0.05,
                                      power = 0.8)
total_sample_size
## [1] 194

Follow-up experiment 1 analysis

# Read in data for follow-up experiment
followup_experiment_data <- read_csv("followup_experiment_data.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   visit_date = col_date(format = ""),
##   condition = col_character(),
##   clicked_adopt_today = col_double()
## )
# View conversion rates by condition
followup_experiment_data %>%
  group_by(condition) %>%
  summarize(conversion_rate = mean(clicked_adopt_today))
## # A tibble: 2 x 2
##   condition  conversion_rate
##   <chr>                <dbl>
## 1 cat_hat              0.814
## 2 kitten_hat           0.876
# Run logistic regression
followup_experiment_results <- glm(clicked_adopt_today ~ condition,
                                   family = "binomial",
                                   data = followup_experiment_data) %>%
  tidy()
followup_experiment_results
## # A tibble: 2 x 5
##   term                estimate std.error statistic      p.value
##   <chr>                  <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept)            1.48      0.261      5.66 0.0000000149
## 2 conditionkitten_hat    0.479     0.404      1.18 0.236
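
The kitten condition converts about six points higher, but a p-value of 0.236 means this experiment does not give evidence of a difference. In this output the statistic column is the estimate divided by its standard error, and the p-value is the two-sided normal tail area beyond it. Here is a sketch using the rounded values from the output above:

```r
# Values copied from the conditionkitten_hat row above (rounded)
estimate <- 0.479
std_error <- 0.404

# Wald z statistic and its two-sided p-value under a standard normal
z_statistic <- estimate / std_error
p_value <- 2 * pnorm(-abs(z_statistic))

round(c(z = z_statistic, p = p_value), 3)
```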

Pre-follow-up experiment assumptions

[Video]

Plot 8 months data

# Load lubridate for month()
library(lubridate)

# Compute monthly summary
eight_month_checkin_data_sum <- eight_month_checkin_data %>%
  mutate(month_text = month(visit_date, label = TRUE)) %>%
  group_by(month_text, condition) %>%
  summarize(conversion_rate = mean(clicked_adopt_today))
## Warning: Factor `month_text` contains implicit NA, consider using
## `forcats::fct_explicit_na`
# Plot month-over-month results
ggplot(eight_month_checkin_data_sum,
       aes(x = month_text,
           y = conversion_rate,
           color = condition,
           group = condition)) +
  geom_point() +
  geom_line()

Plot styling 1

# Load scales for percent axis labels
library(scales)

# Plot monthly summary
ggplot(eight_month_checkin_data_sum,
       aes(x = month_text,
           y = conversion_rate,
           color = condition,
           group = condition)) +
  geom_point() +
  geom_line() +
  scale_y_continuous(limits = c(0, 1),
                     labels = percent) +
  labs(x = "Month",
       y = "Conversion Rate")

Plot styling 2

# Plot monthly summary
ggplot(eight_month_checkin_data_sum,
       aes(x = month_text,
           y = conversion_rate,
           color = condition,
           group = condition)) +
  geom_point(size = 4) +
  geom_line(lwd = 1) +
  scale_y_continuous(limits = c(0, 1),
                     labels = percent) +
  labs(x = "Month",
       y = "Conversion Rate")

Follow-up experiment assumptions

[Video]

Conversion rate between years

# Compute difference over time
no_hat_data_diff <- no_hat_data_sum %>%
  spread(year, conversion_rate) %>%
  mutate(year_diff = `2018` - `2017`)
no_hat_data_diff
##    month      2017      2018    year_diff
## 1    Apr 0.1433333 0.1366667 -0.006666666
## 2    Aug 0.5064516 0.5806452  0.074193548
## 3    Dec 0.4451613        NA           NA
## 4    Feb 0.1678571 0.2250000  0.057142857
## 5    Jan 0.1774194 0.1645161 -0.012903226
## 6    Jul 0.3903226 0.3451613 -0.045161291
## 7    Jun 0.2900000 0.3066667  0.016666667
## 8    Mar 0.1290323 0.1354839  0.006451613
## 9    May 0.2516129 0.2677419  0.016129032
## 10   Nov 0.2300000        NA           NA
## 11   Oct 0.2000000        NA           NA
## 12   Sep 0.2966667        NA           NA
# Compute summary statistics
mean(no_hat_data_diff$year_diff, na.rm = TRUE)
## [1] 0.01323157
sd(no_hat_data_diff$year_diff, na.rm = TRUE)
## [1] 0.03817146
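
The average year-over-year change is about 1.3 percentage points, small next to a standard deviation of about 3.8 points across months. As a sketch, the summary statistics can be reproduced from the eight months that have data for both years. The standard error of the mean difference is an addition that is not part of the original analysis.

```r
# Year-over-year differences copied from the table above (months with 2018 data only)
year_diff <- c(Apr = -0.006666666, Aug = 0.074193548, Feb = 0.057142857,
               Jan = -0.012903226, Jul = -0.045161291, Jun = 0.016666667,
               Mar = 0.006451613, May = 0.016129032)

mean_diff <- mean(year_diff)
sd_diff <- sd(year_diff)

# Standard error of the mean difference across the eight months
se_diff <- sd_diff / sqrt(length(year_diff))

c(mean = mean_diff, sd = sd_diff, se = se_diff)
```

The mean difference is roughly one standard error from zero, which is consistent with treating the previous year's rates as a reasonable baseline for the follow-up power analysis.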

Re-run power analysis for follow-up

# Load package for power analysis
library(powerMediation)

# Run power analysis for logistic regression
total_sample_size <- SSizeLogisticBin(p1 = 0.49,
                                      p2 = 0.64,
                                      B = 0.5,
                                      alpha = 0.05,
                                      power = 0.8)
total_sample_size
## [1] 341

Re-run glm() for follow-up

# Load package to clean up model outputs
library(broom)

# View summary of data
followup_experiment_data_sep %>%
  group_by(condition) %>%
  summarize(conversion_rate = mean(clicked_adopt_today))
## # A tibble: 2 x 2
##   condition  conversion_rate
##   <fct>                <dbl>
## 1 cat_hat              0.468
## 2 kitten_hat           0.614
# Run logistic regression
followup_experiment_sep_results <- glm(clicked_adopt_today ~ condition,
                                       family = "binomial",
                                       data = followup_experiment_data_sep) %>%
  tidy()
followup_experiment_sep_results
## # A tibble: 2 x 5
##   term                estimate std.error statistic p.value
##   <chr>                  <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)           -0.129     0.153    -0.841 0.401  
## 2 conditionkitten_hat    0.593     0.219     2.70  0.00688

Ch. 3 - Experimental Design in A/B Testing

A/B testing research questions

Article click frequency monthly

‘Like’ click frequency plot

‘Like’ / ‘Share’ click frequency plot

Assumptions and types of A/B testing

Between vs. within

Plotting A/A data

Analyzing A/A data

Confounding variables

Examples of confounding variables

Confounding variable example analysis

Confounding variable example plotting

Side effects

Confounding variable vs. side effect

Side effect load time plot

Side effects experiment plot


Ch. 4 - Statistical Analyses in A/B Testing

Power analyses

Logistic regression power analysis

pwr.t.test() documentation

T-test power analysis

Statistical tests

Logistic regression

T-test

Stopping rules and sequential analysis

What is a sequential analysis?

Sequential analysis three looks

Sequential analysis sample sizes

Multivariate testing

Plotting time homepage in multivariate experiment

Plotting ‘like’ clicks in multivariate experiment

Multivariate design statistical test

A/B Testing Recap


About Michael Mallari

Michael is a hybrid thinker and doer—a byproduct of being a StrengthsFinder “Learner” over time. With 20+ years of engineering, design, and product experience, he helps organizations identify market needs, mobilize internal and external resources, and deliver delightful digital customer experiences that align with business goals. He has been entrusted with problem-solving for brands—ranging from Fortune 500 companies to early-stage startups to not-for-profit organizations.

Michael earned his BS in Computer Science from New York Institute of Technology and his MBA from the University of Maryland, College Park. He is also a candidate to receive his MS in Applied Analytics from Columbia University.

www.michaelmallari.com/data | www.columbia.edu/~mm5470