A/B Testing in R

DataCamp: Statistics with R

Bonnie Cooper

library( dplyr )
library( ggplot2 )
library( gridExtra )
library( tidyverse )
library( lubridate )
library( scales )
library( powerMediation )
library( broom )
library( pwr )
library( gsDesign )

Mini case study in A/B Testing

Intro

A/B testing is not something that is done just once. It is an iterative process that is cycled through constantly in an effort to optimize conversion rates and other metrics.

Think of New Ideas to Test \(\longrightarrow\) Run Experiments \(\longrightarrow\) Statistically Analyze Results \(\longrightarrow\) Update to the Winning Idea \(\longrightarrow\) rinse \(\longrightarrow\) lather\(\longrightarrow\) repeat!

Clickthrough: did someone click the thing?
Clickthrough Rate: \(\frac{ \mbox{# site visitors who performed an action}}{\mbox{total # site visitors}}\)

A purrrrfect example of A/B testing:

  • Question: will changing the homepage photo result in more ‘ADOPT TODAY’ clicks?
  • Hypothesis: using a photo of a cat wearing a hat will result in more ‘ADOPT TODAY’ clicks
  • Dependent variable: clicked ‘ADOPT TODAY’ button or not
  • Independent variable: homepage photo
url <- 'https://assets.datacamp.com/production/repositories/2292/datasets/4407050e9b8216249a6d5ff22fd67fd4c44e7301/click_data.csv'
click_data <- read_csv( url )
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   visit_date = col_date(format = ""),
##   clicked_adopt_today = col_double()
## )
glimpse( click_data )
## Rows: 3,650
## Columns: 2
## $ visit_date          <date> 2017-01-01, 2017-01-02, 2017-01-03, 2017-01-04, …
## $ clicked_adopt_today <dbl> 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

EDA click_data

# Find oldest and most recent date
min(click_data$visit_date)
## [1] "2017-01-01"
max(click_data$visit_date)
## [1] "2017-12-31"

Baseline Conversion Rates

What is the current value (e.g. click rate) before any experimental variable has been manipulated?
We need a baseline for comparison; otherwise, there is no way of knowing whether the experimental manipulation had an effect.

find the current conversion rate:

click_data %>%
  summarize( conversion_rate = mean( clicked_adopt_today ) )
## # A tibble: 1 x 1
##   conversion_rate
##             <dbl>
## 1           0.277

there is an overall conversion rate of roughly 28% for the year’s worth of data

now look at the clickthrough rate as a function of month to explore seasonality effects:

glimpse( click_data )
## Rows: 3,650
## Columns: 2
## $ visit_date          <date> 2017-01-01, 2017-01-02, 2017-01-03, 2017-01-04, …
## $ clicked_adopt_today <dbl> 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
click_data %>%
  mutate( month = format( visit_date, '%m' ) ) %>%
  group_by( month ) %>%
  summarise( conversion_rate = mean( clicked_adopt_today ) )
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 2
##    month conversion_rate
##    <chr>           <dbl>
##  1 01              0.197
##  2 02              0.189
##  3 03              0.145
##  4 04              0.15 
##  5 05              0.258
##  6 06              0.333
##  7 07              0.348
##  8 08              0.542
##  9 09              0.293
## 10 10              0.161
## 11 11              0.233
## 12 12              0.465

alternatively, the month() function from lubridate can be used:

click_data %>%
  group_by( month = month( visit_date ) ) %>%
  summarise( conversion_rate = mean( clicked_adopt_today ) )
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 2
##    month conversion_rate
##    <dbl>           <dbl>
##  1     1           0.197
##  2     2           0.189
##  3     3           0.145
##  4     4           0.15 
##  5     5           0.258
##  6     6           0.333
##  7     7           0.348
##  8     8           0.542
##  9     9           0.293
## 10    10           0.161
## 11    11           0.233
## 12    12           0.465

let’s visualize this result:

month_abs <- c( 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
click_data %>%
  group_by( month = month( visit_date ) ) %>%
  summarise( conversion_rate = mean( clicked_adopt_today ) ) %>%
  ggplot( aes( x = month, y = conversion_rate ) ) +
  geom_point() +
  geom_line() +
  scale_x_continuous( breaks = c(1:12), labels = month_abs )
## `summarise()` ungrouping output (override with `.groups` argument)

From this plot, we see that conversion rates are not steady across the months of the year. Rather, there is a peak during the summer months culminating in August as well as a peak in December.
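
A quick way to corroborate this seasonality (my own sketch, not course code) is a logistic regression with month as a categorical predictor; the summer and December coefficients should stand out relative to the January baseline:

# quick check (sketch): does month of visit predict clicking?
glm( clicked_adopt_today ~ factor( month( visit_date ) ),
     family = 'binomial',
     data = click_data ) %>%
  tidy()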

How about days of the week?

wday_lab <- c( 'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat' ) # wday() numbers days with Sunday = 1 by default
# Calculate the mean conversion rate by day of the week
click_data %>%
  group_by(wday = wday(visit_date)) %>%
  summarize(conversion_rate = mean(clicked_adopt_today)) %>%
  ggplot( aes( x = wday, y = conversion_rate ) ) +
  geom_point() +
  geom_line() +
  scale_x_continuous( breaks = c(1:7), labels = wday_lab ) +
  ylim( c( 0, 0.5 ) )
## `summarise()` ungrouping output (override with `.groups` argument)

We don’t observe much difference in conversion rate as a function of day of the week.

How about by week?

click_data %>%
  group_by(wk = week(visit_date)) %>%
  summarize(conversion_rate = mean(clicked_adopt_today)) %>%
  ggplot( aes( x = wk, y = conversion_rate ) ) +
  geom_point() +
  geom_line() +
  scale_y_continuous( limits = c( 0,1 ), labels = percent ) +
  scale_x_continuous( breaks = seq( 0,55, by=5 ), labels = month_abs )
## `summarise()` ungrouping output (override with `.groups` argument)

Experimental Design and Power Analysis

How long do we need to run our experiment?

Power Analysis

  • statistical test: what statistical test you plan on running
  • baseline value: value for the current control condition
  • desired value: expected value for the test condition
  • proportion of the data: from the test condition (ideally 0.5)
  • significance threshold/\(\alpha\): the probability threshold below which an effect is considered significant (generally 0.05)
  • power/\(1-\beta\): the probability of correctly rejecting the null hypothesis (generally 0.8)
total_sample_size <- SSizeLogisticBin( p1 = 0.2, #baseline
                                       p2 = 0.3, #our expected guess for the test condition
                                       B = 0.5, #typical val
                                       alpha = 0.05, #typical val
                                       power = 0.8 ) #typical val
res <- paste( 'Total Sample Size:', total_sample_size, 
              '\nSize for each condition:', total_sample_size/2 )
cat( res, sep = '\n' )
## Total Sample Size: 587 
## Size for each condition: 293.5
# Compute and look at sample size for experiment in August
total_sample_size <- SSizeLogisticBin(p1 = 0.54, #get the value for August
                                      p2 = 0.64,
                                      B = 0.5,
                                      alpha = 0.05,
                                      power = 0.8)
res <- paste( 'Total Sample Size:', total_sample_size, 
              '\nSize for each condition:', total_sample_size/2 )
cat( res, sep = '\n' )
## Total Sample Size: 758 
## Size for each condition: 379

Compare the sample size needed above for a predicted 10 percentage point increase in clickthrough with the size needed for a 5 percentage point increase:

# Compute and look at sample size for experiment in August with a 5 percentage point increase
total_sample_size <- SSizeLogisticBin(p1 = 0.54,
                                      p2 = 0.59,
                                      B = 0.5,
                                      alpha = 0.05,
                                      power = 0.8)
res <- paste( 'Total Sample Size:', total_sample_size, 
              '\nSize for each condition:', total_sample_size/2 )
cat( res, sep = '\n' )
## Total Sample Size: 3085 
## Size for each condition: 1542.5

Mini case study in A/B Testing II

Analyzing Results

loading the experimental data:

url <- 'https://assets.datacamp.com/production/repositories/2292/datasets/52b52cb1ca28ce10f9a09689325c4d94d889a6da/experiment_data.csv'
experimental_data <- read_csv( url )
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   visit_date = col_date(format = ""),
##   condition = col_character(),
##   clicked_adopt_today = col_double()
## )
glimpse( experimental_data )
## Rows: 588
## Columns: 3
## $ visit_date          <date> 2018-01-01, 2018-01-01, 2018-01-01, 2018-01-01, …
## $ condition           <chr> "control", "control", "control", "control", "test…
## $ clicked_adopt_today <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1…

look at the conversion rate for each condition:

experimental_data %>%
  group_by( condition ) %>%
  summarise( conversion_rate = mean( clicked_adopt_today ) )
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   condition conversion_rate
##   <chr>               <dbl>
## 1 control             0.167
## 2 test                0.384

look at conversion rate as a function of visit date:

experimental_data_sum <- experimental_data %>%
  group_by( visit_date, condition ) %>%
  dplyr::summarize( conversion_rate = mean( clicked_adopt_today ) )
## `summarise()` regrouping output by 'visit_date' (override with `.groups` argument)
head( experimental_data_sum )
## # A tibble: 6 x 3
## # Groups:   visit_date [3]
##   visit_date condition conversion_rate
##   <date>     <chr>               <dbl>
## 1 2018-01-01 control             0.25 
## 2 2018-01-01 test                0.429
## 3 2018-01-02 control             0.25 
## 4 2018-01-02 test                0.333
## 5 2018-01-03 control             0    
## 6 2018-01-03 test                0.5

Now visualize:

ggplot( experimental_data_sum,
        aes( x = visit_date,
             y = conversion_rate,
             color = condition,
             group = condition ) ) +
  geom_point() +
  geom_line()

Generally, the test condition’s conversion rate is higher than the control’s on any given day. Next, we support this observation with statistics.

#logistic regression
glm( clicked_adopt_today ~ condition,
     family = 'binomial',
     data = experimental_data ) %>%
  tidy()
## # A tibble: 2 x 5
##   term          estimate std.error statistic  p.value
##   <chr>            <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)      -1.61     0.156    -10.3  8.28e-25
## 2 conditiontest     1.14     0.197      5.77 7.73e- 9

The p-value for condition is very small (much smaller than conventional cutoffs of 0.05 or 0.01), so we can reject the null hypothesis that there is no difference between groups. Additionally, the estimate for the test condition is ~1.14; since this is a logistic regression, that estimate is on the log-odds scale, so the test condition has substantially higher odds of a click than the control (not a mean that is 1.14 higher).
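
As a sanity check (my own addition), the fitted log-odds can be converted back to probabilities with plogis(); they should reproduce the observed conversion rates for each group:

# convert the log-odds estimates back to predicted conversion rates
plogis( -1.61 )        # control: ~0.167
plogis( -1.61 + 1.14 ) # test: ~0.384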

Designing Follow-up Experiments

Tips for designing new experiments:

  • Build several small follow-up experiments. But make sure they are unique testable ideas that introduce 1 measurable change.
  • avoid confounding variables
  • test small changes

For the previous example the test condition’s conversion rate was 39%. Let’s find the sample size we would need for the next test condition where we estimate that the conversion rate will increase to 59%:

# Run logistic regression power analysis
total_sample_size <- SSizeLogisticBin(p1 = 0.39,
                                      p2 = 0.59,
                                      B = 0.5,
                                      alpha = 0.05,
                                      power = 0.8)
total_sample_size
## [1] 194

Pre-follow-up Experiment Assumptions

revisit the old data before executing a follow-up experiment…

# Compute monthly summary (eight_month_checkin_data was provided in the course and is not available here)
eight_month_checkin_data_sum <- eight_month_checkin_data %>%
  mutate(month_text = month(visit_date, label = TRUE)) %>%
  group_by(month_text, condition) %>%
  summarize(conversion_rate = mean( clicked_adopt_today ))

# Plot month-over-month results
ggplot(eight_month_checkin_data_sum,
       aes(x = month_text,
           y = conversion_rate,
           color = condition,
           group = condition)) +
  geom_point(size = 4) +
  geom_line( lwd = 1) +
  scale_y_continuous(limits = c(0, 1),
                     labels = percent) +
  labs(x = "Month",
       y = "Conversion Rate")

conversion rates for the test condition are consistently higher than the control’s, month over month

computing the differences in conversion rates between the two conditions
(I do this here with the original experiment dataset grouped by weekday, since the course’s 8-month projection dataset, which is grouped by month, was not made available; the month_text variable name is kept from the course code)

experimental_data_diff <- experimental_data %>%
  mutate( month_text = wday( visit_date, label = TRUE)) %>%
  group_by( month_text, condition ) %>%
  summarize( conversion_rate = mean( clicked_adopt_today )) %>%
  spread( condition, conversion_rate ) %>%
  mutate( condition_diff = test - control )
## `summarise()` regrouping output by 'month_text' (override with `.groups` argument)
head( experimental_data_diff ) 
## # A tibble: 6 x 4
## # Groups:   month_text [6]
##   month_text control  test condition_diff
##   <ord>        <dbl> <dbl>          <dbl>
## 1 Sun          0.175 0.389          0.214
## 2 Mon          0.235 0.395          0.160
## 3 Tue          0.156 0.302          0.147
## 4 Wed          0.119 0.426          0.306
## 5 Thu          0.194 0.386          0.192
## 6 Fri          0.135 0.467          0.332

What are the summary statistics for the differences in conversion rates?

summary( experimental_data_diff$condition_diff )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1468  0.1630  0.1919  0.2167  0.2602  0.3315
mean( experimental_data_diff$condition_diff )
## [1] 0.216667
sd( experimental_data_diff$condition_diff )
## [1] 0.07363963

The 8 month dataset presented in the course videos results in a 19% mean difference in conversion rate with 4% standard deviation when broken down by month. Those numbers are fairly consistent with the results above.

For comparison, here we look at the difference between two different years worth of data for the control condition (dataset not available):

# Compute difference over time
no_hat_data_diff <- no_hat_data_sum %>%
  spread(year, conversion_rate) %>%
  mutate(year_diff = `2018` - `2017`)
no_hat_data_diff

# Compute summary statistics
mean(no_hat_data_diff$year_diff, na.rm = TRUE)
sd(no_hat_data_diff$year_diff, na.rm = TRUE)

From the course: 1) the conversion rate for the “no hat” condition in 2017 was 30% (0.30), and 2) the average difference between the “no hat” and “cat hat” conditions is 19% (0.19). Use this information to run an updated power analysis, assuming a further increase of 15 percentage points (0.15) for the new test condition: p1 = 0.30 + 0.19 = 0.49 and p2 = 0.49 + 0.15 = 0.64.

# Run power analysis for logistic regression
total_sample_size <- SSizeLogisticBin(p1 = 0.49,
                                      p2 = 0.64,
                                      B = 0.5,
                                      alpha = 0.05,
                                      power = 0.8)
total_sample_size
## [1] 341

Experimental Design in A/B Testing

A/B Testing Research Questions

A/B Testing: the use of experimental design and statistics to compare two or more variants of a design. The term generally refers to testing website designs, but it can describe any situation that compares two different conditions.

Uses for A/B testing:

  • Conversion rates ( e.g. clicks or purchases )
  • Engagement ( e.g. sharing, ’like’ing )
  • Dropoff rate ( e.g. leaving the site )
  • Time spent on a website
url <- 'https://assets.datacamp.com/production/repositories/2292/datasets/b502094e5de478105cccea959d4f915a7c0afe35/data_viz_website_2018_04.csv'
viz_website_2017 <- read_csv( url )
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   visit_date = col_date(format = ""),
##   condition = col_character(),
##   time_spent_homepage_sec = col_double(),
##   clicked_article = col_double(),
##   clicked_like = col_double(),
##   clicked_share = col_double()
## )
glimpse( viz_website_2017 )
## Rows: 30,000
## Columns: 6
## $ visit_date              <date> 2018-04-01, 2018-04-01, 2018-04-01, 2018-04-…
## $ condition               <chr> "tips", "tips", "tips", "tips", "tips", "tips…
## $ time_spent_homepage_sec <dbl> 49.01161, 48.86452, 49.07467, 49.26011, 50.37…
## $ clicked_article         <dbl> 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, …
## $ clicked_like            <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, …
## $ clicked_share           <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

look at the average time spent on the homepage:

viz_website_2017 %>%
  summarise( mean( time_spent_homepage_sec ) )
## # A tibble: 1 x 1
##   `mean(time_spent_homepage_sec)`
##                             <dbl>
## 1                            50.0
viz_website_2017 %>%
  group_by( condition ) %>%
  summarise( mean( time_spent_homepage_sec ) )
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   condition `mean(time_spent_homepage_sec)`
##   <chr>                               <dbl>
## 1 tips                                 50.0
## 2 tools                                50.0
viz_website_2017 %>%
  group_by(condition) %>%
  summarise(article_conversion_rate = mean(clicked_article),
            like_conversion_rate = mean(clicked_like),
            share_conversion_rate = mean(clicked_share))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 4
##   condition article_conversion_rate like_conversion_rate share_conversion_rate
##   <chr>                       <dbl>                <dbl>                 <dbl>
## 1 tips                        0.602               0.166                 0.0329
## 2 tools                       0.608               0.0691                0.03
# Compute 'like' click summary by condition
viz_website_2017_like_sum <- viz_website_2017 %>%
  group_by(condition) %>%
  summarize(like_conversion_rate = mean(clicked_like))
## `summarise()` ungrouping output (override with `.groups` argument)
viz_website_2017_like_sum
## # A tibble: 2 x 2
##   condition like_conversion_rate
##   <chr>                    <dbl>
## 1 tips                    0.166 
## 2 tools                   0.0691
# Plot 'like' click summary by condition
ggplot(viz_website_2017_like_sum,
       aes(x = condition, y = like_conversion_rate, group = 1)) +
  geom_point() +
  geom_line() +
  scale_y_continuous(limits = c(0, 1), labels = percent)

month <- c( rep( c( 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec' ), 2 ) )
action <- c( rep( 'like', 12 ), rep( 'share', 12 ) )
conversion_rate <- c( 0.197, 0.118, 0.148, 0.166, 0.212, 0.297, 0.404, 0.125, 0.153, 0.202,
                      0.249, 0.294, 0.0516, 0.0113, 0.0192, 0.0296, 0.0501, 0.0701, 0.0203,
                      0.0104, 0.0177, 0.0385, 0.0607, 0.0188 )
viz_website_2017_like_share_sum <- data.frame( 'month' = month, 
                                               'action' = action, 
                                               'conversion_rate' = conversion_rate )
glimpse( viz_website_2017_like_share_sum )
## Rows: 24
## Columns: 3
## $ month           <chr> "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug…
## $ action          <chr> "like", "like", "like", "like", "like", "like", "like…
## $ conversion_rate <dbl> 0.1970, 0.1180, 0.1480, 0.1660, 0.2120, 0.2970, 0.404…
# Plot comparison of 'like'ing and 'share'ing an article
ggplot(viz_website_2017_like_share_sum,
       aes(x = month, y = conversion_rate, color = action, group = action)) +
  geom_point() +
  geom_line() +
  scale_y_continuous(limits = c(0,1), labels = percent)

People are much less likely to share an article than like it.

Assumptions and Types of A/B Testing

Within vs Between groups.

  • within - each participant sees both conditions
  • between - different groups of participants see different conditions
    • Assumption: there should be nothing qualitatively different between the two groups of participants, so as not to introduce any confounding variables.
  • A/B - compare a control and a test condition ( tips vs tools )
  • A/A - compare two groups that both receive the control condition ( tips(1) vs tips(2) ). This tests whether the two groups really come from the same population and whether the control metric is stable. If you get a significant difference in an A/A test, your sampling or group assignment is likely biased (a simulated A/A check is sketched after this list).
  • A/B/N - compare a control condition to any number of different test conditions (tips vs tools vs strategies )
    • can be tempting to go after, but the statistics are more complicated and it requires more data
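
A minimal sketch of an A/A-style check (my own illustration, faking the two ‘tips’ groups by randomly splitting the existing ‘tips’ visitors rather than running a real A/A experiment): a logistic regression on the artificial split should not detect a significant difference.

# A/A-style sanity check (sketch): randomly split the 'tips' visitors into
# two pseudo-groups and confirm that no significant difference is detected
set.seed( 42 )
aa_data <- viz_website_2017 %>%
  filter( condition == 'tips' ) %>%
  mutate( aa_group = sample( c( 'tips_1', 'tips_2' ), n(), replace = TRUE ) )

glm( clicked_like ~ aa_group,
     family = 'binomial',
     data = aa_data ) %>%
  tidy()
# a large p-value for the aa_group coefficient is the desired outcome here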

Confounding Variables

Confounding Variable - an element of the environment that varies alongside the experimental conditions and can affect your ability to find the true effect in an A/B experiment.

condition <- c( rep( c( 'tips', 'tools' ), 56/2 ) )
article_published <- c( rep( 'no', 28 ), rep( 'yes', 28 ) )
visit_date <- c( sort( rep( seq(as.Date("2018-02-01"), by = "day", length.out = 28), 2 ) ) )
like_conversion_rate <- c( 0.112, 0.0171, 0.109, 0.0143, 0.118, 0.00996, 0.0977, 
                           0.0206, 0.101, 0.0262, 0.139, 0.0202, 0.115, 0.0206,
                           0.141, 0.0265, 0.124, 0.0348, 0.135, 0.00815, 0.117,
                           0.0304, 0.118, 0.0239, 0.108, 0.0193, 0.124, 0.00967,
                           0.107, 0.0816, 0.123, 0.0510, 0.103, 0.0714, 0.119,
                           0.0815, 0.136, 0.0658, 0.149, 0.0821, 0.130, 0.116,
                           0.106, 0.104, 0.107, 0.133, 0.134, 0.138, 0.108, 0.131,
                           0.135, 0.173, 0.113, 0.141, 0.0936, 0.126 )
viz_website_2018_02_sum <- data.frame( 'visit_date' = visit_date, 'condition' = condition,
                                       'article_published' = article_published, 
                                       'like_conversion_rate' = like_conversion_rate )                   

glimpse( viz_website_2018_02_sum )      
## Rows: 56
## Columns: 4
## $ visit_date           <date> 2018-02-01, 2018-02-01, 2018-02-02, 2018-02-02,…
## $ condition            <chr> "tips", "tools", "tips", "tools", "tips", "tools…
## $ article_published    <chr> "no", "no", "no", "no", "no", "no", "no", "no", …
## $ like_conversion_rate <dbl> 0.11200, 0.01710, 0.10900, 0.01430, 0.11800, 0.0…
# Plot 'like' conversion rates by date for experiment
ggplot(viz_website_2018_02_sum,
       aes(x = visit_date,
           y = like_conversion_rate,
           color = condition,
           linetype = article_published,
           group = interaction(condition, article_published))) +
  geom_point() +
  geom_line() +
  geom_vline(xintercept = as.numeric(as.Date("2018-02-15"))) +
  scale_y_continuous(limits = c(0, 0.3), labels = percent)

Clearly, the publication of the article partway through the experiment is a confounding variable for the ‘like’ conversion rate.
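
One way to account for such a confound statistically (my own sketch, fit on the daily summary data built above; trial-level data would be preferable) is to include the confounder as a covariate alongside condition:

# sketch: adjust for the confound by adding it to the model as a covariate
lm( like_conversion_rate ~ condition + article_published,
    data = viz_website_2018_02_sum ) %>%
  tidy()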

Side Effects

What if the test condition has an unintended side effect (e.g. a longer load time) that may affect clickthrough rates and/or actions?
Examples: page load time, and information ‘above the fold’ (what a visitor can see without scrolling).

condition <- c( rep( c( 'tips', 'tools' ), 62/2 ) )
pageload_delay_added <- c( rep( 'no', 31 ), rep( 'yes', 31 ) )
visit_date <- c( sort( rep( seq(as.Date("2018-03-01"), by = "day", length.out = 31), 2 ) ) )
like_conversion_rate <- c( 0.146, 0.0543, 0.151, 0.0514, 0.136, 0.048, 0.169, 0.0569,
                           0.143, 0.0576, 0.160, 0.0454, 0.169, 0.0487, 0.140, 0.0395,
                           0.146, 0.0573, 0.135, 0.044, 0.141, 0.0451, 0.139, 0.0496,
                           0.159, 0.0534, 0.165, 0.0419, 0.101, 0.0548, 0.110, 0.0359,
                           0.105, 0.0561, 0.103, 0.0535, 0.0840, 0.0454, 0.109, 0.0458,
                           0.0846, 0.0395, 0.115, 0.0610, 0.111, 0.0758, 0.112, 0.0482,
                           0.103, 0.0323, 0.0948, 0.0520, 0.110, 0.0638, 0.0981, 0.0447,
                           0.108, 0.0701, 0.0795, 0.0544, 0.102, 0.0437 )
                           
viz_website_2018_03_sum <- data.frame( 'visit_date' = visit_date, 'condition' = condition,
                                       'pageload_delay_added' = pageload_delay_added, 
                                       'like_conversion_rate' = like_conversion_rate ) 
glimpse( viz_website_2018_03_sum )
## Rows: 62
## Columns: 4
## $ visit_date           <date> 2018-03-01, 2018-03-01, 2018-03-02, 2018-03-02,…
## $ condition            <chr> "tips", "tools", "tips", "tools", "tips", "tools…
## $ pageload_delay_added <chr> "no", "no", "no", "no", "no", "no", "no", "no", …
## $ like_conversion_rate <dbl> 0.1460, 0.0543, 0.1510, 0.0514, 0.1360, 0.0480, …
# Plot 'like' conversion rate by day
ggplot(viz_website_2018_03_sum,
       aes(x = visit_date,
           y = like_conversion_rate,
           color = condition,
           linetype = pageload_delay_added,
           group = interaction(condition, pageload_delay_added))) +
  geom_point() +
  geom_line() +
  geom_vline(xintercept = as.numeric(as.Date("2018-03-15"))) +
  scale_y_continuous(limits = c(0, 0.3), labels = percent)

Adding a page-load delay had a visible effect on the ‘like’ conversion rate for the ‘tips’ condition.

Statistical Analysis in A/B Testing

Power Analysis

What are power analyses?

Power

  • The probability of rejecting the null hypothesis when it is false.
  • The basis of procedures for estimating the sample size needed to detect an effect of a particular magnitude
  • Power gives a method of discriminating between competing tests of the same hypothesis, the test with the higher power being preferred.

Significance Level (\(\alpha\))

  • The level of probability at which it is agreed that the null hypothesis will be rejected
  • Conventionally set at 0.05; 0.01 is common as well.

Effect Size

  • Most commonly, the difference between the control group and experimental group population means of a response variable, divided by the assumed common population standard deviation.
  • Estimated by the difference of the sample means in the two groups divided by a pooled estimate of the assumed common standard deviation (a worked sketch of this follows below).
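
A hand computation of Cohen’s d for the homepage-time data, matching the definition above (my own sketch; the pwr calls below take d as an input rather than computing it):

# Cohen's d (sketch): difference in group means divided by a pooled SD
tips_sec <- viz_website_2017 %>%
  filter( condition == 'tips' ) %>%
  pull( time_spent_homepage_sec )
tools_sec <- viz_website_2017 %>%
  filter( condition == 'tools' ) %>%
  pull( time_spent_homepage_sec )

pooled_sd <- sqrt( ( ( length( tips_sec ) - 1 ) * var( tips_sec ) +
                     ( length( tools_sec ) - 1 ) * var( tools_sec ) ) /
                   ( length( tips_sec ) + length( tools_sec ) - 2 ) )
( mean( tips_sec ) - mean( tools_sec ) ) / pooled_sd
# essentially zero here, since both groups average ~50 seconds on the homepage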

Power analysis relationships: with significance level and effect size held constant, increasing the desired power increases the number of data points required. If power and effect size are held constant, the number of data points required decreases as the significance level is raised (made less strict). If power and significance level are held constant, the number of data points needed decreases as the effect size to be detected gets larger.
So, in summary: the higher the power, the smaller the effect size, and the lower the significance level, the more data samples will be needed.
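
To see the first relationship concretely (my own comparison; the runs below only vary the effect size), raising the desired power while holding the significance level and effect size fixed increases the required sample size:

# holding sig.level and d fixed, higher power requires a larger n per group
pwr.t.test( power = 0.8, sig.level = 0.05, d = 0.6 )$n # ~45 per group (same as the full run below)
pwr.t.test( power = 0.9, sig.level = 0.05, d = 0.6 )$n # ~60 per group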

pwr.t.test( power = 0.8,
            sig.level = 0.05,
            d = 0.6 )
## 
##      Two-sample t test power calculation 
## 
##               n = 44.58577
##               d = 0.6
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

what happens to the number of data points needed if the effect size is smaller?

pwr.t.test( power = 0.8,
            sig.level = 0.05,
            d = 0.2 )
## 
##      Two-sample t test power calculation 
## 
##               n = 393.4057
##               d = 0.2
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

n, the number of data points, has dramatically increased.

Statistical Tests

Common statistical tests for A/B testing:

  • logistic regression - a binary categorical dependent variable (e.g. clicked or didn’t click)
  • t-test (linear regression) - a continuous dependent variable (e.g. time spent on website)
#example t-test
glimpse( viz_website_2017 )
## Rows: 30,000
## Columns: 6
## $ visit_date              <date> 2018-04-01, 2018-04-01, 2018-04-01, 2018-04-…
## $ condition               <chr> "tips", "tips", "tips", "tips", "tips", "tips…
## $ time_spent_homepage_sec <dbl> 49.01161, 48.86452, 49.07467, 49.26011, 50.37…
## $ clicked_article         <dbl> 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, …
## $ clicked_like            <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, …
## $ clicked_share           <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
example_ttest <- t.test( time_spent_homepage_sec ~ condition,
                         data = viz_website_2017 )
example_ttest
## 
##  Welch Two Sample t-test
## 
## data:  time_spent_homepage_sec by condition
## t = 0.36288, df = 29997, p-value = 0.7167
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.01850573  0.02691480
## sample estimates:
##  mean in group tips mean in group tools 
##            49.99909            49.99489

The p-value is 0.7167, far from significant. Therefore we cannot reject the null hypothesis.

Since the independent variable has only two levels, a linear regression gives essentially the same result as the t-test (t.test() defaults to Welch’s unequal-variances test, so the numbers can differ very slightly from the pooled-variance lm()).

lm( time_spent_homepage_sec ~ condition, data = viz_website_2017 ) %>%
  summary()
## 
## Call:
## lm(formula = time_spent_homepage_sec ~ condition, data = viz_website_2017)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8992 -0.6728  0.0033  0.6727  4.0290 
## 
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)    49.999093   0.008193 6102.679   <2e-16 ***
## conditiontools -0.004205   0.011587   -0.363    0.717    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.003 on 29998 degrees of freedom
## Multiple R-squared:  4.39e-06,   Adjusted R-squared:  -2.895e-05 
## F-statistic: 0.1317 on 1 and 29998 DF,  p-value: 0.7167

run another logistic regression

# Run logistic regression
ab_experiment_results <- glm(clicked_like ~ condition,
                             family = "binomial",
                             data = viz_website_2017) %>%
  tidy()
ab_experiment_results
## # A tibble: 2 x 5
##   term           estimate std.error statistic   p.value
##   <chr>             <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)      -1.61     0.0219     -73.5 0.       
## 2 conditiontools   -0.989    0.0390     -25.4 4.13e-142

The ‘tools’ condition has a significantly lower ‘like’ clickthrough rate than ‘tips’.

Stopping Rules & Sequential Analysis

Stopping Rules: Procedures that allow interim analysis in clinical trials at predefined times, while preserving the type 1 error at some pre-specified level.

also known as…
Sequential Analysis: a procedure in which a statistical test of significance is conducted repeatedly over time as the data are collected. After each observation, the cumulative data are analyzed and one of the following three decisions is taken:

  • STOP: reject the null hypothesis and claim statistical significance
  • STOP: do not reject the null hypothesis and state that the results are not statistically significant
  • Continue: since as yet the cumulative data are inadequate to draw a conclusion

Stopping rules are helpful because, by being built into the experimental design ahead of time, they prevent p-value fishing trips (p-hacking). They are also useful when little is known about the effect size in advance, and they allocate data-collection resources more effectively.

seq_analysis <- gsDesign( k = 4,
                          test.type = 1,
                          alpha = 0.05,
                          beta = 0.2,
                          sfu = 'Pocock' )
seq_analysis
## One-sided group sequential design with
## 80 % power and 5 % Type I Error.
##            Sample
##             Size 
##   Analysis Ratio*  Z   Nominal p  Spend
##          1  0.306 2.07    0.0193 0.0193
##          2  0.612 2.07    0.0193 0.0132
##          3  0.918 2.07    0.0193 0.0098
##          4  1.224 2.07    0.0193 0.0077
##      Total                       0.0500 
## 
## ++ alpha spending:
##  Pocock boundary.
## * Sample size ratio compared to fixed design with no interim
## 
## Boundary crossing probabilities and expected sample size
## assume any cross stops the trial
## 
## Upper boundary (power or Type I Error)
##           Analysis
##    Theta      1      2      3      4 Total   E{N}
##   0.0000 0.0193 0.0132 0.0098 0.0077  0.05 1.1952
##   2.4865 0.2445 0.2455 0.1845 0.1255  0.80 0.7929
max_n <- 1000
max_n_per_group <- max_n / 2
stopping_points <- max_n_per_group * seq_analysis$timing
stopping_points
## [1] 125 250 375 500

At any of the above listed points, if there is a significant result, we can stop collecting data.
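
As an illustration of how an interim look might work in practice (my own sketch; interim_p below is a hypothetical value, not computed from any dataset here), each interim p-value is compared against the Pocock nominal boundary reported above:

# sketch: compare a hypothetical interim p-value to the Pocock nominal boundary
nominal_p <- pnorm( seq_analysis$upper$bound, lower.tail = FALSE )
nominal_p # roughly 0.0193 at each of the four looks

interim_p <- 0.012 # hypothetical one-sided p-value at the first look
if ( interim_p < nominal_p[1] ) {
  message( 'Boundary crossed at look 1: stop and declare significance' )
} else {
  message( 'Continue collecting data' )
}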

Multivariate Testing

What if you absolutely have to test two different aspects at once?
Go with a multivariate analysis:
multivar_results <- lm( dependent_var ~ var_1 * var_2, data = df )

glimpse( viz_website_2017 )
## Rows: 30,000
## Columns: 6
## $ visit_date              <date> 2018-04-01, 2018-04-01, 2018-04-01, 2018-04-…
## $ condition               <chr> "tips", "tips", "tips", "tips", "tips", "tips…
## $ time_spent_homepage_sec <dbl> 49.01161, 48.86452, 49.07467, 49.26011, 50.37…
## $ clicked_article         <dbl> 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, …
## $ clicked_like            <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, …
## $ clicked_share           <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

Try a multivariate analysis of time spent on the homepage by ‘like’ and ‘share’ clicks (just for demonstration):

multivar_res <- viz_website_2017 %>%
  mutate( var_1 = factor( clicked_like, levels = c( 0, 1 ) ),
          var_2 = factor( clicked_share, levels = c( 0, 1 ) ) ) %>%
  lm( time_spent_homepage_sec ~ var_1 * var_2,
      data = . ) %>%
  tidy()

multivar_res
## # A tibble: 4 x 5
##   term          estimate std.error statistic p.value
##   <chr>            <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)   50.0       0.00626  7981.      0    
## 2 var_11        -0.00519   0.0183     -0.283   0.777
## 3 var_21         0.0119    0.0357      0.332   0.740
## 4 var_11:var_21  0.111     0.0965      1.14    0.252