Intro

This is a quick introduction to calculating the sample size for an A/B test that measures lift on a percentage metric (such as conversion rate).

Quick Theory

Five quantities are linked: the sample size and the four parameters below. Fix any four and the fifth is determined, so to estimate sample size you need:

Current Rate
Minimum Detectable Change (the minimum change you want to be able to detect)
Statistical Significance (significance level): the probability of rejecting the null hypothesis (H0) when it is actually true (a Type I error)
Statistical Power: the probability of correctly rejecting the null hypothesis (H0) when the alternative (H1) is true. In other words, the ability of a test to detect an effect if the effect actually exists.
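
The way these parameters combine can be sketched with the standard normal-approximation formula for comparing two proportions. The snippet below is a minimal illustration rather than a definitive implementation: the variable names are illustrative, and the values (a 1% current rate, a 1 percentage point minimum detectable change, 5% significance, 80% power) are assumed for the example.

```r
# Illustrative inputs (assumed values for this sketch)
p1    <- 0.01   # current rate
mdc   <- 0.01   # minimum detectable change, in absolute terms
alpha <- 0.05   # significance level (two-sided)
power <- 0.80   # statistical power

p2    <- p1 + mdc          # rate we want to be able to detect
p_bar <- (p1 + p2) / 2     # pooled rate under H0

# Normal quantiles for the chosen significance level and power
z_alpha <- qnorm(1 - alpha / 2)
z_beta  <- qnorm(power)

# Per-group sample size from the normal approximation
n_per_group <- (z_alpha * sqrt(2 * p_bar * (1 - p_bar)) +
                z_beta  * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))^2 / (p2 - p1)^2
n_per_group
```

Raising the power, lowering the significance level, or shrinking the minimum detectable change each increases the required sample size.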

Alternative Functions to Estimate Sample Size

Function 1: power.prop.test

Base R's stats package provides power.prop.test for this. Leave exactly one of n, p1, p2, power, or sig.level unspecified and the function solves for it. Function parameters:

n: number of observations per group
p1: probability in one group
p2: probability in other group
sig.level: significance level (Type I error probability)
power: power of the test (1 minus Type II error probability)
alternative: one or two sided test
strict: use strict interpretation in two-sided case
tol: numerical tolerance used in root finding

# Sample size needed to pick up a 1 percentage point increase in rate (from 1% to 2%)
power.prop.test(p1=.01, p2=.02, power=0.8, sig.level=0.05)

## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 2318.165
##              p1 = 0.01
##              p2 = 0.02
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
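
Since the reported n is fractional and covers one group only, a common follow-up is to round it up and double it. A small sketch, assuming two equal-sized groups; the variable names are illustrative:

```r
# Store the result instead of printing it; it is a list with an element n
res <- power.prop.test(p1 = 0.01, p2 = 0.02, power = 0.8, sig.level = 0.05)

n_per_group <- ceiling(res$n)    # round up to whole observations
n_total     <- 2 * n_per_group   # both groups combined

n_per_group
n_total
```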

# Smallest rate in the second group (p2) that can be detected with 1000 observations in each group
power.prop.test(p1=.01, n=1000, power=0.8, sig.level=0.05)

## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 1000
##              p1 = 0.01
##              p2 = 0.02684817
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
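
The same function can also report the power of a test whose size is already fixed, by leaving power unspecified. A hedged sketch using the 1% and 2% rates from the first example with 1000 observations per group; the variable names are illustrative:

```r
# Leave power out so the function solves for it
res_power <- power.prop.test(n = 1000, p1 = 0.01, p2 = 0.02, sig.level = 0.05)

achieved_power <- res_power$power
achieved_power
```

Because 1000 is well below the roughly 2318 observations per group needed for 80% power, the result falls short of 0.8.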

Function 2: pwr.2p2n.test {pwr}

This function from the pwr package calculates sample sizes for groups of unequal size. By Cohen's convention, h = 0.2 is a small effect, 0.5 a medium effect, and 0.8 a large effect.

Parameters

h - Effect size.
n1 - Number of observations in the first sample
n2 - Number of observations in the second sample
sig.level - Significance level (Type I error probability)
power - Power of test (1 minus Type II error probability)
alternative - a character string specifying the alternative hypothesis, must be one of “two.sided” (default), “greater” or “less”
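
The effect size h is Cohen's h, the difference between the two proportions after an arcsine transformation. A minimal base R sketch for a change from 1% to 2% (the variable names are illustrative; the pwr package's ES.h helper computes the same quantity):

```r
p1 <- 0.01  # rate in the first group
p2 <- 0.02  # rate in the second group

# Cohen's h: difference of arcsine-transformed proportions
h <- 2 * asin(sqrt(p2)) - 2 * asin(sqrt(p1))
h
```

This comes out at about 0.08, well under the conventional 0.2 for a small effect, so a 1 percentage point change at a low base rate is a much smaller effect than h = 0.2.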

library("pwr")
pwr.2p2n.test(h=0.2, n1=1000, sig.level=0.05, power=0.8)

## 
##      difference of proportion power calculation for binomial distribution (arcsine transformation) 
## 
##               h = 0.2
##              n1 = 1000
##              n2 = 244.1239
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: different sample sizes
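
The value of n2 can be checked by hand. Under the normal approximation on the arcsine scale, and ignoring the negligible opposite-tail term of the two-sided test (both are assumptions of this sketch), the target power is reached when n1 * n2 / (n1 + n2) equals ((z_alpha + z_beta) / h)^2. Solving for n2 in base R, with illustrative variable names:

```r
h     <- 0.2
n1    <- 1000
alpha <- 0.05
power <- 0.8

# Required value of n1 * n2 / (n1 + n2)
n_eff <- ((qnorm(1 - alpha / 2) + qnorm(power)) / h)^2

# Solve n1 * n2 / (n1 + n2) = n_eff for n2 (only possible when n1 > n_eff)
n2 <- n_eff * n1 / (n1 - n_eff)
n2
```

The condition n1 > n_eff matters in practice: if the fixed group is too small, no size of the second group can reach the requested power.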

Function to Measure Actual Test Results: prop.test

Test with the Estimated Sample Size (2319 per group)

prop.test(c(23, 48), c(2319,2319))

## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  c(23, 48) out of c(2319, 2319)
## X-squared = 8.2388, df = 1, p-value = 0.0041
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.01827178 -0.00328924
## sample estimates:
##      prop 1      prop 2 
## 0.009918068 0.020698577
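
The object returned by prop.test can also be inspected programmatically. A short sketch with the same counts; the variable names and the 0.05 threshold are illustrative choices:

```r
conversions <- c(23, 48)       # successes in the first and second group
visitors    <- c(2319, 2319)   # observations in each group

res <- prop.test(conversions, visitors)

rates <- unname(res$estimate)                       # observed rates per group
relative_lift <- (rates[2] - rates[1]) / rates[1]   # lift relative to the first group

res$p.value < 0.05   # is the difference significant at the 5% level?
res$conf.int         # 95% confidence interval for (prop 1 - prop 2)
relative_lift
```

Here the confidence interval excludes zero and the p-value is below 0.05, so the difference between the two groups is statistically significant at the 5% level.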

A/B Testing Sample Size

Daniel Castro

September 11, 2017