1. Introduction

Like any other type of scientific testing, A/B testing is at its core statistical hypothesis testing, i.e. statistical inference: an analytical method for making decisions that estimates population parameters from sample statistics.

In business intelligence, it is often applied to a group A of products and a group B of products shown to similar customers, to see which group we should produce and sell. Conversely, it can also be applied to a group A of customers and a group B of customers for similar products, to see which group we should target for our products.

The common questions remain – Are my results statistically significant? Is my sample size big enough?

We decide on two rival hypotheses – the Null Hypothesis and the Alternative Hypothesis. The Null Hypothesis is the assumption that there is no difference between Version A and Version B, i.e. their mean conversion rates are equal. The Alternative Hypothesis is the assumption that there is a difference between Version A and Version B. In most A/B tests it is usually enough to know the two values are different, i.e. we will use a Two-tailed alternative hypothesis. A One-tailed alternative allows us to specify the direction of inequality.

To answer the first question (are my results statistically significant?): to reject the Null Hypothesis we need a p-value lower than the selected significance level (commonly 5%), i.e. if p < 0.05 we conclude there is a statistically significant difference between versions A and B.

\(~\)

1.1. Note on metrics

Generally, rate is used when you want to measure the usability of the site, and probability when you want to measure the impact.

\(~\)

1.2. Binomial Distribution

For the proportion of successes in a binomial distribution with success probability p, the mean is p and the standard deviation is \(\sqrt{p(1-p)/N}\), where N is the number of trials. A binomial distribution can be used when:

- The outcomes are of two types
- Each event is independent of the others
- Each event has an identical distribution (i.e. p is the same for all)

\(~\)

1.3. Confidence Interval

A confidence interval indicates the range within which the true mean is expected to fall if the experiment were repeated many times.

For example, consider \(\hat{p}\), the proportion of users that click, where N is the number of users. Let us assume a binomial distribution (this requires \(N\hat{p}>5\) and \(N(1-\hat{p})>5\)). The margin of error is given by

\(m = z \cdot SE\)

\(m = z\sqrt{\frac{\hat{p}(1-\hat{p})}{N}}\)

For a 95% confidence interval, z = 1.96.
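
As a quick illustration, the margin of error and confidence interval for a click-through probability can be computed directly in R; the counts below are made up:

N = 5000            # hypothetical number of users
X = 450             # hypothetical number of users that click
p_hat = X/N

# Check the normal approximation is reasonable: N*p_hat > 5 and N*(1-p_hat) > 5
se = sqrt(p_hat*(1-p_hat)/N)
m = 1.96*se         # margin of error for a 95% confidence interval
c(p_hat - m, p_hat + m)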

\(~\)

1.4. Hypothesis Testing

The null hypothesis states that the difference between the control and experiment is due to chance. If \(p_{cont}\) and \(p_{exp}\) are the control and experiment probabilities, then according to the null hypothesis

\(H_0: p_{exp} - p_{cont} = 0\)

The alternate hypothesis is that

\(H_1: p_{exp} - p_{cont} \neq 0\)

\(~\)

1.5. Comparing two samples

For comparing two samples, we calculate the pooled standard error. For example, suppose \(X_{cont}\) and \(N_{cont}\) are the number of users that click and the total number of users in the control group, and let \(X_{exp}\) and \(N_{exp}\) be the corresponding values for the experiment group. The pooled probability is given by

\(\hat{p}_{pool} = \frac{X_{cont}+X_{exp}}{N_{cont}+N_{exp}}\)

\(SE_{pool} = \sqrt{\hat{p}_{pool}(1-\hat{p}_{pool})\left(\frac{1}{N_{cont}}+\frac{1}{N_{exp}}\right)}\)

\(\hat{d} = \hat{p}_{exp} - \hat{p}_{cont}\)

\(H_0: d = 0\), where \(\hat{d} \sim N(0, SE_{pool})\)

If \(\hat{d} > 1.96 \cdot SE_{pool}\) or \(\hat{d} < -1.96 \cdot SE_{pool}\), then we can reject the null hypothesis and state that the observed difference is statistically significant.
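
A minimal R sketch of this comparison (the function name and arguments are just for illustration); base R's prop.test performs an equivalent two-proportion test:

ab_pooled_test = function(X_cont, N_cont, X_exp, N_exp, z = 1.96) {
  p_pool = (X_cont + X_exp)/(N_cont + N_exp)
  se_pool = sqrt(p_pool*(1 - p_pool)*(1/N_cont + 1/N_exp))
  d_hat = X_exp/N_exp - X_cont/N_cont
  list(d_hat = d_hat,
       margin = z*se_pool,
       significant = abs(d_hat) > z*se_pool)
}

# Roughly equivalent built-in test (two-sided, without continuity correction):
# prop.test(c(X_exp, X_cont), c(N_exp, N_cont), correct = FALSE)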

\(~\)

1.6. Practical significance

Practical significance is the level of change that you would expect to see, from a business standpoint, for the change to be valuable. What is considered practically significant can vary by field: in medicine, one might require a 5, 10, or 15% improvement for the result to be considered practically significant, whereas at Google, for example, a 1-2% improvement in click-through probability is practically significant.

The statistical significance bar is usually set lower than the practical significance bar, so that if the outcome is practically significant, it is also statistically significant.

\(~\)

1.7. Size vs Power trade-off

One of the key design decisions is the number of data points needed to get a statistically significant result; this is governed by statistical power. Power has an inverse trade-off with size: the smaller the change you want to detect, or the more confidence you want in the result, the larger the experiment you have to run.

As you increase the number of samples, the confidence interval narrows around the mean.

α=P(reject null | null true)

β=P(fail to reject null | null false)

1−β is referred to as the sensitivity of the experiment, or statistical power. People often choose high sensitivity, typically around 80%.

For a small sample, α is low and β is high. For a large sample α remains the same but β goes down (i.e. sensitivity increases). As you change one of the parameters, your sample size will change as well. For example:

- If you increase the baseline click-through probability (while it remains below 0.5), the standard error increases, and therefore you need more samples.
- If you increase the practical significance level, you require fewer samples, since larger changes are easier to detect.
- If you increase the confidence level, you want to be more certain that you are rejecting the null; at the same sensitivity, this requires more samples.
- If you want to increase the sensitivity, you need to collect more samples.
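
Base R's power.prop.test makes these trade-offs concrete; the baseline (2%) and minimum detectable change (0.5%) below are hypothetical:

# Required sample size per group for a 2% baseline, a 0.5% absolute change,
# alpha = 0.05 and power (sensitivity) = 0.8
power.prop.test(p1 = 0.02, p2 = 0.025, sig.level = 0.05, power = 0.8)

# Asking for more power, a smaller change, or a lower alpha all increase n
power.prop.test(p1 = 0.02, p2 = 0.025, sig.level = 0.05, power = 0.9)
power.prop.test(p1 = 0.02, p2 = 0.022, sig.level = 0.05, power = 0.8)
power.prop.test(p1 = 0.02, p2 = 0.025, sig.level = 0.01, power = 0.8)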

The example below demonstrates the analysis of an A/B test:

N_cont = 10072  # Control samples (pageviews)
N_exp = 9886  # Test samples (pageviews)
X_cont = 974  # Control clicks
X_exp = 1242  # Exp. clicks

p_pool = (X_cont + X_exp)/(N_cont+N_exp)
se_pool = sqrt(p_pool*(1-p_pool)*(1/N_cont + 1/N_exp))

p_cont = X_cont/N_cont
p_exp = X_exp/N_exp
(d = p_exp - p_cont)
## [1] 0.02892847
(m = 1.96*se_pool)
## [1] 0.008717974
cf_min = d-m
cf_max = d+m
d_min = 0.02 # Minimum practical significance value for difference
cf_min
## [1] 0.0202105
cf_max
## [1] 0.03764645

In the above example, since the lower confidence limit (0.0202) is greater than both 0 and the practical significance level of 0.02, we conclude that the increase in click-through probability is both statistically and practically significant. Based on this, one would launch the new version.

\(~\)

2. Metrics

There are two types of checks:

- Invariant checking: metrics that shouldn't change between your test and control groups
- Evaluation: high-level business metrics and measures of the user experience with the product

How do we go about defining a metric (for sanity checking)?

- Start with the high-level concept of the metric (e.g. active users, CTR)
- Work out the details (e.g. how exactly do you define user activity)
- Take a set of metrics and summarize them into a single metric (e.g. an overall evaluation criterion (OEC))

For evaluation, you can choose either one metric or a whole suite of metrics. If you have multiple metrics, you can combine them into a single metric, such as an objective function or an Overall Evaluation Criterion (OEC), a term that Microsoft uses.

A final consideration is how generally applicable the metric is. If you are running a suite of A/B tests, it is preferable to have a metric that works across the entire suite.

The user funnel is the series of steps users take through the site. It is called a funnel because every subsequent stage has fewer users than the stage above. Each stage can be measured as a metric: a total count, a rate, or a probability (i.e. whether a unique user progressed down the funnel).

Some metrics may be difficult to measure because you don't have access to the data, and/or it takes too long to collect. External data can be used instead. There are three categories of sources that gather such data:

- Companies that collect data (e.g. Comscore, Nielsen)
- Companies that conduct surveys (e.g. Pew)
- Academic papers

These can help you benchmark your own metrics against the industry.

Internal data can be used as well. You could do:

- Retrospective analysis: look at historic data to examine past changes and their effects
- Surveys and user experience research: these help you develop ideas on what you want to research

The problem with these studies is that they show you correlation, not causation, unlike a controlled experiment.

Talk to your colleagues about what ideas they think make sense for metrics.

You can gather additional data by:

- User Experience Research (UER): high depth on a few users. Good for brainstorming. You can also use special equipment in a UER (e.g. an eye-tracking camera) that you cannot use on your website. You may want to validate the results using retrospective analysis.

- Focus groups: medium depth and a medium number of participants. You can get feedback on hypotheticals, but may run into the issue of groupthink.

- Surveys: low depth but a high number of participants. Useful for metrics you cannot measure directly. Results can't be compared directly with internal metrics, since the survey population and the site population may differ.

Example of a metric:

High level metric - Click-through probability

You may have to filter out spam and fraud to de-bias the data. One way to figure out whether your filtering is biasing or de-biasing the data is to slice your data and calculate the metric for each slice after filtering. If the filter affects any slice disproportionately, it may be biasing your data.
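
A sketch of this slicing check, on a hypothetical event-level data frame (the column names are made up):

# One row per pageview
events = data.frame(
  country = c("US", "US", "NZ", "NZ", "US", "NZ"),
  click   = c(1, 0, 1, 0, 0, 1),
  is_spam = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# Click-through probability per slice, before and after filtering out spam
ctp_all      = tapply(events$click, events$country, mean)
ctp_filtered = tapply(events$click[!events$is_spam],
                      events$country[!events$is_spam], mean)

# A large change in one slice but not the others suggests the filter is biasing the data
round(rbind(ctp_all, ctp_filtered), 3)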

To remove weekly effects when looking, say, at total active cookies over time, use week-over-week values, i.e. divide the current data by the data from a week ago. Alternatively, one can use year-over-year values.
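
A base R sketch of week-over-week normalization for a daily series (the cookie counts are hypothetical):

# Hypothetical daily active cookies over two weeks
cookies = c(5200, 5300, 5150, 5400, 5380, 4200, 4100,
            5260, 5350, 5230, 5500, 5410, 4300, 4180)

# Divide each day in week 2 by the same weekday one week earlier
wow = cookies[8:14]/cookies[1:7]
round(wow, 3)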

\(~\)

2.1. Characteristics of a metric:

Sensitivity and robustness: whether the metric is sensitive to changes you care about and robust to changes you don't care about (e.g. the mean is sensitive to outliers; the median is robust but not sensitive to changes affecting only a small group of users). This can be assessed by using prior experiments to see whether the metric moves in a way that intuitively makes sense. Another alternative is to run A/A tests to see whether the metric picks up any spurious differences.

Distribution: obtained by plotting the distribution of the metric on retrospective data.

\(~\)

2.2. Categories of metrics

4 categories of metrics

\(~\)

2.3. Absolute vs Relative

The simplest way to compare metrics for test and control is to take a difference.

If you are running a lot of experiments, you want to use the relative difference, i.e. the percentage change. The main advantage of computing the percentage change is that you only have to choose one practical significance boundary, and it stays valid over time: if you run many experiments over time, your metrics will probably drift, and the relative difference lets you keep a single practical significance boundary rather than changing it as the system changes. The main disadvantage is variability: relative differences such as ratios are not as well behaved as absolute differences.

\(~\)

2.4. Calculating variability

We want to check the variability of a metric to later determine the sizing of the experiment and to analyze confidence intervals and draw conclusions. If we have a metric that varies a lot, then the practical significance level that we are looking for may not be feasible.

To calculate a confidence interval, you need an estimate of the metric's variance and knowledge of its distribution:

- For a binomial distribution, the estimated variance of the proportion is \(p(1-p)/N\).
- The estimated variance of a mean is \(\sigma^2/N\). The mean is typically normally distributed irrespective of the distribution of the underlying data, due to the Central Limit Theorem.
- If the underlying data are normal, the median will also be normal; if the underlying data are not normal, the median may not be.
- A difference in counts may be normally distributed; the variance of the difference is the sum of the variances of the two counts.
- Rates tend to follow a Poisson distribution, and the variance of a Poisson is just its mean.
- For ratios of experiment over control, both the mean and the variance depend on the distributions of the experiment and control metrics.

For example, a 95% confidence interval for the mean of a week of daily counts:

x <- c(87029, 113407, 84843, 104994, 99327, 92052, 60684)
stder <- sd(x)/sqrt(length(x))
conf95_min = mean(x) -1.96*stder
conf95_max = mean(x) + 1.96*stder
conf95_min
## [1] 79157.54
conf95_max
## [1] 104367

\(~\)

2.5. Non-parametric methods

Non-parametric methods analyze the data without making assumptions about its distribution. At Google, it was observed that analytical estimates of variance were often underestimates, so they resorted to empirical estimates based on A/A tests to evaluate variance. If you see a lot of variability in a metric in an A/A test, it is probably too sensitive to be used. Rather than run many separate A/A tests, one option is to run a single large A/A test and then bootstrap small groups from it to assess the variability.
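
A minimal sketch of this bootstrap idea, on simulated data (the click probability and group sizes are made up):

set.seed(42)
clicks = rbinom(100000, 1, 0.05)   # one large A/A group: 1 = click, 0 = no click

# Repeatedly split off two pseudo-groups and record the A/A difference in CTP
boot_diff = replicate(1000, {
  group_a = sample(clicks, 5000, replace = TRUE)
  group_b = sample(clicks, 5000, replace = TRUE)
  mean(group_b) - mean(group_a)
})

# Empirical spread of the A/A differences; compare it with the analytical estimate
sd(boot_diff)
quantile(boot_diff, c(0.025, 0.975))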

With A/A tests, we can compare the results to what we expect, estimate variance empirically, and estimate confidence intervals directly without making distributional assumptions.

In summary, different metrics have different variability. The variability may be high for certain metrics which makes them useless even if they make business or product sense. Computing the variability of a metric is tricky and one needs to take a lot of care.

For a lot of analysts, the majority of the time is spent validating and choosing metrics rather than actually running experiments. Being able to standardize the definitions is critical: when measuring latency, for instance, are you talking about when the first byte loads or when the last byte loads? Also, for latency the mean may not change at all; different signals (e.g. slow/fast connections or browsers) cause lumps in the distribution, and no central measure works, so one needs to look at the right percentile metric. The key point is that you are building intuition: you have to understand the data and the business, and work with the engineers to understand how the data is being captured.

\(~\)

3. Designing an Experiment

Typically you want to divert by person rather than by event, since otherwise the same user may see different versions of the change. If you divert by person, you typically use a cookie, which may differ by platform; the alternative is to use a user id.

\(~\)

3.1. Unit of Diversion

Commonly used units of diversion are:

  1. User identifier (id): typically the username or email address used on the website. It is typically stable and unchanging. If the user id is used as the unit of diversion, a given user is either in the test group or the control group. Note that a user id is personally identifiable information.

  2. Anonymous id: usually an anonymous identifier such as a cookie. It changes with browser or device, and users may clear or refresh their cookies (for example, whenever they log in). It is harder to reset a cookie in an app or on a phone than on a computer.

  3. Event: each event (e.g. a page load) is diverted independently, so the same user may see different versions. This is typically used only for changes that are not user-facing.

Lesser used units of diversion are:

  1. Device id: Typically available for mobile devices. It is tied to a specific device and cannot be changed by the user.

  2. IP address: the IP address is location-specific but may change as the user changes location. It can be useful, for example, when testing an infrastructure change to measure the impact on latency.

There are three main considerations in selecting an appropriate unit of diversion; one of the most important is how it affects the variability of your metrics.

Variability is higher when calculated empirically than when calculated analytically if the unit of analysis (i.e. the denominator of the metric) is different from the unit of diversion.

For example, if the unit of diversion is a query, then coverage (= # queries with ads / # queries) will have lower variability than if a cookie is used as the unit of diversion. This is because, when diverting by query, the unit of diversion matches the unit of analysis (the denominator of the metric, i.e. the query).

In the medical industry, subjects are paired with each other based on location and demographics. However, given how little information there is about users on the internet, this is not widely practiced for web experiments.

For example, consider an experiment where you analyze data separately for a particular region (New Zealand) and for the rest of the world. For the global data that includes NZ and the rest of the world, what is the pooled standard error?

For the global (combined) data, you have:

N_cont = 50000 + 6021
X_cont = 2500 + 302
N_exp = 50000 + 5979
X_exp = 2500 + 374

(p_cont = X_cont/N_cont)
## [1] 0.05001696
(p_exp = X_exp/N_exp)
## [1] 0.05134068
p_pool = (X_cont + X_exp)/(N_cont + N_exp)
(se_pool = sqrt(p_pool*(1-p_pool)*(1/N_cont + 1/N_exp)))
## [1] 0.00131081

Since \(|p_{cont} - p_{exp}| < 1.96 \cdot se_{pool}\), the global difference is not statistically significant.

\(~\)

3.2. Population vs Cohort

A cohort is like an entering class for an analysis. A cohort may make more sense than the full population when:

- Looking for learning effects
- Examining user retention
- Wanting to increase user activity
- Anything that requires the user to be established

Practical considerations in experimental design include:

- Duration
- When to run the experiment
- The fraction of traffic to send to the experiment

There are two different types of learning effects:

- Change aversion
- Knowledge effect

When users first encounter a change they will react, but they will eventually adapt to it.

\(~\)

4. Analyzing Results

\(~\)

4.1. Sanity Tests

One of the first things to do once you finish collecting experimental data is to check the invariants. This is done by calculating the values of one or more invariants in the test and control groups and checking whether the difference is statistically significant. For example, if the counts of an invariant (say the total number of cookies) in the control and experiment groups are x and y, then, since one would expect an even split, the expected fraction in each group is 0.5 with standard error \(\sqrt{\frac{0.5 \cdot 0.5}{x+y}}\), and the margin is \(1.96 \cdot se\). If the observed fraction \(\frac{x}{x+y}\) lies within \(0.5 \pm\) margin, the difference in the invariant is not significant; if it lies outside, the difference is significant and needs to be investigated further.

An example is provided below:

control_event_ct = c(2451,2475,2394,2482,2374,1704,1468)
test_event_ct = c(2404,2507,2376,2444,2504,1612,1465)
control_total = sum(control_event_ct)
test_total= sum(test_event_ct)
(p_cont = control_total/(control_total+test_total))
## [1] 0.5005871
(p_test = test_total/(control_total + test_total))
## [1] 0.4994129
se = sqrt(0.5*0.5/(control_total+test_total))
margin = 1.96*se
(p_cf_min = 0.5 - margin)
## [1] 0.4944032
(p_cf_max = 0.5 + margin)
## [1] 0.5055968
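
Since the observed control fraction (0.5006) lies within the 95% interval [0.4944, 0.5056], this invariant passes the sanity check.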

\(~\)

4.2. Analysis with a Single Metric

# Data provided from test
Xs_cont = c(196, 200, 200, 216, 212, 185, 225, 187, 205, 211, 192, 196, 223, 192)
Ns_cont = c(2029, 1991, 1951, 1985, 1973, 2021, 2041, 1980, 1951, 1988, 1977, 2019, 2035, 2007) 
Xs_exp = c(179, 208, 205, 175, 191, 291, 278, 216, 225, 207, 205, 200, 297, 299)
Ns_exp = c(1971, 2009, 2049, 2015, 2027, 1979, 1959, 2020, 2049, 2012, 2023, 1981, 1965, 1993)

Xs_cont_sum = sum(Xs_cont)
Ns_cont_sum = sum(Ns_cont)
Xs_exp_sum = sum(Xs_exp)
Ns_exp_sum = sum(Ns_exp)

p_cont = Xs_cont_sum/Ns_cont_sum
p_exp = Xs_exp_sum/Ns_exp_sum

# Empirical standard error and count provided
empirical_se = 0.0062
empirical_ct = 5000
se = (sqrt(1/Ns_cont_sum + 1/Ns_exp_sum))*empirical_se/sqrt(1/empirical_ct + 1/empirical_ct)

# Calculating the cf for the difference
d = p_exp-p_cont
margin = se*1.96
d_c95min = d - margin
d_c95max = d + margin

# Sign test

diff_sign = Xs_exp/Ns_exp - Xs_cont/Ns_cont
pos_diff = sum(diff_sign > 0)  # number of days with a positive difference
# Two-sided sign test: under H0, positive and negative days are equally likely
binom.test(pos_diff, length(diff_sign), p = 0.5)

One thing to be wary of is Simpson’s paradox, where the effect in aggregate may indicate one trend, and at a granular level may show an opposite trend.

\(~\)

4.3. Multiple checks

The more things you test, the more likely you are to see a significant difference just by chance. This is a problem, but since a spurious difference will not repeat for the same metric across multiple attempts, there are ways out: one can do multiple runs of the experiment, or alternatively bootstrap. Another technique, multiple comparisons, adjusts the significance level to account for how many metrics or tests you are running.

For example, if you had 10 metrics and used a 99% confidence interval for each metric, what is the probability that at least one of the metrics shows up as a false positive?

p1 = 0.99          # probability an individual metric is not a false positive
p_nofp = p1^10     # probability that none of the 10 metrics is a false positive
p_fp = 1-p_nofp    # probability of at least one false positive
p_fp
## [1] 0.09561792

With a 95% confidence interval per metric, this probability would be \(1 - 0.95^{10} \approx 0.40\). As you increase the number of metrics, you therefore need to use a higher confidence level for each individual metric to keep the overall false-positive rate down.

A different method used in practice is the Bonferroni correction. It has the advantages of being simple and making no assumptions, and it is guaranteed to give an \(\alpha_{overall}\) at least as low as you have specified.

To use it, calculate \(\alpha_{individual} = \frac{\alpha_{overall}}{n}\)

For example, if you want \(\alpha_{overall}\) to be 0.05 and there are 3 metrics, then \(\alpha_{individual}\) will be \(0.05/3 = 0.0167\).

The Bonferroni method may be very conservative. Alternatives include the closed testing procedure, the Boole-Bonferroni bound, and the Holm-Bonferroni method. The \(\alpha_{overall}\) above is often referred to as the family-wise error rate (FWER). Another approach is to control the false discovery rate (FDR), defined as (# false positives)/(# rejections). Controlling the FDR makes sense if you have a large number of metrics (e.g. 200).
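
Base R's p.adjust applies these corrections directly; the p-values below are hypothetical:

# Hypothetical raw p-values for five metrics
p_raw = c(0.010, 0.041, 0.004, 0.210, 0.049)

p.adjust(p_raw, method = "bonferroni")  # simple but conservative: multiplies each p by n
p.adjust(p_raw, method = "holm")        # Holm-Bonferroni: same FWER guarantee, more powerful
p.adjust(p_raw, method = "BH")          # Benjamini-Hochberg: controls the FDR instead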

An alternative to using multiple metrics is to use an ‘Overall Evaluation Criterion’ (OEC).

\(~\)

4.4. Gotchas

The effect may change as you ramp up the launch of the change. There can also be seasonal effects: for example, students on summer break behave very differently than when they come back, and similarly during Black Friday and other holidays. One way to account for this is to keep a small hold-out group that does not get the change and track it over time.

\(~\)

5. Summary

The decision to launch a change should be guided by business reasons and not just by the test results: weigh the opportunity cost of launching the change (engineering costs, impact on user experience) against the cost of not launching it.