This document consists of notes based on the course ‘A/B Testing’ on Udacity by Diane Tang, Carrie Grimes, and Caroline Buckey. These notes are not comprehensive. Please refer to the course page for details.
A/B testing is a methodology for testing product changes. You split your users into two groups: a control group that sees the default feature, and an experiment group that sees the new feature. Examples of A/B testing include Amazon's personalized recommendations and ranking changes at LinkedIn (e.g. whether to show a news item or a possible connection to a user). The goal of A/B testing is to design an experiment that is robust and gives you repeatable results, so that you can make a good decision about whether to launch the product or feature.
A/B testing is not good for testing new experiences. It may result in change aversion (where users don’t like changes to the norm), or a novelty effect (where users see something new and test out everything).
Two issues arise with new experiences: a) having a baseline to compare against, and b) deciding how much time to allow for users to adapt to the new experience, so that you can estimate the plateaued experience and make a robust decision.
Finally, A/B testing cannot tell you if you are missing something.
In these cases, user logs can be used to develop hypotheses that can then be tested in an A/B test. A/B testing gives broad quantitative data, while other techniques such as user research, focus groups, and human evaluation give you deep qualitative data.
A/B testing originated in agriculture. In medicine, the equivalent of A/B testing is the clinical trial. The difference between clinical trials and online A/B testing is that in clinical trials you have few patients with a lot of information about each one, whereas in online A/B tests you have millions of users but limited information about them.
Generally, rate is used when you want to measure the usability of the site, and probability when you want to measure the impact.
For a binomial distribution with probability \(p\), the mean is given by \(p\) and the standard deviation is \(\sqrt{p(1-p)/N}\), where \(N\) is the number of trials. A binomial distribution can be used when the outcomes are of only two types, each event is independent of the others, and the events are identically distributed (i.e. \(p\) is the same for all of them).
A confidence interval indicates the range within which the mean is expected to fall in multiple trials of the experiment.
For example, consider \(\hat{p}\), the proportion of \(N\) users that click. Let us assume a binomial distribution (this requires \(N\hat{p}>5\) and \(N(1-\hat{p})>5\)). The margin of error is given by \[ m = z \cdot SE = z \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{N}} \]
For a 95% confidence interval, z = 1.96.
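As a small illustration in R (the counts N and X below are made up for the sketch, not taken from the course):
# Hypothetical data: 1000 users, 120 of whom clicked
N = 1000
X = 120
p_hat = X/N                        # estimated click-through probability
se = sqrt(p_hat*(1 - p_hat)/N)     # standard error of the estimate
m = 1.96*se                        # margin of error for a 95% confidence interval
c(p_hat - m, p_hat + m)            # 95% confidence interval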
The null hypothesis states that the difference between the control and experiment groups is due to chance. If \(p_{cont}\) and \(p_{exp}\) are the control and experiment probabilities, then according to the null hypothesis \[ H_0: p_{exp}-p_{cont} = 0 \]
The alternative hypothesis is \[ H_1: p_{exp} - p_{cont} \neq 0 \]
For comparing two samples, we calculate the pooled standard error. For example, suppose \(X_{cont}\) and \(N_{cont}\) are the number of users that click and the total number of users in the control group, and \(X_{exp}\) and \(N_{exp}\) are the corresponding values for the experiment group. The pooled probability and pooled standard error are given by \[ \hat{p}_{pool}= \frac{X_{cont}+X_{exp}}{N_{cont}+ N_{exp}}\] \[ SE_{pool} = \sqrt{\hat{p}_{pool}(1-\hat{p}_{pool})\left(\frac{1}{N_{cont}}+\frac{1}{N_{exp}}\right)} \] \[ \hat{d} =\hat{p}_{exp}-\hat{p}_{cont}\] \[ H_0: d = 0 \text{, with } \hat{d} \sim N(0,SE_{pool}^2)\]
If \(\hat{d} > 1.96 \cdot SE_{pool}\) or \(\hat{d} < -1.96 \cdot SE_{pool}\), we can reject the null hypothesis and state that the difference is statistically significant.
Practical significance is the level of change you would expect to see, from a business standpoint, for the change to be valuable. What counts as practically significant varies by field. In medicine, one might require a 5%, 10%, or 15% improvement for the result to be considered practically significant. At Google, for example, a 1-2% improvement in click-through probability is practically significant.
The statistical significance bar is often set lower than the practical significance bar, so that if the outcome is practically significant, it is also statistically significant.
One of the design decisions is the number of data points needed to get a statistically significant result; this is determined by the statistical power. Power has an inverse trade-off with size: the smaller the change you want to detect, or the higher the confidence you want in the result, the larger the experiment you have to run.
As you increase the number of samples, the confidence interval narrows around the mean.
\[\alpha = P(\text{reject null | null true}) \] \[\beta = P(\text{fail to reject null | null false}) \] \(1-\beta\) is referred to as the sensitivity of the experiment, or statistical power. People often choose high sensitivity, typically around 80%.
For a small sample, \(\alpha\) is low and \(\beta\) is high. For a large sample, \(\alpha\) remains the same but \(\beta\) goes down (i.e. sensitivity increases). A good online calculator for determining the number of samples is here. As you change one of the parameters, the required sample size changes as well.
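As a rough sketch of how the required sample size responds to these parameters, using R's built-in power.prop.test (the baseline and target click-through probabilities below are made-up values):
# Baseline click-through probability 0.10; detect an absolute change of 0.02
# at alpha = 0.05 with 80% power (sensitivity)
power.prop.test(p1 = 0.10, p2 = 0.12, sig.level = 0.05, power = 0.80)
# Detecting a smaller change, or demanding higher power, increases n per group
power.prop.test(p1 = 0.10, p2 = 0.11, sig.level = 0.05, power = 0.90)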
The example below demonstrates the analysis of an A/B test:
N_cont = 10072 # Control samples (pageviews)
N_exp = 9886 # Test samples (pageviews)
X_cont = 974 # Control clicks
X_exp = 1242 # Exp. clicks
p_pool = (X_cont + X_exp)/(N_cont+N_exp)
se_pool = sqrt(p_pool*(1-p_pool)*(1/N_cont + 1/N_exp))
p_cont = X_cont/N_cont
p_exp = X_exp/N_exp
d = p_exp - p_cont
d
## [1] 0.02892847
m = 1.96*se_pool
m
## [1] 0.008717974
cf_min = d-m
cf_max = d+m
d_min = 0.02 # Minimum practical significance value for difference
cf_min
## [1] 0.0202105
cf_max
## [1] 0.03764645
In the above example, since the lower confidence bound is greater than both 0 and the practical significance level of 0.02, we conclude that the change in click-through probability is very likely greater than 0.02, i.e. both statistically and practically significant. Based on this, one would launch the new version.
Two types of checks: invariant (sanity) checks, using metrics that should not change between the control and experiment groups, and evaluation, using the metrics you expect the change to affect.
How do we go about making a definition of a metric (for sanity checking)?
For evaluation, you can choose either one metric or a whole suite of metrics. If you have multiple metrics, you can combine them into a single metric, such as an objective function or an Overall Evaluation Criterion (OEC), a term that Microsoft uses.
The last consideration is how generally applicable the metric is. If you are running a suite of A/B tests, it is preferable to have a metric that works across the entire suite.
The user funnel is the series of steps users take through the site. It is called a funnel because every subsequent stage has fewer users than the stage above it. Each stage can be turned into a metric: a total count, a rate, or a probability (i.e. whether a unique user progressed to the next stage).
Some metrics may be difficult to measure because you do not have access to the data, or because it takes too long to collect.
External data can be used as well; there are three broad categories of companies that gather such data. This data can help you benchmark your own metrics against the industry.
Internal data can be used as well, for example by running retrospective analyses on your own logs. The problem with such studies is that they show correlation, not causation, unlike a randomized experiment.
Talk to your colleagues about what ideas they think make sense for metrics.
You can gather additional data using the following techniques:
User Experience Research (UER): High depth on a few users. This is good for brainstorming. You can also use special equipment in a UER (e.g. an eye-tracking camera) that you cannot use on your website. You may want to validate the results using retrospective analysis.
Focus groups: Medium depth and a medium number of participants. You can get feedback on hypotheticals, but may run into the issue of groupthink.
Surveys: Low depth but a high number of participants. Useful for metrics you cannot directly measure. Results can't be directly compared with other metrics, since the survey population and the population behind your internal metrics may be different.
Example of a metric:
High-level metric: Click-through probability.
Def #1 (Cookie probability): For each time interval, the number of cookies that click divided by the number of cookies.
Def #2 (Pageview probability): The number of pageviews with a click within the time interval divided by the number of pageviews.
Def #3 (Rate): The number of clicks divided by the number of pageviews.
You may have to filter out spam and fraud to de-bias the data. One way to check whether filtering is biasing (or de-biasing) the data is to slice your data and calculate the metric for each slice after filtering. If the filter affects any slice disproportionately, it may be biasing your data.
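A minimal sketch of this check, assuming a hypothetical events table with a slice label, click and pageview counts, and a spam flag (all values made up):
# Hypothetical data: per-row counts with a slice label and a spam flag
events = data.frame(
  slice     = c("US", "US", "UK", "UK"),
  clicks    = c(500, 40, 300, 10),
  pageviews = c(5000, 600, 3000, 50),
  is_spam   = c(FALSE, TRUE, FALSE, TRUE)
)
# Click-through probability per slice, before filtering
ctp_all = tapply(events$clicks, events$slice, sum) / tapply(events$pageviews, events$slice, sum)
# Click-through probability per slice, after removing rows flagged as spam
keep = !events$is_spam
ctp_filtered = tapply(events$clicks[keep], events$slice[keep], sum) /
  tapply(events$pageviews[keep], events$slice[keep], sum)
# A ratio far from 1 for only one slice suggests the filter is biasing that slice
ctp_filtered / ctp_all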
To remove weekly seasonality when looking, say, at total active cookies over time, use week-over-week values, i.e. divide the current data by the data from a week ago. Alternatively, one can use year-over-year.
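For instance, with a hypothetical vector of daily counts covering two weeks, the week-over-week series could be computed as:
# Hypothetical two weeks of daily counts
daily_cookies = c(1000, 1180, 1050, 1220, 1150, 900, 760,
                  1030, 1210, 1080, 1260, 1190, 930, 790)
# Each day divided by the same weekday one week earlier
wow = daily_cookies[8:14] / daily_cookies[1:7]
wow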
Sensitivity and robustness: whether the metric is sensitive to changes you care about and robust to changes you don't care about (e.g. the mean is sensitive to outliers; the median is robust to outliers but not sensitive to changes affecting a small group of users). This can be assessed by checking prior experiments to see whether the metric moves in a way that makes intuitive sense. Another alternative is to run A/A tests to see if the metric picks up any spurious differences.
Distribution: obtained by plotting a histogram of the metric on retrospective data.
There are four categories of metrics: sums and counts; means, medians, and percentiles; probabilities and rates; and ratios.
The simplest way to compare a metric between experiment and control is to take the absolute difference.
If you are running a lot of experiments over time, you generally want to use the relative difference, i.e. the percentage change. Because your metrics are probably changing over time, the main advantage of the relative difference is that you only need to choose one practical significance boundary, rather than adjusting it as the system changes. The main disadvantage is variability: relative differences, like other ratios, are not as well behaved as absolute differences.
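A small illustration with made-up numbers:
# Absolute vs. relative difference between experiment and control
p_cont = 0.10
p_exp = 0.11
d_abs = p_exp - p_cont   # 0.01 absolute difference
d_rel = d_abs / p_cont   # 0.10, i.e. a 10% relative change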
We want to check the variability of a metric to later determine the sizing of the experiment and to analyze confidence intervals and draw conclusions. If we have a metric that varies a lot, then the practical significance level that we are looking for may not be feasible.
To calculate the confidence interval, you need the variance (or standard deviation) of the metric and an assumption about its distribution.
For a probability (binomial), the estimated variance is \(p(1-p)/N\). The estimated variance of a mean is \(\sigma^2/N\); the mean is typically normally distributed irrespective of the distribution of the underlying data, due to the Central Limit Theorem. If the underlying data are normal, the median will also be approximately normal, but if the underlying data are not normal, the median may not be. A difference in counts may be normally distributed, and its variance is the sum of the variances of the two counts. Rates tend to follow a Poisson distribution, whose variance equals its mean. For ratios of experiment over control, both the mean and the variance depend on the distributions of the experiment and control metrics.
For example, to compute an analytical 95% confidence interval for the mean of a week of daily counts:
x <- c(87029, 113407, 84843, 104994, 99327, 92052, 60684)  # one week of daily counts
stder <- sd(x)/sqrt(length(x))   # standard error of the mean
conf95_min = mean(x) - 1.96*stder
conf95_max = mean(x) + 1.96*stder
conf95_min
## [1] 79157.54
conf95_max
## [1] 104367
Empirical variability is a way to analyze data without making an assumption about the distribution. At Google, it was observed that analytical estimates of variance were often underestimates, so they resorted to empirical measurements based on A/A tests to evaluate variance. If you see a lot of variability in a metric in an A/A test, it is probably too sensitive to be used. Rather than running many separate A/A tests, one approach is to run a single large A/A test and then bootstrap small groups from it to assess variability (sketched below).
With A/A tests, we can compare the results to what we expect (a sanity check), estimate the variance empirically and combine it with an assumed distribution to calculate confidence intervals, or estimate the confidence interval directly from the A/A data without any distributional assumption.
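A minimal bootstrap sketch along these lines, using simulated click data as a stand-in for a real large A/A group:
# Simulated stand-in for per-cookie click indicators from one large A/A group
set.seed(42)
clicks = rbinom(100000, 1, 0.1)
# Repeatedly draw small groups and compute the metric (click-through probability)
group_size = 1000
boot_ctp = replicate(200, mean(sample(clicks, group_size)))
sd(boot_ctp)                          # empirical variability of the metric
quantile(boot_ctp, c(0.025, 0.975))   # empirical 95% interval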
In summary, different metrics have different variability. The variability may be high for certain metrics which makes them useless even if they make business or product sense. Computing the variability of a metric is tricky and one needs to take a lot of care.
For many analysts, the majority of the time is spent validating and choosing metrics rather than actually running experiments. Being able to standardize metric definitions is critical. When measuring latency, for example, are you measuring when the first byte loads or when the last byte loads? Also, for latency, the mean may not change at all: different signals (e.g. slow vs. fast connections or browsers) cause lumps in the distribution, and no single central measure works, so you need to look at the right percentile metric. The key point is that you are building intuition; you have to understand the data and the business, and work with the engineers to understand how the data is being captured.
Typically you want to divert by person rather than by event, since otherwise the same user may see different versions of the change. To divert by person you typically use a cookie, which can change across platforms and browsers; the alternative is to use a user id.
Commonly used units of diversion are:
User identifier (id): Typically the username or email address used on the website. It is typically stable and unchanging. If the user id is used as the unit of diversion, then a given user id is entirely in either the experiment group or the control group. Note that the user id is personally identifiable.
Anonymous id: Usually an anonymous identifier such as a cookie. It changes with browser or device, and users may clear or refresh their cookies. It is harder to clear a cookie on a phone or in an app than on a computer.
Event: A single action such as a page load; the same user may be diverted differently on every event. This is typically used only for changes that are not user-facing.
Lesser used units of diversion are
Device id: Typically available for mobile devices. It is tied to a specific device and cannot be changed by the user.
IP address: The IP address is location-specific but may change as the user changes location. It can be useful when testing infrastructure changes, e.g. to measure the impact on latency.
There are three main considerations in selecting an appropriate unit of diversion: how consistent the experience needs to be for the user, ethical considerations (e.g. whether informed consent is required), and the variability of the metric.
Variability is higher when it is calculated empirically than when it is calculated analytically whenever the unit of analysis (i.e. the denominator of the metric) differs from the unit of diversion.
E.g. if the unit of diversion is a query, then coverage (= # queries with ads / # queries) will have lower variability than if a cookie is used as the unit of diversion, because the unit of diversion then matches the unit of analysis (the denominator of the metric, i.e. the query).
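A small simulation sketch of this effect; the per-cookie query counts and ad propensities below are entirely made up:
# Each cookie contributes several correlated queries; diversion is by cookie
set.seed(1)
n_cookies = 20000
queries_per_cookie = rpois(n_cookies, 5) + 1
cookie_rate = rbeta(n_cookies, 2, 8)   # per-cookie propensity for a query to show ads
group_size = 500
coverage = replicate(500, {
  idx = sample(n_cookies, group_size)                    # one group of cookies
  with_ads = rbinom(group_size, queries_per_cookie[idx], cookie_rate[idx])
  sum(with_ads) / sum(queries_per_cookie[idx])           # coverage for this group
})
sd(coverage)   # empirical SE under cookie-level diversion
# Analytic SE that (wrongly) treats every query as independent; it comes out smaller
p = mean(coverage)
sqrt(p * (1 - p) / (group_size * mean(queries_per_cookie)))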
In the medical industry, subjects are paired with each other based on location, demographics, etc. However, given how little information there is about users on the internet, this is not widely practiced online.
For example, consider an experiment where you analyze data for a particular region (NZ) separately from the rest of the world. For the global data, which includes NZ and the rest of the world, what is the pooled standard error?
For the global data, you have:
N_cont = 50000 + 6021   # Control pageviews: rest of world + NZ
X_cont = 2500 + 302     # Control clicks: rest of world + NZ
N_exp = 50000 + 5979    # Experiment pageviews: rest of world + NZ
X_exp = 2500 + 374      # Experiment clicks: rest of world + NZ
p_cont = X_cont/N_cont
p_cont
## [1] 0.05001696
p_exp = X_exp/N_exp
p_exp
## [1] 0.05134068
p_pool = (X_cont + X_exp)/(N_cont + N_exp)
se_pool = sqrt(p_pool*(1-p_pool)*(1/N_cont + 1/N_exp))
se_pool
## [1] 0.00131081
Since \(|p_{exp} - p_{cont}| < 1.96 \cdot se_{pool}\), the global difference is not statistically significant.
A cohort is like an entering class for an analysis: a group of users who entered the experiment during the same time window. A cohort may make more sense than the full population when you are looking at learning effects, examining user retention, or measuring changes in user activity, i.e. anything that requires users to be established for some time.
Practical considerations in experimental design
Two different types of learning effects are change aversion, where users initially use the new feature less because they resist change, and the novelty effect, where users initially use it more because it is new.
When users first encounter a change they will react, but will eventually adapt to a change.
One of the first things to do once you finish collecting experimental data is to check the invariants. This is done by calculating the value of one or more invariant metrics for the experiment and control groups and checking whether the difference is statistically significant. For example, if the values of an invariant (say, the total number of cookies) are \(x\) in the control group and \(y\) in the experiment group, then calculate \(se = \sqrt{\frac{0.5 \cdot 0.5}{x+y}}\), since one would expect each cookie to land in either group with probability 0.5, and the margin as \(1.96 \cdot se\). If the observed fraction \(x/(x+y)\) lies within \(0.5 \pm \text{margin}\), the difference is not significant and the sanity check passes; if it lies outside this interval, the difference is significant and needs to be investigated further.
An example is provided below:
control_event_ct = c(2451,2475,2394,2482,2374,1704,1468)
test_event_ct = c(2404,2507,2376,2444,2504,1612,1465)
control_total = sum(control_event_ct)
test_total= sum(test_event_ct)
p_cont = control_total/(control_total+test_total)
p_test = test_total/(control_total + test_total)
p_cont
## [1] 0.5005871
p_test
## [1] 0.4994129
se = sqrt(0.5*0.5/(control_total+test_total))
margin = 1.96*se
p_cf_min = 0.5 - margin
p_cf_max = 0.5 + margin
p_cf_min
## [1] 0.4944032
p_cf_max
## [1] 0.5055968
In the example above, the observed control fraction (0.5006) lies within the interval [0.4944, 0.5056], so the sanity check passes. The most common reason for sanity checks failing is a problem with data capture. Other reasons include problems with the experimental set-up, for example a filter applied to the experiment group but not to the control group.
# Data provided from test
Xs_cont = c(196, 200, 200, 216, 212, 185, 225, 187, 205, 211, 192, 196, 223, 192)
Ns_cont = c(2029, 1991, 1951, 1985, 1973, 2021, 2041, 1980, 1951, 1988, 1977, 2019, 2035, 2007)
Xs_exp = c(179, 208, 205, 175, 191, 291, 278, 216, 225, 207, 205, 200, 297, 299)
Ns_exp = c(1971, 2009, 2049, 2015, 2027, 1979, 1959, 2020, 2049, 2012, 2023, 1981, 1965, 1993)
Xs_cont_sum = sum(Xs_cont)
Ns_cont_sum = sum(Ns_cont)
Xs_exp_sum = sum(Xs_exp)
Ns_exp_sum = sum(Ns_exp)
p_cont = Xs_cont_sum/Ns_cont_sum
p_exp = Xs_exp_sum/Ns_exp_sum
# Empirical standard error and count provided
empirical_se = 0.0062
empirical_ct = 5000
se = (sqrt(1/Ns_cont_sum + 1/Ns_exp_sum))*empirical_se/sqrt(1/empirical_ct + 1/empirical_ct)
# Calculating the cf for the difference
d = p_exp-p_cont
margin = se*1.96
d_c95min = d - margin
d_c95max = d + margin
# Sign test: count the days on which the experiment beats the control
diff_sign = Xs_exp/Ns_exp - Xs_cont/Ns_cont
pos_diff = sum(diff_sign > 0)   # number of days with a positive difference
# One way to get the sign-test p-value: under H0, a positive difference has probability 0.5 each day
binom.test(pos_diff, length(diff_sign))
One thing to be wary of is Simpson's paradox, where the aggregate data may indicate one trend while the data at a more granular level shows the opposite trend.
The more things you test, the more likely you are to see a significant difference just by chance. This is a problem, but because a spurious difference will not repeat for the same metric across multiple attempts, there are ways out: you can do multiple runs of the experiment, or bootstrap. Another technique, multiple comparison correction, adjusts your significance level to account for how many metrics or tests you are running.
For example, if you had 10 metrics and used a 99% confidence interval for each metric, what is the probability that at least one of the metrics shows up as a false positive?
p1 = 0.99            # probability of no false positive on a single metric with a 99% CI
p_nofp = p1^10       # probability that none of the 10 metrics gives a false positive
p_fp = 1 - p_nofp    # probability of at least one false positive
p_fp
## [1] 0.09561792
As you increase the number of metrics, you can use a higher confidence level for each individual metric to keep the overall false positive rate under control.
A different method used in practice is the Bonferroni correction. Its advantages are that it is simple, makes no assumptions, and is guaranteed to give an overall \(\alpha\) at least as low as the level you specify.
To use it, calculate \[ \alpha_{individual} = \frac{\alpha_{overall}}{n}\]
For example, if you want \(\alpha_{overall}\) to be 0.05 and there are 3 metrics, then \(\alpha_{individual}\) will be \(0.05/3 \approx 0.0167\).
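A minimal sketch of applying the correction (continuing the 3-metric example above):
# Per-metric significance level and the corresponding two-sided z threshold
alpha_overall = 0.05
n_metrics = 3
alpha_individual = alpha_overall / n_metrics
qnorm(1 - alpha_individual/2)   # roughly 2.39, used in place of 1.96 for each metric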
Bonferroni methods may be very conservative. Alternatives include the closed testing procedure, the Boole-Bonferroni bound, and the Holm-Bonferroni method. The \(\alpha_{overall}\) above is often referred to as the familywise error rate (FWER). An alternative is to control the false discovery rate (FDR), defined as (# false positives)/(# rejections). Controlling the FDR makes sense when you have a large number of metrics (e.g. 200).
An alternative to using multiple metrics is to use an ‘Overall Evaluation Criterion’ (OEC).
The effect you measured may not persist as you ramp up the change to all users. There can also be seasonal effects: for example, students on summer break behave very differently than when they are back in school, and similarly during Black Friday and other holidays. One way to handle this is to keep a small hold-out group that does not get the change and track it over time.
change aversion: users averse to change reduce usage
interleaved experiment: an experiment in which you expose the same user to both A and B at the same time
inter-user experiment: an experiment in which each user is exposed to either A or B, but not both.
intra-user experiment: an experiment in which you expose the same user to the feature being on and off at different times.
novelty effect: users see something new and test out everything
retrospective analysis: look at historical data, observe changes, and conduct an evaluation
userflow: Shows the flow of users through the site, often referred to as a customer funnel.