In the business world, it’s common to run AB tests, or split tests, to help understand the impact of making changes to a given feature, marketing campaign, website, etc. When performing an AB test, we usually randomly assign users to one of two groups, each getting a different treatment. We then collect data over some period of time, compare the differences between the groups, and draw conclusions about whether the change we introduced led to an improvement or not.
A huge challenge for many business analysts is understanding the concept of uncertainty: how it arises from random sampling, and how AB tests are inherently an exercise in drawing samples from a population. This blog illustrates how random selection alone can influence the apparent monetary value of players in a mobile game. We will be simulating an AB test, but instead of giving each group a different treatment, we will randomly draw groups from the same base distribution, modeled on real player data.
In this demonstration, we start by loading 31,797 real players from a mobile game. These players are NOT in an AB test; they are just regular players in the game.
Next, we randomly assign these players to either Group 1 or Group 2. Since we aren’t running an actual experiment, we would expect any KPIs drawn from these groups to be essentially the same. We will simply look at the total (summed) revenue for each randomly drawn group.
We will then calculate the difference between the total revenue of Group 1 and Group 2 and save this difference. We repeat this process 1000 times: randomly splitting players into 2 groups, calculating the difference between the groups’ total revenue, and saving it for later analysis.
Lastly, we plot histograms showing the difference in revenue between the groups, and the total revenue we saw within each group, for each random draw.
This should help illustrate the inherent variability in KPIs between samples drawn from the same population, i.e. theoretically identical groups. The challenge for business analysts is to understand that just because one group appears to have higher revenue than the other over the course of an AB test, that doesn’t necessarily mean that group is the winner. We need statistical tests that tell us whether the differences we are seeing are more likely the result of random chance (as seen here) or an actual experimental effect (i.e. a difference greater than we might see by random chance alone).
num_simulations <- 1000 # How many simulated AB tests will we run?
set.seed(424242) # Initial random seed for reproducibility
options(scipen=999) # Disable scientific notation
# Load R libraries
library(MASS)
library(caret)
library(tidyverse)
library(mixtools)
library(ggplot2)
# Load Sample Player data
df <- read.csv('./player_data.csv')
# We will only be inspecting the lifetime revenue for each player (called life_viapo in the csv file)
df <- df %>%
mutate(revenue = life_viapo) %>%
dplyr::select(revenue)
# Add an id column to simulate the player's ID
df$id <- seq(1, nrow(df))
head(df)
## revenue id
## 1 0.000020 1
## 2 16.297810 2
## 3 2.768755 3
## 4 1.147800 4
## 5 0.000020 5
## 6 17.144170 6
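If you don’t have access to player_data.csv, a rough synthetic stand-in lets you follow along. The zero-inflated lognormal shape and all parameters below are illustrative assumptions, not fitted to the real data:
# Hypothetical stand-in for player_data.csv (assumption: most players spend
# almost nothing, while a minority follow a heavy-tailed lognormal distribution)
n_players <- 31797
is_spender <- rbinom(n_players, 1, 0.25) == 1
df <- data.frame(
  revenue = ifelse(is_spender, rlnorm(n_players, meanlog = 0, sdlog = 1.5), 0.00002),
  id = seq_len(n_players)
)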
We will now randomly assign each player to a group (Group_1 or Group_2) and repeat this process 1000 times. This simulates rerunning the same AB test over and over again.
# Create some empty data.frame's to hold the results of each simulated AB Test
group1_total <- data.frame(id=numeric(0), revenue=numeric(0))
group2_total <- data.frame(id=numeric(0), revenue=numeric(0))
group_diff <- data.frame(id=numeric(0), delta=numeric(0))
percent_diff <- data.frame(id=numeric(0), p_diff=numeric(0))
# What is the total revenue across all players?
total_revenue <- sum(df$revenue)
# Simulation loop - each iteration will simulate rerunning the AB test using the same players
for (i in 1:num_simulations) {
# Split the players into 2 randomized groups
df$Group <- sample(1:2, nrow(df), replace=T)
# Calculate the total revenue for Group1
group1_revenue <- df %>%
filter(Group == 1) %>%
select(revenue) %>%
sum()
# Calculate the total revenue for Group2
group2_revenue <- df %>%
filter(Group == 2) %>%
select(revenue) %>%
sum()
# Add these totals to our tracking data.frame
group1_total <- rbind(group1_total, list(id=i, revenue=group1_revenue))
group2_total <- rbind(group2_total, list(id=i, revenue=group2_revenue))
# Calculate the difference in revenue between the groups and save off for later analysis
delta <- group2_revenue - group1_revenue
group_diff <- rbind(group_diff, list(id=i, delta=delta))
percent_diff <- rbind(percent_diff, list(id=i, p_diff = 100 * delta / (group1_revenue + group2_revenue)))
}
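As a side note, growing data.frames with rbind() inside a loop is slow for large numbers of simulations. An equivalent, more idiomatic sketch pre-computes all runs at once (same logic, assuming df, total_revenue, and num_simulations as defined above):
# Equivalent vectorized version: one row per simulated AB test
sim_once <- function() {
  grp <- sample(1:2, nrow(df), replace = TRUE)
  g1 <- sum(df$revenue[grp == 1])
  c(g1 = g1, g2 = total_revenue - g1)
}
sims <- as.data.frame(t(replicate(num_simulations, sim_once())))
sims$delta  <- sims$g2 - sims$g1
sims$p_diff <- 100 * sims$delta / total_revenue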
Here we see the distribution of results across all our simulations. Note that it is approximately normal (as expected), centered close to 0. Since the two groups are drawn from the same population, we would expect each group to be very similar and thus their difference to be close to 0. However, as you can see, the difference between groups (both in raw dollar value and as a percentage) can in fact be quite large, purely as a result of how the players were assigned to groups.
# Over all simulation runs, plot Group 2 revenue minus Group 1 revenue as a raw dollar value
ggplot(group_diff) +
geom_histogram(aes(x=delta, y = ..density..), bins=50) +
geom_density(aes(x=delta), color='blue') +
xlab('Revenue Delta between Groups 1 & 2')
# Over all simulation runs, plot Group 2 revenue minus Group 1 revenue as a percentage
ggplot(percent_diff) +
geom_histogram(aes(x=p_diff, y = ..density..), bins=50) +
geom_density(aes(x=p_diff), color='green') +
xlab('Percent Difference between Groups 1 & 2')
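The summary statistics printed below can be reproduced from the tracking data.frames with something along these lines (a sketch; the original cat() calls are not shown):
# Summarize the simulated totals and differences (sketch)
cat('Total REVENUE across all players:', round(total_revenue, 2), '\n')
cat('Expected REVENUE for each group:', round(total_revenue / 2, 2), '\n')
cat('Group 1 mean REVENUE:', round(mean(group1_total$revenue), 2), '\n')
cat('Group 1 median REVENUE:', round(median(group1_total$revenue), 2), '\n')
cat('Group 2 mean REVENUE:', round(mean(group2_total$revenue), 2), '\n')
cat('Group 2 median REVENUE:', round(median(group2_total$revenue), 2), '\n')
cat('REVENUE mean Difference ($):', round(mean(group_diff$delta), 2), '\n')
cat('REVENUE median Difference ($):', round(median(group_diff$delta), 2), '\n')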
Total REVENUE across all players: 38495.13
Expected REVENUE for each group: 19247.57
ACROSS 1000 Random Samples:
Group 1 mean REVENUE: 19247.71 (expected value: 19247.57)
Group 1 median REVENUE: 19251.47 (expected value: 19247.57)
Group 2 mean REVENUE: 19247.71 (expected value: 19247.57)
Group 2 median REVENUE: 19251.47 (expected value: 19247.57)
REVENUE mean Difference ($): -0.28 (expected value: $0)
REVENUE median Difference ($): -7.8 (expected value: $0)
While the means and medians look really close to the expected values across 1000 samples, notice in the distribution plots that we have a large number of cases where there was up to a $2000 difference between the groups, based on pure random chance. Just looking at an aggregate mean or sum hides the variability we might actually see in practice.
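We can put numbers on that spread directly from the simulation results, and compare it to the theoretical spread. Because each player’s revenue enters the delta with an independent random sign under a 50/50 split, \(\mathrm{SD}(\Delta) = \sqrt{\sum_i r_i^2}\). A sketch using the data.frames built above:
# Empirical spread of the simulated differences (dollars and percent)
quantile(group_diff$delta, probs = c(0.025, 0.5, 0.975))
quantile(percent_diff$p_diff, probs = c(0.025, 0.5, 0.975))
# Theoretical SD of delta under independent 50/50 assignment: sqrt(sum(r_i^2))
c(theoretical_sd = sqrt(sum(df$revenue^2)), empirical_sd = sd(group_diff$delta))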
To sum up, when running AB tests, it’s not sufficient to just look at group aggregate totals to determine which group was “the winner”. We see that random assignment alone can introduce variability that makes one group appear to be better than the other when, in fact, there are no real differences. Given this, it is critical that the person analyzing AB test results takes the time to perform a statistical analysis (e.g. ANOVA, etc.) to determine whether an observed difference between groups is statistically significant or possibly just due to chance. This simulation also illustrates that small real differences between groups may be impossible to detect given the variability, or noise, from the assignment process. In this simulation, a difference of up to \(\pm 4\%\) could be expected from chance alone, meaning the experimental conditions would need to cause a difference of more than \(4\%\) before we could reliably detect it.
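As a concrete illustration of such a test, here is a minimal sketch for a single random split, using per-player revenue in df. (The post mentions ANOVA; with only two groups this reduces to a t-test, and given how heavy-tailed revenue is, a rank-based alternative is also worth considering.)
# Test one random split for a difference in per-player revenue (sketch)
df$Group <- factor(sample(1:2, nrow(df), replace = TRUE))
t.test(revenue ~ Group, data = df)        # compares group means
wilcox.test(revenue ~ Group, data = df)   # rank-based, robust to heavy tails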