AB testing is a framework for testing whether an alternative strategy is better than the current one at producing a certain effect or achieving a goal. The two strategies are shown to two similar groups, and the impact of each design is measured.
AB testing, or split testing, starts by proposing a new idea, running the experiment, and statistically analyzing the output to see whether the new design is significantly different or not. After deciding which strategy is better, the process continues by exploring and updating the new idea. These updates can be used to make minor improvements before taking the final decision. For example, suppose you are running a website about adopting a cat. The first strategy is a homepage showing a cat. You want to try a different strategy by adding a hat to the cat. With AB testing, we want to know whether the alternative strategy, adding a hat to the cat, is better than a simple cat. One metric for deciding whether one strategy is better than its counterpart is the conversion rate. If someone visits your website and clicks the button to adopt the cat, that visit counts as a conversion. The conversion rate is generally the number of people who clicked the button divided by the number of people who visited the page. For this case, we need two conditions. The first one is the control, where the cat has no additional attributes, and the second one is the test, where the cat wears the additional hat.
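As a small illustration of the conversion rate described above, here is a minimal sketch in R. The visitor and click counts are invented for the cat example and are not taken from any real data.

```r
# Hypothetical visit and click counts, invented purely for illustration.
visitors <- c(control = 1000, test = 1000)
clicks   <- c(control = 120,  test = 150)

# Conversion rate: clicks on the adopt button divided by page visitors.
conversion_rate <- clicks / visitors
conversion_rate

# Absolute lift of the test variant over the control, in percentage points.
lift_pp <- unname((conversion_rate["test"] - conversion_rate["control"]) * 100)
lift_pp
```

Whether a lift like this is meaningful still has to be checked statistically, which is what the rest of the article does for the game data.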
To do AB testing, there are several variables to consider:
The conversion rate of a website is one of the metrics used in AB testing. Understanding the Key Performance Indicator (KPI) for the business case is important, as there are many factors that can be evaluated. Identifying a meaningful KPI is the key to AB testing, since the experiment should run efficiently and still gather sufficient data.
For this article, we will use data from a mobile game named Cookie Cats. The game was created by Tactile Entertainment; the player connects three tiles of the same colour to win the level. The game is filled with a lot of singing cats. Users can see the demo here and see the raw data by clicking here.
library(tidyverse)
cookie_cats <- read_csv("cookie_cats.csv")
glimpse(cookie_cats)
## Rows: 90,189
## Columns: 5
## $ userid <dbl> 116, 337, 377, 483, 488, 540, 1066, 1444, 1574, 1587, 1~
## $ version <chr> "gate_30", "gate_30", "gate_40", "gate_40", "gate_40", ~
## $ sum_gamerounds <dbl> 3, 38, 165, 1, 179, 187, 0, 2, 108, 153, 3, 0, 30, 39, ~
## $ retention_1 <lgl> FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRU~
## $ retention_7 <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, T~
The data covers 90,189 players who took part in the AB test, with the following explanation for each attribute:
userid: a unique number that identifies each player.
version: whether the player was put in the control group (gate_30 - a gate at level 30) or the group with the moved gate (gate_40 - a gate at level 40).
sum_gamerounds: the number of game rounds played by the player during the first 14 days after install.
retention_1: did the player come back and play 1 day after installing?
retention_7: did the player come back and play 7 days after installing?
As they progress through the game, players encounter a gate that forces them to choose between waiting a non-trivial amount of time or making an in-app purchase to continue. The gate is meant to give players a break from playing, which should increase and prolong their enjoyment of the game.
The Key Performance Indicator here is player retention under two different gate placements. The analysis will answer whether the gate is placed at the right level to prolong user retention. The placements at level 30 and level 40 will be analyzed to see the differences. Each player is randomly assigned to either gate 30 or gate 40. Gate 30 acts as the control group while gate 40 acts as the test group. We will first check the proportion of players in each group.
prop.table(table(cookie_cats$version))*100
##
## gate_30 gate_40
## 49.56259 50.43741
The control and test groups are roughly the same size, which is nice! Another thing to check is whether any variable has missing values.
colSums(is.na(cookie_cats))
## userid version sum_gamerounds retention_1 retention_7
## 0 0 0 0 0
There are no missing values in the data, so we can move on to the next part of the analysis.
To see the effect of gate placement, which is the business case, we will look at the distribution of the number of game rounds played during the first 14 days after install.
library(plotly)
fig <- cookie_cats %>%
  plot_ly(
    x = ~version,
    y = ~sum_gamerounds,
    split = ~version,
    type = 'violin',
    box = list(
      visible = TRUE
    ),
    meanline = list(
      visible = TRUE
    )
  )
fig <- fig %>%
  layout(
    xaxis = list(
      title = "Gate"
    ),
    yaxis = list(
      title = "Sum gamerounds",
      zeroline = FALSE
    )
  )
fig
From the plot above, we can see that there are many outliers in the data, and they make the distribution skewed. For example, one player in gate_30 played around 50,000 rounds. For the cleaning, we will remove users who played more than 40,000 rounds and look at the distribution once again.
cookie_cats_clean <- cookie_cats %>%
  filter(sum_gamerounds <= 40000)
library(plotly)
fig2 <- cookie_cats_clean %>%
  plot_ly(
    x = ~version,
    y = ~sum_gamerounds,
    split = ~version,
    type = 'violin',
    box = list(
      visible = TRUE
    ),
    meanline = list(
      visible = TRUE
    )
  )
fig2 <- fig2 %>%
  layout(
    xaxis = list(
      title = "Gate"
    ),
    yaxis = list(
      title = "Sum gamerounds",
      zeroline = FALSE
    )
  )
fig2
Although the distribution is still skewed, the plot is now relatively sensible, so we will keep this data. Next, we will count how many players played each number of rounds.
number_of_games <- cookie_cats_clean %>%
  count(sum_gamerounds)
number_of_games
There were 3,994 players, or 4.4% of all registered players, who did not play a single round. This phenomenon occurred for the following reasons:
We will try to look at the distribution of the first 100 rounds played by each user.
library(ggplot2)
library(dplyr)
library(plotly)
library(hrbrthemes)
p <- number_of_games %>%
  filter(sum_gamerounds <= 100) %>%
  ggplot(aes(x = sum_gamerounds, y = n)) +
  geom_area(fill = "#69b3a2", alpha = 0.5) +
  geom_line(color = "#69b3a2") +
  xlab("Number of rounds") +
  ylab("Number of Player") +
  ggtitle("Number of Player Played The First 100 Rounds") +
  theme_ipsum()
# Turn it interactive with ggplotly
p <- ggplotly(p)
p
The number of players decreases as the number of rounds played increases. This is understandable for the following reasons:
Those three reasons are only a few of the possibilities; many other reasons may exist.
Although the number of players decreases as the number of rounds increases, we can see that quite a few players played many more rounds than the rest. This suggests that these players are hooked on the game.
A mobile game needs to build a player base that keeps playing the game. One metric indicating that a mobile game is successful is 1-day retention, the condition where a player comes back to play the game one day after installing it. The 1-day retention should be high, as it is the basis for building a large player base. We will calculate the percentage of players who were attracted to the game, as indicated by 1-day retention:
prop.table(table(cookie_cats_clean$retention_1))*100
##
## FALSE TRUE
## 55.47856 44.52144
44.5% of players came back one day after installing the game. This is less than half, so the majority of players did not play the game again after 1 day. We can also look at the 1-day retention for each gate group.
ratio_per_group1 <- cookie_cats_clean %>%
  group_by(version, retention_1) %>%
  summarize(count = n()) %>%
  mutate(percentage = round(count/sum(count)*100, 2)) %>%
  ungroup()
ratio_per_group1
For both gates, the 1-day retention is similar, at around 44%. We can also look at retention on a 7-day basis.
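The per-group retention rate can also be read as the mean of a TRUE/FALSE column within each group. The sketch below shows this on a tiny invented data frame whose column names mirror the article's data; the values are not from Cookie Cats.

```r
# Toy data (invented), shaped like the columns used in this article.
toy <- data.frame(
  version     = c("gate_30", "gate_30", "gate_30", "gate_30",
                  "gate_40", "gate_40", "gate_40", "gate_40"),
  retention_1 = c(TRUE, TRUE, FALSE, TRUE,
                  TRUE, FALSE, FALSE, TRUE)
)

# The mean of a logical vector is the share of TRUE values,
# so the per-group mean is the per-group retention rate.
rate_by_gate <- tapply(toy$retention_1, toy$version, mean)
rate_by_gate * 100
```

This is only an alternative view of the same quantity; the grouped count used in the article has the advantage of also showing the raw numbers of players.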
7-day retention is the condition where a player comes back to play the game 7 days after installing it. Let's see the proportion across both gates.
prop.table(table(cookie_cats$retention_7))*100
##
## FALSE TRUE
## 81.39352 18.60648
The retention 7 days after installing the game is much lower. About 81% of players chose to quit the game. One possible cause is that players no longer feel the excitement they felt on the day they installed the game, though there may be other reasons.
ratio_per_group7 <- cookie_cats %>%
  group_by(version, retention_7) %>%
  summarize(count = n()) %>%
  mutate(percentage = round(count/sum(count)*100, 2)) %>%
  ungroup()
ratio_per_group7
For both gates, the 7-day retention is similar. Most players chose to quit the game, for a variety of reasons.
To see the differences between the two groups, we will first analyze the distribution for each group and see how wide the gap between them is.
The percentage of players who returned after one day is slightly higher for gate 30 than for gate 40. The difference is small, about 0.6 percentage points. To get a better sense of how reliable this difference is, we can use bootstrapping.
First, we will split the data into two groups: users in the control group (gate 30) and users in the test group (gate 40).
cookie_cats_clean_30 <- cookie_cats_clean %>%
  dplyr::filter(version == "gate_30")
cookie_cats_clean_40 <- cookie_cats_clean %>%
  dplyr::filter(version == "gate_40")
The bootstrapping procedure resamples the data with replacement. To gain more confidence, the resampling is repeated 10,000 times.
BOOT <- 10000
new_data30 <- NULL
new_data40 <- NULL
set.seed(9999)
for (i in 1:BOOT) {
  n30 <- length(cookie_cats_clean_30$retention_1)
  # Resample the rows of the control group with replacement
  bootmarks30 <- cookie_cats_clean_30[sample(1:n30, replace = TRUE), ]
  new_number30 <- sum(bootmarks30$retention_1 == 'TRUE') / n30
  new_data30 <- c(new_data30, new_number30)
}
for (i in 1:BOOT) {
  n40 <- length(cookie_cats_clean_40$retention_1)
  # Resample the rows of the test group with replacement
  bootmarks40 <- cookie_cats_clean_40[sample(1:n40, replace = TRUE), ]
  new_number40 <- sum(bootmarks40$retention_1 == 'TRUE') / n40
  new_data40 <- c(new_data40, new_number40)
}
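The same resampling idea can be written more compactly with `replicate()`. The following is a minimal sketch on a synthetic TRUE/FALSE vector (invented data, not the Cookie Cats columns), with fewer replicates so it runs quickly; the helper name `boot_rate` is made up for this example.

```r
set.seed(9999)

# Synthetic stand-in for a retention column (invented data, not Cookie Cats).
retained <- rbinom(2000, size = 1, prob = 0.45) == 1

# One bootstrap replicate: resample with replacement, take the share of TRUE.
boot_rate <- function(x) mean(sample(x, size = length(x), replace = TRUE))

# 1,000 replicates keep the sketch fast; the article uses 10,000.
boot_rates <- replicate(1000, boot_rate(retained))

# Percentile interval covering the middle 95% of the bootstrap rates.
quantile(boot_rates, probs = c(0.025, 0.975))
```

Resampling only the retention vector, rather than whole data frame rows, gives the same rates while avoiding the cost of copying every column on each iteration.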
After that, we can see the distribution of the two groups with a ggplot2 function.
library(ggplot2)
library(hrbrthemes)
library(viridis)
data_retention_1 <- data.frame(
  gate = c(rep("Gate 30", length(new_data30)), rep("Gate 40", length(new_data40))),
  value = c(new_data30, new_data40)
)
data_retention_1 %>%
  ggplot(aes(x = as.numeric(value), fill = gate)) +
  geom_density(color = "#e9ecef", alpha = 0.7) +
  scale_fill_manual(values = c("#69b3a2", "#404080")) +
  xlab("1-Day Retention Rate") +
  ggtitle("One Day Retention Rate Distribution") +
  theme_ipsum() +
  labs(fill = "")
From the plot above, we can see the difference between the two groups. We can also look at the size of the difference in percentage terms. To do this, we make a new column holding the absolute value of the difference between the bootstrapped retention rates of gate 30 and gate 40.
data_difference1 <- as.data.frame(cbind(new_data30, new_data40)) %>%
  mutate(diff1 = round(abs(new_data30 - new_data40)*100, 2))
head(data_difference1)
We can also use the same technique to see the distribution of the difference.
data_difference1 %>%
  ggplot(aes(x = diff1)) +
  geom_density(color = "#e9ecef", fill = "#c90076", alpha = 0.7) +
  scale_fill_manual(values = "#8fce00") +
  xlab("1-Day Retention Rate Diff") +
  ggtitle("One Day Gap Retention Rate Distribution") +
  theme_ipsum() +
  labs(fill = "")
From the plot above, the difference between the control and test groups lies between 0% and 2%, and is most often between 0.5% and 0.75%. To sum up, let's count how often the difference is exactly 0% and how often it is not. We do this to see whether the gate placement has an effect.
data_difference1 %>%
  count(diff1 == 0)
From the result above, the number of replicates showing zero effect of gate placement is small compared to the number showing a non-zero difference. The 10,000 bootstrap replicates give additional information about how large the difference is.
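Because the absolute difference hides which gate is ahead, a complementary summary is the share of bootstrap replicates in which gate 30 has the higher rate. The sketch below is an assumption-laden shortcut, not the article's procedure: it uses the returned and total player counts that appear later in the article's confidence interval calculation, and it draws binomial counts instead of resampling rows, which has the same distribution for a TRUE/FALSE column.

```r
set.seed(9999)
B <- 10000

# Resampling a TRUE/FALSE column with replacement and taking the share of TRUE
# has the same distribution as a binomial draw divided by the group size.
# Counts are the 1-day figures used in the article's later calculation.
boot30 <- rbinom(B, size = 44699, prob = 20034 / 44699) / 44699
boot40 <- rbinom(B, size = 45489, prob = 20119 / 45489) / 45489

# Signed difference in percentage points: positive means gate 30 is ahead.
diff_pp <- (boot30 - boot40) * 100

# Share of replicates in which gate 30 retains more players than gate 40.
share_gate30_ahead <- mean(diff_pp > 0)
share_gate30_ahead
```

A share close to 1 would indicate that gate 30 is ahead in nearly every replicate, while a share near 0.5 would indicate that the direction of the difference is essentially a coin flip.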
The 1-day retention analysis shows that the retention rate is higher when the gate is placed at level 30. A question may arise: “Are you sure the players have reached level 30 after playing for just one day?” This question is crucial, since many players will not have been affected by the level 30 gate after a single day of play. As the next step, we have to consider 7-day retention, since by then players are more likely to have reached level 30 or level 40 and so to have met the gate in either group. The early analysis shows that gate 30 has higher 7-day retention than gate 40.
ratio_per_group7
The difference, about 0.82 percentage points (19.02% versus 18.20%), is wider than in 1-day retention. Let's visualize the distribution of the data by using bootstrapping.
BOOT <- 10000
new_data30_7 <- NULL
new_data40_7 <- NULL
set.seed(9999)
for (i in 1:BOOT) {
  n30_7 <- length(cookie_cats_clean_30$retention_7)
  bootmarks30_7 <- cookie_cats_clean_30[sample(1:n30_7, replace = TRUE), ]
  new_number30_7 <- sum(bootmarks30_7$retention_7 == 'TRUE') / n30_7
  new_data30_7 <- c(new_data30_7, new_number30_7)
}
for (i in 1:BOOT) {
  n40_7 <- length(cookie_cats_clean_40$retention_7)
  bootmarks40_7 <- cookie_cats_clean_40[sample(1:n40_7, replace = TRUE), ]
  # Divide by n40_7, the size of this group, to get the retention rate
  new_number40_7 <- sum(bootmarks40_7$retention_7 == 'TRUE') / n40_7
  new_data40_7 <- c(new_data40_7, new_number40_7)
}
data_retention_7 <- data.frame(
  gate = c(rep("Gate 30", length(new_data30_7)), rep("Gate 40", length(new_data40_7))),
  value = c(new_data30_7, new_data40_7)
)
data_retention_7 %>%
  ggplot(aes(x = as.numeric(value), fill = gate)) +
  geom_density(color = "#e9ecef", alpha = 0.7) +
  scale_fill_manual(values = c("#69b3a2", "#404080")) +
  xlab("7-Day Retention Rate") +
  ggtitle("Seven Days Retention Rate Distribution") +
  theme_ipsum() +
  labs(fill = "")
The gap between the two groups is wider for 7-day retention. It shows that the retention rate for gate 30 is higher than for gate 40. We can also see the size of the difference in percentage terms by calculating the gap between the two groups.
data_difference7 <- as.data.frame(cbind(new_data30_7, new_data40_7)) %>%
  mutate(diff1 = round(abs(new_data30_7 - new_data40_7)*100, 2))
head(data_difference7)
data_difference7 %>%
  ggplot(aes(x = diff1)) +
  geom_density(color = "#a64d79", fill = "#c90076", alpha = 0.7) +
  scale_fill_manual(values = "#8fce00") +
  xlab("7-Day Retention Rate Diff") +
  ggtitle("Seven Days Gap Retention Rate Distribution") +
  theme_ipsum() +
  labs(fill = "")
The 7-day retention rate gap is concentrated between 0.75% and 1%. It shows a clear gap between the gate at level 30 and the gate at level 40.
There are three things to include in an AB test: the significance of the difference between the control and test groups, the confidence interval, and the statistical power of the experiment. These three things are crucial for judging how well the AB test was run and how we evaluate new ideas in the future.
We analyzed the gap between the control and test groups earlier by looking at the distributions. To be more confident about the gap, we will use a statistical test to see whether the test group differs from the control group. To do this, we will use the chi-square test with a threshold of 0.05. For the difference to count as significant, the p-value for both the 1-day and the 7-day retention rate should be lower than 0.05. The hypotheses for the significance test are described as follows:
\(H_{0}\): There is no significant difference between the control and test groups
\(H_{1}\): There is a significant difference between the control and test groups
The chi-squared test for the 1-day retention rate is written as follows:
chi_square1 <- table(cookie_cats_clean$version, cookie_cats_clean$retention_1)
chisq.test(chi_square1)
<- table(cookie_cats_clean$version, cookie_cats_clean$retention_1)
chi_square1
chisq.test(chi_square1)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: chi_square1
## X-squared = 3.1698, df = 1, p-value = 0.07501
The chi-squared test for the 7-day retention rate is calculated as follows:
chi_square7 <- table(cookie_cats_clean$version, cookie_cats_clean$retention_7)
chisq.test(chi_square7)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: chi_square7
## X-squared = 9.9153, df = 1, p-value = 0.001639
From the two results above, the p-value falls below 0.05 only for the 7-day retention rate; for the 1-day retention rate it is higher than 0.05. This means the difference between the control and test groups in 1-day retention is not significant at the 5% significance level. A likely reason is that players might not reach level 30 within one day of play. If a player has not reached level 30, they will not feel the effect of the gate placement. In contrast, the difference in 7-day retention between gate 30 and gate 40 is significant at the 5% significance level.
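For a 2x2 table, the same comparison can also be expressed as a two-sample test of proportions with `prop.test()`, which applies the same continuity correction by default. The sketch below is self-contained and uses the returned and total player counts that appear in the article's confidence interval calculation; the variable names are made up for this example.

```r
# Counts of returning players and group sizes, read from the article's
# confidence interval calculation (gate_30 first, gate_40 second).
returned_1 <- c(gate_30 = 20034, gate_40 = 20119)
returned_7 <- c(gate_30 = 8502,  gate_40 = 8279)
players_1  <- c(gate_30 = 24665 + 20034, gate_40 = 25370 + 20119)
players_7  <- c(gate_30 = 36198 + 8502,  gate_40 = 37210 + 8279)

# Two-sample tests of equal proportions (continuity correction on by default).
test_1 <- prop.test(returned_1, players_1)
test_7 <- prop.test(returned_7, players_7)

# p-values for 1-day and 7-day retention, to compare with the 0.05 threshold.
c(day_1 = test_1$p.value, day_7 = test_7$p.value)
```

One convenience of `prop.test()` is that it works directly from counts, so the comparison can be repeated without access to the player-level data.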
We already know the gap between the control and test groups from simply subtracting one value from the other. For example, for the 1-day retention rate the difference between the two groups is 44.82% - 44.23% = 0.59%, while for the 7-day retention rate the difference is 19.02% - 18.20% = 0.82%. From these single numbers for each condition, we can also calculate a confidence interval, which gives more information about the range of the effect of the variant, that is, the test group.
library(knitr)
include_graphics("confidence_interval.jpg")
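Since the formula is only available as an image, the following is a reconstruction based on the notation explained below and on the R code used later in this section. It is a sketch of the pooled two-proportion interval and may be laid out differently from the original figure.

\[
\hat{p}_{pool} = \frac{X_c + X_v}{N_c + N_v}, \qquad
SE_{pool} = \sqrt{\hat{p}_{pool}\,(1 - \hat{p}_{pool})\left(\frac{1}{N_c} + \frac{1}{N_v}\right)}
\]

\[
\hat{d} = \hat{p}_v - \hat{p}_c, \qquad
m = 1.96 \cdot SE_{pool}, \qquad
CI = \left[\hat{d} - m,\; \hat{d} + m\right]
\]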
In the notation above, p is the probability (here, the retention rate) for the control and the test/variant group. The control is the gate placed at level 30, while the variant is the gate placed at level 40. Here, we will use the retention rates that we obtained earlier. X is the number of players who returned and N is the total number of players in a group, so Xv and Nv refer to the variant group while Xc and Nc refer to the control group. m is the margin of error, which defines the range of the confidence interval. The hats on d and p mark them as estimated values. We will use 1.96 as the z-score for a 95% confidence interval.
d <- 0.4423 - 0.4482
p_pool <- (20034 + 20119)/(24665 + 20034 + 25370 + 20119)
se_pool <- sqrt(p_pool * (1 - p_pool) * (1/(24665 + 20034) + 1/(25370 + 20119)))
m <- 1.96 * se_pool
ci_retention1_lower <- d - m
ci_retention1_upper <- d + m
ci_retention1_lower*100
## [1] -1.238747
ci_retention1_upper*100
## [1] 0.05874727
We get a confidence interval between -1.24% and 0.06%. With 95% confidence, moving the gate to level 40 changes the 1-day retention rate by somewhere between -1.24 and 0.06 percentage points. Since 0% lies inside this range, the change is not worth putting into production on the basis of 1-day retention alone: the gate placement might have no effect on it.
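The calculation above can be wrapped in a small reusable function. This is a minimal sketch with a hypothetical helper name, `diff_prop_ci`; it computes the difference directly from the counts rather than from the rounded percentages, so its result differs slightly in the last decimals from the numbers above.

```r
# Hypothetical helper: 95% interval for the difference in two proportions
# (variant minus control) using the pooled standard error.
diff_prop_ci <- function(x_c, n_c, x_v, n_v, z = 1.96) {
  d       <- x_v / n_v - x_c / n_c
  p_pool  <- (x_c + x_v) / (n_c + n_v)
  se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_v))
  c(lower = d - z * se_pool, estimate = d, upper = d + z * se_pool)
}

# 1-day retention counts as used in the article (gate 30 = control).
ci_1 <- diff_prop_ci(x_c = 20034, n_c = 24665 + 20034,
                     x_v = 20119, n_v = 25370 + 20119)

# Express the interval in percentage points.
round(ci_1 * 100, 2)
```

Working from counts keeps the difference and the standard error on the same scale, which avoids mixing percentages and proportions.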
We can do the same calculation with the 7-day retention rate.
# Rates are entered as proportions (not percentages) so that d7 is on the
# same scale as the pooled standard error
d7 <- 0.1820 - 0.1902
p_pool7 <- (8502 + 8279)/(36198 + 8502 + 37210 + 8279)
se_pool7 <- sqrt(p_pool7 * (1 - p_pool7) * (1/(36198 + 8502) + 1/(37210 + 8279)))
m7 <- 1.96 * se_pool7
ci_retention1_lower7 <- d7 - m7
ci_retention1_upper7 <- d7 + m7
round(ci_retention1_lower7*100, 2)
## [1] -1.33
round(ci_retention1_upper7*100, 2)
## [1] -0.31
For the 7-day retention rate, the confidence interval shows that, with 95% confidence, placing the gate at level 40 lowers the retention rate by between 0.31 and 1.33 percentage points. The whole interval lies below 0%, which is consistent with the significant chi-squared test result.
Statistical power is the probability that the experiment detects a “true” effect when the effect actually exists. To calculate the statistical power of an experiment, we can use power analysis with the following formula:
include_graphics("power_analysis.jpg")
d denotes how large an improvement we consider meaningful in the experiment. Suppose that, for the retention rate, we focus on a 1% change between the gate placements, with a confidence level of 95%. We have seen that the effect of gate placement on 1-day retention is not significant, so here we will focus on the 7-day retention rate. The power calculation proceeds as follows:
theta <- sqrt(0.1902*(1-0.1902))
z <- (sqrt(min(24665 + 20034, 25370 + 20119)) * 0.01)/(2 * theta) - 1.96
pnorm(z)
## [1] 0.768388
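Written as a formula, the R code above computes the following, where n is the size of the smaller group, p is the baseline retention rate (0.1902), d is the minimum effect of interest (0.01) and \(\Phi\) is the standard normal cumulative distribution function. This is read off the code rather than the original figure, so the figure may be laid out differently.

\[
\theta = \sqrt{p\,(1 - p)}, \qquad
z = \frac{\sqrt{n}\; d}{2\,\theta} - 1.96, \qquad
\text{power} = \Phi(z)
\]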
The power of the experiment is about 77%. This means that, if the gate placement truly has an effect of the assumed size, the experiment has a 77% chance of detecting it. As the rule of thumb for a good experiment is a power of 80%, our experiment falls slightly short of that threshold: it is reasonably, though not ideally, powered to detect the effect of gate placement after 7 days of playing the game.
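To see how the power changes with the effect size we consider meaningful, the calculation can be wrapped in a function. This is a minimal sketch that reuses the article's formula as-is; the function name `approx_power` and the alternative effect sizes are assumptions made for illustration.

```r
# Hypothetical helper reproducing the article's power calculation.
# n: size of the smaller group, d: minimum effect of interest,
# p: baseline retention rate, z_crit: z-score for the confidence level.
approx_power <- function(n, d, p, z_crit = 1.96) {
  theta <- sqrt(p * (1 - p))
  pnorm(sqrt(n) * d / (2 * theta) - z_crit)
}

# Same inputs as in the article: smaller group size and 7-day rate of gate 30.
n_small <- min(24665 + 20034, 25370 + 20119)
power_1pct <- approx_power(n = n_small, d = 0.01, p = 0.1902)
power_1pct

# Power for a few alternative minimum effects (0.5, 1 and 1.5 points).
power_grid <- sapply(c(0.005, 0.01, 0.015), approx_power,
                     n = n_small, p = 0.1902)
power_grid
```

As expected from the formula, the power increases with the size of the effect: smaller effects are harder to detect with the same number of players.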
There is a significant difference between placing the gate at level 30 and at level 40, and to see it in the retention rate we need to wait for 7 days, as many players will not reach gate 30 after playing for one day. The recommendation for the business is that, if we want to keep retention higher, the gate should stay at level 30 and not be moved to level 40.
The function of the gate is to increase player engagement. The results indicate that placing the gate earlier prolongs the player's engagement with the game. The later the gate is placed, the more obstacles the player faces before getting a break. Players get less enjoyment from an activity that goes on continuously and might get bored, so giving them a break by putting the gate earlier increases the chance that they come back to play again.
https://stats.idre.ucla.edu/other/mult-pkg/seminars/intro-power/
https://medium.com/bukalapak-data/3-things-to-report-in-an-a-b-test-analysis-dd00fa28a97d
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, Second Edition. Hillsdale, New Jersey: Lawrence Erlbaum Associates.