Reading the clicks dataset
clicks <- read.csv("click_data.csv")
head(clicks)
##   visit_date clicked_adopt_today
## 1 2017-01-01                   1
## 2 2017-01-02                   1
## 3 2017-01-03                   0
## 4 2017-01-04                   1
## 5 2017-01-05                   1
## 6 2017-01-06                   0
The dataset, which comes from a dog adoption website, has two columns and 3650 rows. The first column is the visit date and the second records whether the ADOPT TODAY button on the webpage was clicked (0 = not clicked, 1 = clicked).
Find oldest and most recent date
min(as.Date(clicks$visit_date))
## Warning in strptime(xx, f <- "%Y-%m-%d", tz = "GMT"): unknown timezone
## 'zone/tz/2018i.1.0/zoneinfo/America/New_York'
## [1] "2017-01-01"
max(as.Date(clicks$visit_date))
## [1] "2017-12-31"
So the dataset covers 1 January 2017 through 31 December 2017.
What works better
To run the A/B test we need a control, the current photo of a still dog on the website, and a test, the new photo of a playful dog. We want to decide which photo should stay on the website homepage. I expect the playful dog photo to produce more ADOPT TODAY clicks, i.e. a higher conversion rate.
Conversion rate is the number of people clicking the ADOPT TODAY button divided by the total people visiting the webpage.
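As a small illustration of this definition, the sketch below computes the conversion rate for the six rows printed by head(clicks) above. The vector name toy_clicks is made up for this example; it is not part of the original analysis.

```r
# Illustrative only: the six clicked_adopt_today values shown by head(clicks)
toy_clicks <- c(1, 1, 0, 1, 1, 0)

# Conversion rate = visitors who clicked / all visitors
conversion_rate <- sum(toy_clicks) / length(toy_clicks)

# For a 0/1 column the mean is the same quantity, which is why
# mean(clicked_adopt_today) is used for the conversion rates below
same_as_mean <- isTRUE(all.equal(conversion_rate, mean(toy_clicks)))

conversion_rate
same_as_mean
```

Four of the six visits ended in a click, so the rate for this tiny sample is 4/6.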
Question: Will changing the homepage photo result in more “ADOPT TODAY” clicks?
Hypothesis: Using a photo of a playful dog will result in more “ADOPT TODAY!” clicks
Dependent variable: Clicked “ADOPT TODAY!” button or not
Independent variable: Homepage photo.
Baseline conversion rates
Baseline conversion rates must be known in order to decide whether the control is better than the test or not. Let's find them from the clicks dataset.
Yearly conversions
library(dplyr)  # provides %>%, group_by() and summarise()
clicks %>% summarise(conversion_rate = mean(clicked_adopt_today))
##   conversion_rate
## 1       0.2772603
The yearly conversion rate for the control page is around 28%. But maybe there are some months in which people adopt more. Let's find out.
Monthly conversions
library(lubridate)  # provides month(), week() and day()
clicks %>%
  group_by(month(visit_date)) %>%
  summarise(conversion_rate = mean(clicked_adopt_today))
## # A tibble: 12 x 2
##    `month(visit_date)` conversion_rate
##                  <dbl>           <dbl>
##  1                  1.           0.197
##  2                  2.           0.189
##  3                  3.           0.145
##  4                  4.           0.150
##  5                  5.           0.258
##  6                  6.           0.333
##  7                  7.           0.348
##  8                  8.           0.542
##  9                  9.           0.293
## 10                 10.           0.161
## 11                 11.           0.233
## 12                 12.           0.465
Plotting the conversion rate for each month shows whether there is any seasonal effect.
library(plotly)
# Named `monthly` rather than `c` to avoid masking base R's c()
monthly <- clicks %>%
  group_by(month(visit_date)) %>%
  summarise(conversion_rate = mean(clicked_adopt_today))
# ggplot2 alternative, left commented out:
# ggplot(monthly, aes(x = `month(visit_date)`, y = conversion_rate)) +
#   geom_point() +
#   geom_line(color = 'lightseagreen')
p1 <- plot_ly(monthly, x = ~`month(visit_date)`, y = ~conversion_rate, type = 'scatter',
              mode = 'lines+markers', color = I("lightseagreen"))
p1
clicks %>%
  group_by(weekdays(as.Date(visit_date))) %>%
  summarise(conversion_rate = mean(clicked_adopt_today)) %>%
  arrange(desc(conversion_rate))
## # A tibble: 7 x 2
##   `weekdays(as.Date(visit_date))` conversion_rate
##   <chr>                                     <dbl>
## 1 Sunday                                    0.300
## 2 Wednesday                                 0.298
## 3 Monday                                    0.277
## 4 Thursday                                  0.271
## 5 Tuesday                                   0.271
## 6 Friday                                    0.267
## 7 Saturday                                  0.256
As we can see, the conversion rate does not change much across the days of the week: it ranges from about 26% on Saturday to 30% on Sunday.
clicks %>%
  group_by(week(visit_date)) %>%
  summarise(conversion_rate = mean(clicked_adopt_today))
## # A tibble: 53 x 2
##    `week(visit_date)` conversion_rate
##                 <dbl>           <dbl>
##  1                 1.           0.229
##  2                 2.           0.243
##  3                 3.           0.171
##  4                 4.           0.129
##  5                 5.           0.157
##  6                 6.           0.186
##  7                 7.           0.257
##  8                 8.           0.171
##  9                 9.           0.186
## 10                10.           0.200
## # ... with 43 more rows
library(ggplot2)
library(scales)  # provides the percent label formatter
w <- clicks %>%
  group_by(week(visit_date)) %>%
  summarise(conversion_rate = mean(clicked_adopt_today))
# Static ggplot2 version (stored in g, not printed)
g <- ggplot(w, aes(x = `week(visit_date)`, y = conversion_rate)) +
  geom_point() +
  geom_line() + geom_path(color = 'peru', size = 1) +
  scale_y_continuous(limits = c(0, 1),
                     labels = percent) + xlab('Week') + ylab('Conversion Rate')
# Interactive plotly version (this is the one displayed)
p <- plot_ly(w, x = ~`week(visit_date)`, y = ~conversion_rate, type = 'scatter', mode = 'lines+markers')
p
The plot shows the seasonal conversion rates by week of the year.
Power Analysis
Statistical test - statistical test you plan to run
Baseline value - value for the current control condition
Desired value - expected value for the test condition
Proportion of the data from the test condition (ideally 0.5)
Significance threshold / alpha - level where effect is significant (generally 0.05)
Power / 1 - Beta - probability of correctly rejecting the null hypothesis (generally 0.8)
Number of samples/ data points to run the A/B test
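library(powerMediation)  # provides SSizeLogisticBin()
total_sample_size <- SSizeLogisticBin(p1 = 0.2,    # baseline conversion rate
                                      p2 = 0.3,    # desired conversion rate
                                      B = 0.5,     # proportion of data in the test condition
                                      alpha = 0.05,
                                      power = 0.8)
total_sample_size
## [1] 587
total_sample_size/2
## [1] 293.5
Thus we require 587 data points in total, which rounds up to 294 each for test and control.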
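As a rough sanity check on this number, the same inputs can be passed to power.prop.test() from base R's stats package. This is a different method (a two-sided two-sample test of proportions rather than logistic regression) and is not part of the original analysis, so small differences from the SSizeLogisticBin() result are to be expected. The variable names below are illustrative.

```r
# Cross-check with base R, assuming a two-sided two-sample test of
# proportions instead of the logistic-regression power calculation
pp <- power.prop.test(p1 = 0.2,         # baseline conversion rate
                      p2 = 0.3,         # desired conversion rate
                      sig.level = 0.05, # alpha
                      power = 0.8)

# power.prop.test() reports n per group, so round up to whole visitors
n_per_group <- ceiling(pp$n)
n_total <- 2 * n_per_group

n_per_group
n_total
```

If the two approaches agree to within a few observations per group, that suggests the sample size is not an artifact of one particular formula.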
Now, after running the experiment and collecting the data, we will try to find out whether there is a statistically significant difference between test and control. The data are saved in experiment_data.csv.
Let's load the data and analyze it.
exp <- read.csv("experiment_data.csv")
head(exp)
##   visit_date condition clicked_adopt_today
## 1 2018-01-01   control                   0
## 2 2018-01-01   control                   1
## 3 2018-01-01   control                   0
## 4 2018-01-01   control                   0
## 5 2018-01-01      test                   0
## 6 2018-01-01      test                   0
As we can observe, the dataset has one more column, condition, which tells us whether the visitor saw the control page or the test page.
Finding the conversion rates for the control and test conditions
exp %>%
  group_by(condition) %>%
  summarise(conversion_rate = mean(clicked_adopt_today))
## # A tibble: 2 x 2
##   condition conversion_rate
##   <fct>               <dbl>
## 1 control             0.167
## 2 test                0.384
As observed, the conversion rate for control is around 17% and test is 38%. This difference looks large.
Plotting the control and test conditions
expp <- exp %>%
  group_by(day(visit_date), condition) %>%
  summarise(conversion_rate = mean(clicked_adopt_today))
ggplot(expp, aes(x = `day(visit_date)`,
                 y = conversion_rate,
                 color = condition,
                 group = condition)) +
  geom_point() +
  geom_line()
Thus, from the plot we can observe that the test condition almost always has a higher conversion rate than the control.
Statistical analysis
As the dependent variable ‘clicked_adopt_today’ is binary, we will use logistic regression for statistical analysis
library(broom)  # provides tidy()
glm(clicked_adopt_today ~ condition,
    family = "binomial",
    data = exp) %>%
  tidy()
##            term  estimate std.error  statistic      p.value
## 1   (Intercept) -1.609438 0.1564922 -10.284464 8.280185e-25
## 2 conditiontest  1.138329 0.1971401   5.774212 7.731397e-09
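The estimates above are on the log-odds scale, so they are easier to read once converted back. The sketch below does that by hand with the numbers copied from the table; the variable names are illustrative, and the interval is a simple Wald approximation rather than anything reported in the original output.

```r
# Estimates copied from the tidy() output above (log-odds scale)
b_intercept <- -1.609438   # control condition
b_test      <-  1.138329   # additional log-odds for the test condition
se_test     <-  0.1971401  # standard error of b_test

# plogis() is the inverse logit: it maps log-odds back to a probability
p_control <- plogis(b_intercept)
p_test    <- plogis(b_intercept + b_test)

# Exponentiating the coefficient gives the odds ratio of test vs. control
odds_ratio <- exp(b_test)

# Approximate 95% Wald interval for the odds ratio (an assumption of
# this sketch, not part of the original analysis)
or_ci <- exp(b_test + c(-1, 1) * qnorm(0.975) * se_test)

round(c(control = p_control, test = p_test), 3)
odds_ratio
or_ci
```

The recovered probabilities should match the conversion rates computed earlier (about 17% for control and 38% for test). The odds ratio of roughly 3.1, together with a p-value far below 0.05 for conditiontest, indicates that the playful photo is associated with significantly higher odds of an ADOPT TODAY click.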