2025-06-06

Introduction

A/B testing is widely used on streaming platforms such as Spotify, Crunchyroll, and Twitch to improve the user experience. Whether a team is testing a new autoplay feature or a UI layout, hypothesis testing helps it make data-driven decisions.

What is Hypothesis Testing

Let \(\mu_A\) and \(\mu_B\) be the mean watch times for groups A and B. We test:

\[H_0: \mu_A = \mu_B \quad \text{vs} \quad H_1: \mu_A \ne \mu_B \]

where \(H_0\) is the null hypothesis (no effect) and \(H_1\) is the alternative hypothesis (an effect exists).
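
For reference, one common statistic for this kind of test is Welch's two-sample t-statistic, which is what R's t.test() computes by default. The symbols \(\bar{x}\), \(s^2\), and \(n\) (each group's sample mean, sample variance, and sample size) are introduced here for illustration:

\[ t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}} \]

With this ordering, a negative \(t\) means group B has the higher sample mean. The larger \(|t|\) is, the stronger the evidence against \(H_0\).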

Simulated A/B Test

We simulate user watch time (in minutes) for 100 users in group A and 100 users in group B. The table below shows a few sample observations from each group to illustrate the structure of the data.

## # A tibble: 6 × 2
## # Groups:   group [2]
##   watch_time group
##        <dbl> <chr>
## 1       54.4 A    
## 2       57.7 A    
## 3       75.6 A    
## 4       57.9 B    
## 5       67.6 B    
## 6       62.5 B
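
The code that produced this table is not shown, so the following is only a minimal sketch of how such data could be simulated. It assumes the same parameters as the rnorm() example later in the deck (means of 60 and 65 minutes, standard deviation 10). The seed and the names `watch` and `preview` are assumptions, so the printed values will not match the table above.

```r
# Sketch: simulate watch times (in minutes) for two groups of 100 users.
set.seed(123)  # assumed seed, used only to make the sketch reproducible

n <- 100  # users per group, as stated in the text
watch <- data.frame(
  watch_time = c(rnorm(n, mean = 60, sd = 10),   # group A
                 rnorm(n, mean = 65, sd = 10)),  # group B
  group = rep(c("A", "B"), each = n)
)

# Show the first three observations from each group
preview <- do.call(rbind, lapply(split(watch, watch$group), head, n = 3))
print(preview)
```

The long format (one row per user, with a `group` column) is convenient because the same data frame can feed both plotting and testing code.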

Distribution of Watch Time

The plot below shows the distribution of watch times for users in groups A and B. Group B, which received the new autoplay feature, appears to have a higher average watch time and a slightly shifted distribution.
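
The original plotting code is not shown. As a rough sketch, a comparable picture can be drawn with base R graphics by overlaying one density curve per group. The seed, the simulation parameters, and the names `watch`, `dens_a`, and `dens_b` are assumptions for illustration.

```r
# Sketch: overlay the watch-time densities of groups A and B.
set.seed(123)  # assumed seed
watch <- data.frame(
  watch_time = c(rnorm(100, mean = 60, sd = 10),   # group A
                 rnorm(100, mean = 65, sd = 10)),  # group B
  group = rep(c("A", "B"), each = 100)
)

# Kernel density estimate for each group
dens_a <- density(watch$watch_time[watch$group == "A"])
dens_b <- density(watch$watch_time[watch$group == "B"])

# Draw group A first, sizing the axes so both curves fit
plot(dens_a,
     xlim = range(dens_a$x, dens_b$x),
     ylim = c(0, max(dens_a$y, dens_b$y)),
     main = "Watch time by group",
     xlab = "Watch time (minutes)")
lines(dens_b, lty = 2)  # group B as a dashed curve

# Vertical lines at the group means (A solid, B dashed)
group_means <- tapply(watch$watch_time, watch$group, mean)
abline(v = group_means, lty = c(1, 2))
legend("topright", legend = c("A", "B"), lty = c(1, 2))
```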

t-Test Results

The t-test results are summarized below:

## t-statistic:  -2.2714
## p-value:  0.024201

The p-value helps us decide whether the observed difference in average watch time between groups A and B is statistically significant.

A low p-value (conventionally below 0.05) means that a difference at least this large would be unlikely if the null hypothesis were true. Here the p-value of about 0.024 is below 0.05, so we reject \(H_0\) and conclude that the new feature (autoplay in this scenario) may have a real effect on user engagement.
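
To make the two reported numbers less of a black box, the sketch below recomputes a Welch t-statistic and its two-sided p-value by hand and compares them with t.test(). The seed is an assumption, so the values will differ from the output shown above.

```r
# Sketch: compute the Welch t-statistic and p-value by hand.
set.seed(123)  # assumed seed
A <- rnorm(100, mean = 60, sd = 10)
B <- rnorm(100, mean = 65, sd = 10)

# Per-group variance of the sample mean
va <- var(A) / length(A)
vb <- var(B) / length(B)

se     <- sqrt(va + vb)              # standard error of the difference
t_stat <- (mean(A) - mean(B)) / se   # same ordering as t.test(A, B)

# Welch-Satterthwaite approximation to the degrees of freedom
df <- (va + vb)^2 / (va^2 / (length(A) - 1) + vb^2 / (length(B) - 1))

# Two-sided p-value: probability of a |t| at least this large under H0
p_value <- 2 * pt(-abs(t_stat), df)

res <- t.test(A, B)  # reference result from R's built-in test
cat("t-statistic: ", round(t_stat, 4), "\n")
cat("p-value: ", signif(p_value, 5), "\n")
```

The hand-computed values should agree with `res$statistic` and `res$p.value` up to floating-point error.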

Confidence Intervals & Effect Size

The confidence interval for the difference in means is:

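\[ CI = (\bar{x}_B - \bar{x}_A) \pm t^* \cdot SE \]

Here \(t^*\) is the critical value of the t distribution for the chosen confidence level and \(SE\) is the standard error of the difference in means. If the interval excludes 0, the difference is statistically significant at the corresponding level.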
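
As a minimal sketch, the interval can be computed directly from the formula. The example assumes a 95% confidence level, Welch's standard error and degrees of freedom, and an arbitrary seed. It also computes Cohen's d with a pooled standard deviation, which is one common effect-size measure and not necessarily the one the author had in mind.

```r
# Sketch: 95% confidence interval for mean(B) - mean(A), plus Cohen's d.
set.seed(123)  # assumed seed
A <- rnorm(100, mean = 60, sd = 10)
B <- rnorm(100, mean = 65, sd = 10)

diff_means <- mean(B) - mean(A)

# Welch standard error and degrees of freedom
va <- var(A) / length(A)
vb <- var(B) / length(B)
se <- sqrt(va + vb)
df <- (va + vb)^2 / (va^2 / (length(A) - 1) + vb^2 / (length(B) - 1))

t_star <- qt(0.975, df)                        # critical value for 95%
ci     <- diff_means + c(-1, 1) * t_star * se  # lower and upper bound

# Cohen's d: difference in means in units of the pooled standard deviation
pooled_sd <- sqrt(((length(A) - 1) * var(A) + (length(B) - 1) * var(B)) /
                  (length(A) + length(B) - 2))
cohens_d <- diff_means / pooled_sd

print(ci)
print(cohens_d)
```

Note that t.test(A, B) reports the interval for A minus B. Calling t.test(B, A) gives the interval in the B minus A direction used by the formula.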

Plotly 3D Plot

This 3D plot shows the relationship between user age, watch time, and A/B group. It provides a visual way to explore how user engagement varies across different demographics and experimental conditions.
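
The plotting code is not included in the deck. The sketch below shows one way such a figure could be built with plotly's plot_ly(). The `age` column is hypothetical (drawn uniformly between 18 and 65), the group is encoded as 0/1 on the z-axis, and the seed is arbitrary. The plot is only built if the plotly package is installed.

```r
# Sketch: 3D scatter of age, watch time, and A/B group.
set.seed(123)  # assumed seed
n <- 100
users <- data.frame(
  age        = round(runif(2 * n, min = 18, max = 65)),  # hypothetical ages
  watch_time = c(rnorm(n, mean = 60, sd = 10),           # group A
                 rnorm(n, mean = 65, sd = 10)),          # group B
  group      = rep(c("A", "B"), each = n)
)
users$group_code <- as.numeric(users$group == "B")  # 0 = A, 1 = B

fig <- NULL
if (requireNamespace("plotly", quietly = TRUE)) {
  fig <- plotly::plot_ly(
    users,
    x = ~age, y = ~watch_time, z = ~group_code,
    color = ~group,
    type = "scatter3d", mode = "markers"
  )
  fig <- plotly::layout(fig, scene = list(
    xaxis = list(title = "Age"),
    yaxis = list(title = "Watch time (min)"),
    zaxis = list(title = "Group (0 = A, 1 = B)")
  ))
}
# In an interactive session, print `fig` to render the plot.
```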

R code Example

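# Basic two-sample t-test between groups (simulated watch times, in minutes)
A <- rnorm(100, mean = 60, sd = 10)  # group A
B <- rnorm(100, mean = 65, sd = 10)  # group B (new autoplay feature)
t.test(A, B)  # two-sided Welch test by default (equal variances not assumed)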

The rnorm() function generates random values from a normal distribution. In this example, rnorm(100, 60, 10) creates 100 simulated watch times for group A with a mean of 60 minutes and a standard deviation of 10.
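
As a quick illustration, the sample produced by rnorm() will have a mean and standard deviation close to, but not exactly equal to, the requested values. The seed below is an arbitrary assumption.

```r
# Sketch: sample statistics differ slightly from the requested parameters.
set.seed(1)  # assumed seed
x <- rnorm(100, mean = 60, sd = 10)

length(x)  # number of simulated watch times
mean(x)    # close to 60, but not exactly 60
sd(x)      # close to 10, but not exactly 10
```

This sampling variation is the reason a formal test is needed: two groups drawn from the same distribution would still show slightly different sample means.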

Conclusion

In this simulated example, the new autoplay feature is associated with a statistically significant increase in average watch time (p ≈ 0.024). The example shows how statistical hypothesis testing and data visualization can help teams evaluate and improve features on streaming platforms.

Thanks for viewing!