A/B Testing
Introduction
An A/B test is like a science experiment. You split people into two groups. One group uses the regular version of a website or app (that’s the control group). The other group tries out a new version with something changed (that’s the test group).
Then you watch what happens. Do people in the new version click more? Buy more? If they do, the new version might be better. It’s a simple way to figure out what changes work best.
Loading the packages
With the required packages loaded, we can read the data into R for our analysis.
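A minimal sketch of this step (the file name and exact package list are assumptions; the data frame name `ab_data` and the column names match those referenced later in the analysis):

```r
# Packages used throughout the analysis
library(tidyverse)   # data wrangling and ggplot2 visuals
library(effsize)     # Cohen's d effect size
library(rstanarm)    # Bayesian regression

# Read the experiment data; the file name is an assumption
ab_data <- read_csv("ab_data.csv")
glimpse(ab_data)
```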
Conversion rate metrics
We need to create some derived variables, such as the click-through rate (CTR), the purchase rate, and a measure of the return on the intervention.
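A sketch of how these metrics might be derived; the raw count columns (`impressions`, `purchases`, `spend`) are assumptions, while `purchase_rate` and `number_of_website_clicks` appear later in the analysis:

```r
# Derive per-observation conversion metrics from the raw counts
ab_data <- ab_data %>%
  mutate(
    click_through_rate = number_of_website_clicks / impressions,  # CTR
    purchase_rate      = purchases / impressions,                 # conversion
    return_on_spend    = purchases / spend  # crude return proxy; a revenue
                                            # column would give a true ROI
  )
```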
Conversion Funnel Drop-Off
The funnel plot shows how users from each group moved through key steps like impressions, clicks, and purchases. While the control group had more total impressions, the test group performed competitively at later stages like clicks.
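One way such a funnel view could be built (the `impressions` and `purchases` columns are assumptions):

```r
# Total users at each funnel stage, per group, on a log scale
ab_data %>%
  group_by(group) %>%
  summarise(impressions = sum(impressions),
            clicks      = sum(number_of_website_clicks),
            purchases   = sum(purchases)) %>%
  pivot_longer(-group, names_to = "stage", values_to = "users") %>%
  mutate(stage = factor(stage, levels = c("impressions", "clicks", "purchases"))) %>%
  ggplot(aes(stage, users, fill = group)) +
  geom_col(position = "dodge") +
  scale_y_log10() +
  labs(title = "Conversion funnel by group", x = NULL, y = "Users (log scale)")
```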
Time-Series Trends by Group
Having shown how users in each group move through the funnel, we now focus on purchases. The plot above shows how purchase rates varied daily for each group. The test group (blue) consistently outperforms the control group (red) on many days, with sharper spikes.
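A sketch of the code behind such a plot, assuming a `date` column records the day of each observation:

```r
# Daily purchase rate per group, coloured as in the text (control = red, test = blue)
ggplot(ab_data, aes(date, purchase_rate, colour = group)) +
  geom_line() +
  scale_colour_manual(values = c(control = "red", test = "blue")) +
  labs(title = "Daily purchase rate by group", x = NULL, y = "Purchase rate")
```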
Segment-Level Comparison (CTR split)
We also visualized the purchase rate by click-through rate (CTR). As shown above, the plot compares purchase rates across users with low vs. high CTR. The test group shows a noticeably higher purchase rate among high-CTR users; in contrast, the difference is much smaller for low-CTR users.
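A sketch of the segment comparison, assuming a median split on CTR (the split rule is an assumption):

```r
# Median split on click-through rate, then compare purchase rates by segment
ab_data %>%
  mutate(ctr_segment = if_else(click_through_rate > median(click_through_rate),
                               "high CTR", "low CTR")) %>%
  group_by(group, ctr_segment) %>%
  summarise(mean_purchase_rate = mean(purchase_rate), .groups = "drop")
```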
Spend Comparison Visualizations
Uplift Calculation
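The relative uplift is the difference in group means divided by the control mean; a minimal sketch that reproduces the value below:

```r
# Relative uplift: (mean test - mean control) / mean control
means <- tapply(ab_data$purchase_rate, ab_data$group, mean)
unname((means["test"] - means["control"]) / means["control"])
```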
[1] 0.6487367
Before running formal significance tests, we assess whether the purchase rate increased following the intervention. Our analysis shows that the mean purchase rate for the test group was about 65% higher than that of the control group. This is a substantial increase and suggests the test variant had a strong positive impact on user behavior.
1. Basic A/B Test – Mean Comparisons
The t-test compares average purchase rates between the test and control groups. A statistically significant result would indicate that the change had a measurable impact. The p-value in the output below (0.003) is less than 0.05, suggesting the difference is unlikely to be due to chance.
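The test is a one-liner; `t.test()` with a formula defaults to the Welch (unequal-variance) version shown below:

```r
# Welch two-sample t-test of purchase rate by group
t.test(purchase_rate ~ group, data = ab_data)
```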
Welch Two Sample t-test
data: purchase_rate by group
t = -3.1475, df = 41.361, p-value = 0.003051
alternative hypothesis: true difference in means between group control and group test is not equal to 0
95 percent confidence interval:
-0.005441621 -0.001188596
sample estimates:
mean in group control mean in group test
0.005110098 0.008425207
The group comparison shows that the test group has a higher median purchase rate and more variability than the control group, suggesting that the change may be effective. To confirm that this difference holds without the t-test's normality assumption, we rely on the Wilcoxon test, which we implement next.
2. Non-Parametric Test
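The Wilcoxon rank-sum test compares the two groups without assuming normality; a minimal sketch of the call behind the output below:

```r
# Wilcoxon (Mann-Whitney) rank-sum test, a distribution-free alternative to the t-test
wilcox.test(purchase_rate ~ group, data = ab_data)
```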
Wilcoxon rank sum exact test
data: purchase_rate by group
W = 261, p-value = 0.004757
alternative hypothesis: true location shift is not equal to 0
Our non-parametric test confirms that there is a statistically significant difference in purchase rates between the control and test groups (p = 0.005).
3. Effect Size (Cohen’s d)
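The output format below matches the effsize package, so the call was presumably something like:

```r
# Standardized mean difference with a 95% CI (effsize package)
library(effsize)
cohen.d(purchase_rate ~ group, data = ab_data)
```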
Cohen's d
d estimate: -0.8126839 (large)
95 percent confidence interval:
lower upper
-1.3504365 -0.2749313
We also compute Cohen's d to examine the effect size. Cohen's d = -0.81 indicates a large effect, meaning the difference in purchase rates between the test and control groups is not only statistically significant but also practically meaningful. The negative sign simply reflects the direction of the comparison (test > control).
A natural question is what happens when we control for other factors, such as reach and the number of clicks on the website. To answer it, we fit a linear regression controlling for reach and the number of website clicks.
4. Regression-Adjusted Estimation
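The model is fit with `lm()` using the formula shown in the output:

```r
# OLS regression of purchase rate on treatment plus covariates
ols_fit <- lm(purchase_rate ~ group + reach + number_of_website_clicks,
              data = ab_data)
summary(ols_fit)
```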
Call:
lm(formula = purchase_rate ~ group + reach + number_of_website_clicks,
data = ab_data)
Residuals:
Min 1Q Median 3Q Max
-0.006541 -0.002240 -0.000463 0.002070 0.010652
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.074e-02 2.516e-03 4.270 7.64e-05 ***
grouptest 5.079e-04 1.115e-03 0.456 0.650476
reach -7.740e-08 1.866e-08 -4.149 0.000115 ***
number_of_website_clicks 2.180e-07 2.755e-07 0.791 0.432112
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.003562 on 56 degrees of freedom
Multiple R-squared: 0.3713, Adjusted R-squared: 0.3376
F-statistic: 11.02 on 3 and 56 DF, p-value: 8.623e-06
The regression model explains about 34% of the variance in purchase rates (Adjusted R² = 0.34). The test group does not show a statistically significant effect after adjusting for reach and number of website clicks (p = 0.65). However, reach has a significant negative effect (p < 0.001), suggesting that as reach increases, the purchase rate tends to decrease slightly.
Normality Check
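Each group is tested separately, as the output below shows:

```r
# Shapiro-Wilk test of normality, per group
shapiro.test(ab_data$purchase_rate[ab_data$group == "control"])
shapiro.test(ab_data$purchase_rate[ab_data$group == "test"])
```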
Shapiro-Wilk normality test
data: ab_data$purchase_rate[ab_data$group == "control"]
W = 0.93519, p-value = 0.06754
Shapiro-Wilk normality test
data: ab_data$purchase_rate[ab_data$group == "test"]
W = 0.88385, p-value = 0.003459
We tested the control and test data for normality. The control group does not significantly depart from normality at the 5% level (p = 0.068), but the test group does (p = 0.003). Because the normality assumption fails for at least one group, the non-parametric test above and the bootstrap below provide a safer basis for inference.
Someone might ask, "How sure are we that the test group really outperforms the control group in purchase rate?" To answer this, we compute a bootstrapped confidence interval.
Bootstrapping Confidence Intervals
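One way to produce a percentile interval like the tibble below; the resampling scheme (resampling within each group) and the seed are assumptions:

```r
# Bootstrap the difference in mean purchase rate (test - control), 1,000 resamples
set.seed(42)  # assumption: the original seed is unknown
boot_diffs <- replicate(1000, {
  resampled <- ab_data %>%
    group_by(group) %>%
    slice_sample(prop = 1, replace = TRUE) %>%  # resample within each group
    ungroup()
  mean(resampled$purchase_rate[resampled$group == "test"]) -
    mean(resampled$purchase_rate[resampled$group == "control"])
})
tibble(lower_ci = quantile(boot_diffs, 0.025),
       upper_ci = quantile(boot_diffs, 0.975))
```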
# A tibble: 1 × 2
lower_ci upper_ci
<dbl> <dbl>
1 0.00137 0.00549
The histogram shows the distribution of differences in average purchase rates between the test and control groups across 1,000 bootstrap samples. Since this distribution is centered above zero and the 95% confidence interval does not include zero, it suggests the test group likely has a real positive effect on purchase rate. Bootstrapping offers a robust way to estimate uncertainty without relying on normality assumptions.
Bayesian A/B Test
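The model info below matches rstanarm's `stan_glm()` with its default weakly informative priors; a minimal sketch:

```r
# Bayesian linear regression mirroring the frequentist specification
library(rstanarm)
bayes_fit <- stan_glm(
  purchase_rate ~ group + reach + number_of_website_clicks,
  data = ab_data,
  family = gaussian()
)
summary(bayes_fit)
```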
Model Info:
function: stan_glm
family: gaussian [identity]
formula: purchase_rate ~ group + reach + number_of_website_clicks
algorithm: sampling
sample: 4000 (posterior sample size)
priors: see help('prior_summary')
observations: 60
predictors: 4
Estimates:
mean sd 10% 50% 90%
(Intercept) 0.0 0.0 0.0 0.0 0.0
grouptest 0.0 0.0 0.0 0.0 0.0
reach 0.0 0.0 0.0 0.0 0.0
number_of_website_clicks 0.0 0.0 0.0 0.0 0.0
sigma 0.0 0.0 0.0 0.0 0.0
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 0.0 0.0 0.0 0.0 0.0
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0 1.0 2150
grouptest 0.0 1.0 2041
reach 0.0 1.0 2118
number_of_website_clicks 0.0 1.0 2483
sigma 0.0 1.0 2416
mean_PPD 0.0 1.0 3786
log-posterior 0.0 1.0 1436
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
The Bayesian model provides a full posterior distribution of plausible effect sizes rather than a single estimate. Note that the estimates print as 0.0 only because the default summary rounds to one decimal place and purchase rates are on the order of 0.005; re-printing the summary with more digits (e.g., summary(bayes_fit, digits = 4)) reveals the actual values. Consistent with the regression-adjusted frequentist model, the posterior for grouptest is centered near zero relative to its uncertainty, so the model does not find strong evidence that the test group or the other predictors affect the purchase rate once reach and clicks are accounted for.
What is important here is that all Rhat values equal 1, indicating the MCMC chains converged well. The model ran correctly, even though it did not detect meaningful effects.
The plot shows the estimated effect of each variable in the model (e.g., grouptest, reach) along with uncertainty.
- Each point is the mean estimate of the effect.
- The horizontal lines are credible intervals (uncertainty range).
- If the interval crosses zero, it means the model is uncertain about whether the effect is truly positive or negative.
In this case, the grouptest estimate is very close to zero and its interval overlaps zero, meaning the Bayesian model found no strong evidence that the test group had a different purchase rate than the control.
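A sketch of how such an interval plot could be produced from the model object in the earlier sketch (`bayes_fit` is the name assumed there); `plot()` on an rstanarm fit returns a ggplot, so a reference line at zero can be layered on:

```r
# Posterior interval plot for the coefficients; a dashed line marks zero
plot(bayes_fit, pars = c("grouptest", "reach", "number_of_website_clicks")) +
  ggplot2::geom_vline(xintercept = 0, linetype = "dashed")
```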
Conclusion
While the simple frequentist comparisons (t-test, Wilcoxon test, bootstrap, and effect size) identified a statistically and practically significant improvement in purchase rate for the test group, neither the covariate-adjusted regression nor the Bayesian model detected a clear group effect once reach and website clicks were accounted for. This discrepancy highlights the importance of using multiple approaches to validate business impact.
From a business perspective, the test group achieved a roughly 65% higher purchase rate, especially among highly engaged users (those with high click-through rates). Even though the adjusted models were more conservative in their estimates, the consistent uplift across visualizations and funnel performance suggests the new variant has real potential to drive conversions.
Recommendation
The test variant is a strong candidate for broader rollout, particularly if paired with targeted delivery to high-engagement user segments. Future experiments could focus on optimizing elements that drive this engagement.