This dataset comprises over 82,000 observations on customers of a wine retailer. Both distributions look quite similar and approximately normal (a nice bell shape). This already suggests that a t-test could be appropriate for testing differences in the average amount purchased, provided the conditions for such a test are actually met.
Let’s check the average purchase amount according to the email version customers received:
# A tibble: 2 × 2
  group   Avg_purchase
  <fct>          <dbl>
1 email_A         202.
2 email_B         124.
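This summary can be reproduced with a short dplyr pipeline; the following is a minimal sketch, assuming the data frame is called d_treat with columns group and purch2 (the names that appear in the test outputs below):

library(dplyr)

d_treat %>%
  group_by(group) %>%                       # one row per email version
  summarise(Avg_purchase = mean(purch2))    # mean purchase amount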
Summary statistics for the other customer variables:
      open            click           days_since        visits
 Min.   :0.000   Min.   :0.0000   Min.   :  0.00   Min.   : 0.000
 1st Qu.:0.000   1st Qu.:0.0000   1st Qu.: 26.00   1st Qu.: 4.000
 Median :1.000   Median :0.0000   Median : 63.00   Median : 6.000
 Mean   :0.685   Mean   :0.1125   Mean   : 89.99   Mean   : 5.944
 3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.:124.00   3rd Qu.: 7.000
 Max.   :1.000   Max.   :1.0000   Max.   :992.00   Max.   :51.000
You can now assess whether the average purchase amount depends on the email version in a statistically significant manner. The parametric method for this is a t-test, which compares the means of two groups. One of the conditions for conducting a (Student's) t-test is equal variance. The violin plots showed that the distributions appear to be similarly spread, but you can check this formally with Levene's test, which assesses homogeneity of variance. Its p-value (= 0.6947) is not significant at the 5% level, so there is no evidence of a difference in variance between the groups.
Levene's Test for Homogeneity of Variance (center = median)
         Df F value Pr(>F)
group     1   0.154 0.6947
      82656
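This test can be run with the leveneTest() function from the car package; a minimal sketch, assuming the same d_treat data frame:

library(car)

# Levene's test for homogeneity of variance across email versions;
# center = median (the robust default) matches the output above
leveneTest(purch2 ~ group, data = d_treat)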
Another condition for conducting a t-test is normality, which you can assess visually with a histogram, a density plot or a Q-Q plot of the response variable, purchase amount. The violin plots already suggested that the distributions are approximately normal. Formally, you can check normality with an Anderson-Darling test. For email version A the p-value is 0.7765, indicating that the distribution does not differ statistically from a normal distribution at the 5% significance level; for email version B the p-value is 0.584, so normality can be assumed for this group as well. A parametric method can therefore be employed, and you can go on to apply the t-test.
Anderson-Darling normality test
data: d_treat$purch2[d_treat$group == "email_A"]
A = 0.24006, p-value = 0.7765
Anderson-Darling normality test
data: d_treat$purch2[d_treat$group == "email_B"]
A = 0.29948, p-value = 0.584
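Output of this form is produced by the ad.test() function from the nortest package; a sketch, run once per group:

library(nortest)

# Anderson-Darling normality test on each group's purchase amounts
ad.test(d_treat$purch2[d_treat$group == "email_A"])
ad.test(d_treat$purch2[d_treat$group == "email_B"])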
Two Sample t-test
data: purch2 by group
t = 301.82, df = 82656, p-value < 2.2e-16
alternative hypothesis: true difference in means between group email_A and group email_B is not equal to 0
95 percent confidence interval:
78.15762 79.17935
sample estimates:
mean in group email_A mean in group email_B
202.4831 123.8146
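A sketch of the call behind this output, using base R's t.test(); var.equal = TRUE requests the pooled ("Two Sample") variant, which Levene's test justified:

# Student's two-sample t-test on purchase amount by email version
t.test(purch2 ~ group, data = d_treat, var.equal = TRUE)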
The t-test's p-value (< 2.2e-16) shows that the difference in means is highly significant. After running the test you can derive the effect size using Cohen's d, a standardized measure of the difference between two means. An effect size of 0.2 is considered small, 0.5 medium, and 0.8 large. The effect size here is 2.10, far above the "large" threshold, pointing to a very large difference in the average amount purchased.
Cohen's d |       95% CI
------------------------
2.10      | [2.31, 2.31]

- Estimated using pooled SD.
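Output of this form can be produced by cohens_d() from the effectsize package; a minimal sketch:

library(effectsize)

# Standardized difference in group means; the pooled SD is the default
cohens_d(purch2 ~ group, data = d_treat)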
To derive the power of the test rather than the sample size required, you can use the number of subjects in each group. This dataset contains 82,658 customers, so 41,329 in each group. Entering the previously obtained effect size and the A/B test's p-value (as significance level) into the power calculation yields a power of 1. This means the test detects the difference between the groups essentially 100% of the time, so there is virtually no chance of a Type II error (failing to detect a real effect). Such near-perfect power indicates that the significant difference found by the t-test is highly reliable. To obtain reliable results with minimal investment, however, you could aim for an acceptable power level of 80-90%, which means the sample size needed to detect the effect would be much lower, sparing resources.
Two-sample t test power calculation
n = 41329
d = 2.1
sig.level = 2.2e-16
power = 1
alternative = two.sided
NOTE: n is number in *each* group
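This calculation matches what pwr.t.test() from the pwr package reports when you supply n, d and sig.level and leave power to be solved for; a sketch:

library(pwr)

# Power achieved with the full sample: 41,329 per group, the observed
# effect size, and the t-test's p-value used as significance level
pwr.t.test(n = 41329, d = 2.1, sig.level = 2.2e-16)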
To achieve a power of 80% (i.e. reliably detect the effect 80% of the time) with the effect size calculated by Cohen's d, you can run the power analysis the other way around and solve for the sample size. As it turns out, roughly 10 customers (5 in each group, rounding up n = 4.74) would have sufficed. Such a low required sample size supports the conclusion that email version A yields clearly better results and you do not actually need to keep sending the other email version. Moreover, the investment in designing and sending the second version to more than 41,000 customers seems excessive compared with what a much leaner experiment would have cost.
Two-sample t test power calculation
n = 4.742304
d = 2.1
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
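The same function solves for the sample size when you fix the power instead; a sketch:

library(pwr)

# Smallest n per group for 80% power at the conventional 5% level
pwr.t.test(d = 2.1, power = 0.8, sig.level = 0.05)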