Exploratory Data Analysis

This dataset comprises more than 82,000 observations on customers of a wine retailer. The distributions of purchase amount for the two email versions look quite similar and roughly normal (a symmetric bell shape). This is a first indication that a t-test could be appropriate for testing differences in average purchase amount, provided the conditions for such a test are actually met.

[Figure: violin plots of purchase amount by email version]

Summary Statistics

Let’s check the average purchase amount according to the email version customers received:

# A tibble: 2 × 2
  group   Avg_purchase
  <fct>          <dbl>
1 email_A         202.
2 email_B         124.

Summary of the other variables in the dataset:

      open           click          days_since         visits      
 Min.   :0.000   Min.   :0.0000   Min.   :  0.00   Min.   : 0.000  
 1st Qu.:0.000   1st Qu.:0.0000   1st Qu.: 26.00   1st Qu.: 4.000  
 Median :1.000   Median :0.0000   Median : 63.00   Median : 6.000  
 Mean   :0.685   Mean   :0.1125   Mean   : 89.99   Mean   : 5.944  
 3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.:124.00   3rd Qu.: 7.000  
 Max.   :1.000   Max.   :1.0000   Max.   :992.00   Max.   :51.000  

Assumptions for A/B Testing

You can now assess whether the average purchase amount depends significantly on the email version. The parametric way to check this is a t-test, which compares the mean amount between two groups. One condition for conducting a t-test is equal variance. The violin plots suggested that the two distributions are similarly spread, but you can check this formally. Levene's test assesses homogeneity of variance. Its p-value (0.6947) is well above 0.05, so the test gives no evidence at the 5% level that the variances of the two groups differ.

Levene's Test for Homogeneity of Variance (center = median)
         Df F value Pr(>F)
group     1   0.154 0.6947
      82656               
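To make the mechanics of the test above concrete, here is a minimal sketch in Python (standard library only, rather than the R used for this report). It assumes the `center = median` variant shown in the output, often called the Brown-Forsythe test: a one-way ANOVA F statistic computed on each observation's absolute deviation from its group median. The function name and the toy values are hypothetical and are not the retailer's data; the p-value step (an F distribution lookup) is left out.

```python
from statistics import mean, median


def brown_forsythe_f(*groups):
    """F statistic of Levene's test with center = median."""
    # Absolute deviation of every observation from its own group's median.
    devs = []
    for g in groups:
        med = median(g)
        devs.append([abs(x - med) for x in g])

    k = len(devs)                          # number of groups
    n_total = sum(len(z) for z in devs)    # total number of observations
    grand = mean(x for z in devs for x in z)

    # One-way ANOVA on the deviations: between-group vs within-group variance.
    between = sum(len(z) * (mean(z) - grand) ** 2 for z in devs) / (k - 1)
    within = sum((x - mean(z)) ** 2 for z in devs for x in z) / (n_total - k)
    return between / within


# Made-up toy values for illustration only.
toy_a = [1, 2, 3, 4, 5]
toy_b = [2, 4, 6, 8, 10]
f_stat = brown_forsythe_f(toy_a, toy_b)
print(round(f_stat, 4))  # F on 1 and 8 degrees of freedom
```

An F value close to 0, such as the 0.154 reported above, means the average spread around the median is nearly the same in both groups.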

Another condition for conducting a t-test is normality, which you can inspect visually with a histogram, a density plot or a Q-Q plot of the response variable, purchase amount. The violin plots already suggested that the distributions are approximately normal. Formally, you can check normality with an Anderson-Darling test. The p-value for email version A is 0.7765 and the p-value for email version B is 0.584. Neither is below 0.05, so in neither group does the distribution differ significantly from a normal distribution at the 5% level. A parametric method is therefore reasonable, and you can go on to apply the t-test.


    Anderson-Darling normality test

data:  d_treat$purch2[d_treat$group == "email_A"]
A = 0.24006, p-value = 0.7765

    Anderson-Darling normality test

data:  d_treat$purch2[d_treat$group == "email_B"]
A = 0.29948, p-value = 0.584
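The step from the A statistic to the p-value can be sketched as follows, in Python with the standard library only. The piecewise coefficients are an assumption: they are the commonly cited approximation for the case where mean and variance are estimated from the data, and they appear to be the ones behind the output above, since they reproduce both reported p-values. The group size of 41329 is also an assumption (an even split of the 82658 observations), and the function name is hypothetical.

```python
import math


def ad_pvalue(a_stat, n):
    """Approximate p-value for an Anderson-Darling normality statistic."""
    # Small-sample adjustment; negligible for groups this large.
    aa = (1 + 0.75 / n + 2.25 / n**2) * a_stat
    if aa < 0.2:
        return 1 - math.exp(-13.436 + 101.14 * aa - 223.73 * aa**2)
    if aa < 0.34:
        return 1 - math.exp(-8.318 + 42.796 * aa - 59.938 * aa**2)
    if aa < 0.6:
        return math.exp(0.9177 - 4.279 * aa - 1.38 * aa**2)
    if aa < 10:
        return math.exp(1.2937 - 5.709 * aa + 0.0186 * aa**2)
    return 3.7e-24


n_group = 41329  # assumed: half of the 82658 observations
p_a = ad_pvalue(0.24006, n_group)  # statistic reported for email_A
p_b = ad_pvalue(0.29948, n_group)  # statistic reported for email_B
print(round(p_a, 4), round(p_b, 3))
```

Smaller A values mean the empirical distribution sits closer to the fitted normal curve, which is why the smaller statistic for email_A comes with the larger p-value.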

A/B Test: t-test

Now that equal variances and normality of the response variable in both groups are plausible, you can run a parametric test to compare means: the t-test. The p-value is below 2.2e-16, far below the 5% significance level, so the average amount purchased differs significantly between the email versions. The 95% confidence interval for the difference in means (78.16 to 79.18) indicates that the better version yields a purchase amount that is about $78-79 higher: on average, email version A produced $202.48 in purchases, while version B achieved only $123.81.


    Two Sample t-test

data:  purch2 by group
t = 301.82, df = 82656, p-value < 2.2e-16
alternative hypothesis: true difference in means between group email_A and group email_B is not equal to 0
95 percent confidence interval:
 78.15762 79.17935
sample estimates:
mean in group email_A mean in group email_B 
             202.4831              123.8146 
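As a consistency check on the output above, the following Python sketch (standard library only) recovers the standard error and the t statistic from the reported means and confidence interval. It assumes that, with more than 82,000 degrees of freedom, the t critical value is practically equal to the normal quantile of about 1.96; the variable names are illustrative.

```python
from statistics import NormalDist

# Values copied from the t-test output.
mean_a, mean_b = 202.4831, 123.8146
ci_low, ci_high = 78.15762, 79.17935

diff = mean_a - mean_b                       # estimated difference in means
z_crit = NormalDist().inv_cdf(0.975)         # about 1.96, stands in for the t quantile
std_err = (ci_high - ci_low) / (2 * z_crit)  # half-width divided by the critical value
t_stat = diff / std_err                      # should land close to the reported 301.82

print(round(diff, 4), round(std_err, 4), round(t_stat, 1))
```

The difference of about 78.67 sits in the middle of the interval, and the very small standard error (about 0.26) explains why the t statistic is so large.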

Effect Size and Power Analysis

After running the test you can quantify the effect size with Cohen's d, a standardized measure of the difference between two means. By Cohen's conventions, an effect size of 0.2 is considered small, 0.5 medium, and 0.8 large. The effect size here is 2.10, far above the threshold for a large effect: the mean purchase amounts of the two groups lie about two pooled standard deviations apart.

Cohen's d |       95% CI
------------------------
2.10      | [2.31, 2.31]

- Estimated using pooled SD.
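For two groups of equal size n and a pooled-SD t-test, Cohen's d can be recovered from the t statistic as d = t * sqrt(2 / n). The Python sketch below (standard library only) applies this to the reported t value. The equal split of 41329 customers per group is an assumption, and the pooled standard deviation is implied by the calculation rather than reported in the output.

```python
import math

t_stat = 301.82        # from the t-test output
n_per_group = 41329    # assumed: 82658 observations split evenly
mean_diff = 202.4831 - 123.8146

# Cohen's d from the t statistic (equal group sizes, pooled SD).
d = t_stat * math.sqrt(2 / n_per_group)

# Pooled standard deviation implied by d = mean_diff / pooled_sd.
pooled_sd = mean_diff / d

print(round(d, 2), round(pooled_sd, 1))
```

Read this way, a d of about 2.1 says the roughly $79 gap between the email versions is about twice the typical spread of purchase amounts within a group.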

To derive the power of the test rather than the required sample size, you can use the number of subjects in each group. This dataset has 82658 observations, so 41329 in each group. Entering this group size, the previously obtained effect size and a significance level into the power calculation gives a power of 1. The output below uses the t-test's p-value bound (2.2e-16) as the significance level; the conventional choice is the pre-specified level of 0.05, and the power is 1 in either case. A power of 1 means that, if the true effect is as large as the estimate, the test would detect it essentially 100% of the time, so the chance of missing it (a Type II error) is approximately zero. This nearly perfect power reflects the combination of a very large effect and a very large sample. To obtain reliable results with minimal investment, you could instead aim for an acceptable power level of 80-90%, for which a much smaller sample suffices and resources can be spared.


     Two-sample t test power calculation 

              n = 41329
              d = 2.1
      sig.level = 2.2e-16
          power = 1
    alternative = two.sided

NOTE: n is number in *each* group
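The power figure can be approximated by hand. The Python sketch below (standard library only) uses the normal approximation to the two-sample t-test, which is an assumption that works well for groups this large but is not the exact noncentral-t calculation behind the output above. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)      # rejection threshold
    ncp = d * math.sqrt(n_per_group / 2)   # expected size of the test statistic
    # Probability that the statistic falls in either rejection region.
    return z.cdf(ncp - z_crit) + z.cdf(-ncp - z_crit)


power_05 = power_two_sample(2.1, 41329, alpha=0.05)
power_tiny_alpha = power_two_sample(2.1, 41329, alpha=2.2e-16)
print(power_05, power_tiny_alpha)
```

With an expected test statistic of roughly 300, the rejection threshold is cleared by a wide margin whichever significance level is used, so the power is indistinguishable from 1.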

Sample Size Calculation

To find the sample size needed for a power of 80% (i.e. to detect the effect 80% of the time) given the effect size estimated with Cohen's d, you can run a power analysis at the 5% significance level. As it turns out, about 5 customers per group (10 in total) would suffice, since the calculated 4.74 per group rounds up to 5. Such a small required sample reflects how much better email version A performs, which supports the conclusion that you do not actually need to send the other version. Furthermore, the investment made in designing and processing the other email version seems excessive compared with the gains from a much more efficient email campaign.


     Two-sample t test power calculation 

              n = 4.742304
              d = 2.1
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
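A rough cross-check of this sample size is possible with the textbook normal-approximation formula n = 2 * ((z_alpha + z_power) / d)^2 per group. The Python sketch below (standard library only) is an approximation under that assumption, not the exact calculation behind the output above, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group_normal(d, power=0.8, alpha=0.05):
    """Approximate per-group sample size (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_power = z.inv_cdf(power)           # about 0.84 for 80% power
    return 2 * ((z_alpha + z_power) / d) ** 2


n_approx = n_per_group_normal(2.1)       # about 3.56 per group
n_reported = 4.742304                    # from the power calculation output
print(round(n_approx, 2), math.ceil(n_reported))
```

The approximation comes out lower than the reported 4.74 because it ignores the extra uncertainty of estimating the variance from a handful of observations, which the t-based calculation accounts for. With samples this small the t-based figure is the one to use, and it is rounded up to 5 per group.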