Exploratory Data Analysis

This dataset was collected during the COVID-19 pandemic and comprises 1,078 observations on people in India aged 7 to 59, of whom 60% live in Delhi and 40% outside Delhi. The boxplots show that the distribution of time spent on social media is right-skewed with several outliers, and that the median time spent is similar for the two groups. This already suggests that a significant difference may be hard to detect, and the skew hints at which type of test is appropriate.

Assumptions for A/B Testing

Let’s assess whether the average time people spent on social media varies with the technology used for online class during COVID. The parametric option is a two-sample t-test for comparing means. One of the assumptions of the standard t-test is equal variance across groups. The boxplots suggested that the distributions have similar variance, but you can check that formally with Levene’s test for homogeneity of variance. The p-value of 0.0863 is above the 5% significance level, so the test does not reject the null hypothesis of equal variances: there is no significant evidence that the variance differs between the groups.

Levene's Test for Homogeneity of Variance (center = median)
        Df F value Pr(>F)  
group    1  2.9474 0.0863 .
      1076                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
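
To make the statistic above less of a black box, here is a minimal Python sketch of the median-centered Levene (Brown-Forsythe) F statistic, using only the standard library. The function name `levene_f_median` and the toy data are illustrative assumptions and not part of the original analysis. The sketch stops at the F value, because the standard library has no F distribution from which to derive the p-value.

```python
from statistics import mean, median


def levene_f_median(*groups):
    """Return the median-centered Levene (Brown-Forsythe) F statistic."""
    # Absolute deviation of each observation from its own group median
    deviations = [[abs(v - median(g)) for v in g] for g in groups]
    k = len(deviations)
    n_total = sum(len(d) for d in deviations)
    grand_mean = sum(sum(d) for d in deviations) / n_total
    group_means = [mean(d) for d in deviations]
    # One-way ANOVA on the absolute deviations
    between = sum(len(d) * (m - grand_mean) ** 2
                  for d, m in zip(deviations, group_means))
    within = sum((z - m) ** 2
                 for d, m in zip(deviations, group_means) for z in d)
    return (n_total - k) / (k - 1) * between / within


# Toy data (hypothetical, not taken from the dataset)
f_value = levene_f_median([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(f_value, 4))  # 2.0571
```

A larger F value means the groups differ more in spread. In the output above, F = 2.9474 on 1 and 1076 degrees of freedom corresponds to the reported p-value of 0.0863.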

Another assumption of the t-test is normality, which you can inspect with a histogram or density plot of the response variable, time spent on social media. The boxplots already showed that the distributions are skewed to the right, hence non-normal. Formally, you can check normality with a Shapiro-Wilk test. The p-value of < 2.2e-16 is highly statistically significant, so the null hypothesis that this variable is normally distributed is rejected. The normality assumption of the t-test is therefore violated, and a nonparametric method is the safer choice.


    Shapiro-Wilk normality test

data:  covid$Time.spent.on.social.media
W = 0.83299, p-value < 2.2e-16

A/B Test: Mann-Whitney U test

Since the variances can be treated as equal but the response variable is not normally distributed, you can instead conduct a nonparametric alternative to the t-test: the Mann-Whitney U test, also known as the Wilcoxon rank-sum test, here with continuity correction. The p-value of this test is 0.0061, which is statistically significant at the 5% significance level, so the time spent on social media differs significantly depending on the technology used for online class.


    Wilcoxon rank sum test with continuity correction

data:  Time.spent.on.social.media by Medium.for.online.class
W = 158910, p-value = 0.0061
alternative hypothesis: true location shift is not equal to 0
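
As a rough illustration of what the test computes, the following Python sketch implements the two-sided Mann-Whitney U test with the normal approximation, tie correction, and continuity correction, using only the standard library. The function name `mann_whitney_u` and the toy data are assumptions made for illustration. The sketch is meant for moderately large samples and assumes the pooled values are not all identical.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Return (U, two-sided p-value) using the normal approximation."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    counts = Counter(list(x) + list(y))
    # Assign midranks so that tied values share the average of their ranks
    rank_of, position = {}, 0
    for value in sorted(counts):
        ties = counts[value]
        rank_of[value] = position + (ties + 1) / 2
        position += ties
    # U for the first group (the statistic R reports as W)
    u = sum(rank_of[v] for v in x) - n1 * (n1 + 1) / 2
    # Standard deviation of U under the null, corrected for ties
    tie_term = sum(t ** 3 - t for t in counts.values())
    sigma = sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    # Continuity correction moves the difference 0.5 toward zero
    diff = u - n1 * n2 / 2
    correction = 0.5 if diff > 0 else -0.5 if diff < 0 else 0.0
    z = (diff - correction) / sigma
    p_value = min(1.0, 2 * (1 - NormalDist().cdf(abs(z))))
    return u, p_value


# Toy data (hypothetical, not taken from the dataset)
u, p = mann_whitney_u([3, 4, 5], [1, 2, 6])
print(u, round(p, 2))
```

With samples as small as the toy data an exact test would normally be preferred. The approximation is shown because it is the variant named in the output above.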

Effect Size and Power Analysis

After running the test, you can derive the effect size for the Mann-Whitney U test using the rank-biserial correlation. By a common rule of thumb, a correlation of 0.1 is small, 0.3 is medium, and 0.5 is large. The effect size here is 0.09, which points to a small effect: a small difference in the median time spent on social media with a laptop/desktop vs. with a smartphone.

r (rank biserial) |       95% CI
--------------------------------
0.09              | [0.03, 0.16]
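
The rank-biserial correlation has a simple interpretation: across all pairs made of one observation from each group, it is the share of pairs in which the first group is larger minus the share in which it is smaller. The Python sketch below computes it by direct pair counting. The function name `rank_biserial` and the toy data are illustrative assumptions, and the confidence interval shown in the output above is not reproduced.

```python
def rank_biserial(x, y):
    """Rank-biserial correlation of x versus y by pair counting."""
    # Pairs in which the observation from x is larger or smaller; ties count for neither
    wins = sum(1 for a in x for b in y if a > b)
    losses = sum(1 for a in x for b in y if a < b)
    return (wins - losses) / (len(x) * len(y))


# Toy data (hypothetical, not taken from the dataset)
r = rank_biserial([3, 4, 5], [1, 2, 6])
print(round(r, 3))  # 0.333
```

This is equivalent to r = 2U / (n1 * n2) - 1, where U is the Mann-Whitney statistic of the first group. Pair counting is quadratic in the sample size, which is fine for a dataset of about a thousand rows.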

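To derive the power of the test rather than the required sample size, you supply the number of subjects in each group. This dataset has 1,078 observations, taken here as 539 in each group. The previously obtained effect size and the A/B test’s p-value are entered into the power calculation as the effect size and the significance level, which gives a power of 0.1029. Taken at face value, this means the test would detect a true effect of this size only about 10% of the time, so there is roughly a 90% (1 - power) chance of a Type II error, that is, of missing a real difference. Two caveats apply. First, the significance level in a power calculation is normally the threshold fixed in advance (for example 0.05), not the observed p-value; entering 0.0061 makes the criterion stricter and lowers the computed power. Second, the calculation shown is for a difference of proportions, so treating the rank-biserial correlation as its effect size h is an approximation. Either way, power is low for an effect this small, so the significant Mann-Whitney U result should be interpreted with caution. To obtain more reliable results you should increase power, which can be achieved with a larger sample, a larger effect size, or both.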


     difference of proportion power calculation for binomial distribution (arcsine transformation) 

              h = 0.09
             n1 = 539
             n2 = 539
      sig.level = 0.0061
          power = 0.1029701
    alternative = two.sided

NOTE: different sample sizes
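
The power figure above can be reproduced with the normal approximation that underlies this kind of calculation. The Python sketch below is a minimal version using only the standard library. The function name `power_two_groups` is an assumption made for illustration, and the formula is the standard two-sided normal approximation for an effect size h with group sizes n1 and n2.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_groups(h, n1, n2, alpha=0.05):
    """Two-sided power for effect size h (normal approximation)."""
    critical = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = h * sqrt(n1 * n2 / (n1 + n2))
    # Probability of landing in either rejection region under the alternative
    upper = 1 - STD_NORMAL.cdf(critical - shift)
    lower = STD_NORMAL.cdf(-critical - shift)
    return upper + lower


# Inputs taken from the output above (significance level set to the p-value)
print(round(power_two_groups(0.09, 539, 539, alpha=0.0061), 3))  # 0.103

# Same inputs with the conventional 5% significance level
print(round(power_two_groups(0.09, 539, 539, alpha=0.05), 3))
```

The first call matches the reported power of about 0.103. The second shows how the result changes when the significance level is the conventional 0.05 rather than the observed p-value: the computed power rises, but stays well below the usual 80% target.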

Sample Size Calculation

To reach 80% power (i.e. detect a true effect 80% of the time), you can run a power calculation that solves for the sample size. With the effect size of 0.09 from the rank-biserial correlation and one group fixed at 539 subjects, no size of the other group reaches 80% power, so the calculation cannot return a solution. You can instead try a slightly higher effect size, for example 0.13, so that the power analysis is able to run and calculate the necessary sample size. It turns out that if one of the groups has 539 subjects, the other group needs about 3,357 subjects to detect an effect of that size with 80% power at the 5% significance level.


     difference of proportion power calculation for binomial distribution (arcsine transformation) 

              h = 0.13
             n1 = 539
             n2 = 3356.93
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: different sample sizes
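
The sample size above can be recovered by searching for the smallest second-group size that reaches the target power. The Python sketch below does this by bisection on the same normal approximation, using only the standard library. The function names `power_two_groups` and `solve_n2` are illustrative assumptions. The search returns `None` when the target is out of reach however large the second group becomes.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_groups(h, n1, n2, alpha=0.05):
    """Two-sided power for effect size h (normal approximation)."""
    critical = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = h * sqrt(n1 * n2 / (n1 + n2))
    return (1 - STD_NORMAL.cdf(critical - shift)) + STD_NORMAL.cdf(-critical - shift)


def solve_n2(h, n1, target_power=0.8, alpha=0.05):
    """Size of the second group needed to reach target_power, or None."""
    # As n2 grows without bound, n1 * n2 / (n1 + n2) approaches n1
    critical = STD_NORMAL.inv_cdf(1 - alpha / 2)
    limit_shift = h * sqrt(n1)
    limit = (1 - STD_NORMAL.cdf(critical - limit_shift)) + STD_NORMAL.cdf(-critical - limit_shift)
    if limit <= target_power:
        return None
    low, high = 2.0, 1e12
    if power_two_groups(h, n1, high, alpha) < target_power:
        return None  # the target is only reached beyond the search range
    # Power increases with n2, so bisection narrows in on the crossing point
    for _ in range(200):
        mid = (low + high) / 2
        if power_two_groups(h, n1, mid, alpha) < target_power:
            low = mid
        else:
            high = mid
    return high


print(round(solve_n2(0.13, 539)))  # 3357
print(solve_n2(0.09, 539))         # None
```

The second call illustrates why a larger effect size had to be chosen: with h = 0.09 and one group fixed at 539, the achievable power levels off below 80%.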