Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

library(parallel)  # provides mclapply(), used for the bootstrap simulations below

1.Introduction

This report is the final project of Udacity's A/B Testing course. The structure follows the structure Udacity suggests for the project report, with further explanations wherever they are needed.

The project is, not surprisingly, about an A/B test: an experiment that Udacity, an e-learning platform, wants to run on the users of its courses.

More specifically, an experimental step would be added after the “Start Free Trial” button on the course page. In this experiment, after clicking the “Start Free Trial” button, users would be shown a message asking how much time per week they intend to invest in the course. If the input is below 5 hours per week, “a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free.”

The goal of the experiment is to reduce the number of students who would quit the course during the initial 14-day free-trial period, while leaving the number of students who continue past this 14-day period intact. I would like to call the first group of students “frustrated”, and the second group “resolute” students.

I have put together the following image based on the data provided in the Final Project Instructions and Baseline Values.

[Figure: Udacity Project]

Further, from the instructions we learn:

The hypothesis was that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn’t have enough time—without significantly reducing the number of students to continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches’ capacity to support students who are likely to complete the course.

The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users that do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.

2.Metric Choice

We need two groups of metrics: one group for measuring the effect of the change that is imposed, and one group for the sanity check, i.e. validation of the test.

  • Number of Cookies: That is, number of unique cookies to view the course overview page. (dmin=3000) - This metric is definitely not affected by the experiment, so it cannot be an evaluation metric. However, it can be an “invariant metric”, since the number of cookies should be more or less identical in the experiment and control groups.
  • Number of user-ids: That is, number of users who enroll in the free trial. (dmin=50) - This could be an evaluation metric, since we expect the number of students who enroll to drop in the experiment group.
  • Number of clicks: That is, number of unique cookies to click the “Start free trial” button (which happens before the free trial screener is triggered). (dmin=240) - This metric cannot be an evaluation metric since at this point of the process the change has not affected the users yet. It can be an invariant metric.
  • Click-through-probability: That is, number of unique cookies to click the “Start free trial” button divided by number of unique cookies to view the course overview page. (dmin=0.01) - Both its numerator and denominator are measured before the change takes effect, so it can be an invariant metric.
  • Gross conversion: That is, number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the “Start free trial” button. (dmin= 0.01) - It can be an evaluation metric. Its value should drop in the experiment group, since the number of enrollments is expected to decrease.
  • Retention: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout. (dmin=0.01) - This can be an evaluation metric. Since we expect the number of enrollments to drop while the number of payments stays more or less the same, this metric is expected to be higher in the experiment group compared to the control group.
  • Net conversion: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the “Start free trial” button. (dmin= 0.0075) - This can be an evaluation metric. The number of clicks is unaffected by the change, and the number of payments is expected to remain more or less the same, so the change in this metric is expected to be insignificant.

In order to choose the evaluation metrics, we should remember what we wanted to evaluate: first, the number of “frustrated” students, and second, the number of “resolute” students.

The number of user-ids would be lower in the experiment group, but what does it measure? The possible reduction could be related to both the “frustrated” and the “resolute” groups, so this metric is not distinctive. The Gross Conversion can measure the “frustrated” students; for this metric the unit of analysis (a cookie that clicks) matches the unit of diversion. The Net Conversion can measure the “resolute” students; this is a must-have metric. The Retention can measure the “frustrated” students, hand-in-hand with the net conversion.

So the last three can be chosen as evaluation metrics. I choose all three for now, but the required number of observations may change this set.

3.Measure of Variability

The template asks us to calculate the variability of the metrics, i.e. their standard deviation. What is meant is probably the “standard deviation of the sampling distribution”, or standard error (SE), rather than the plain SD. The standard error of these metrics depends on the sample size and can be computed either analytically or empirically. I would rather do it empirically, but the instructions insist on the analytical method.

Since we are asked to estimate the variability for a sample of 5000 unique cookies visiting the overview page, we need to figure out the effective sample size for each metric. That sample size equals the metric's denominator count within those 5000 pageviews.
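As a quick sketch, these denominators follow directly from the baseline values (a click-through probability of 0.08 and a gross conversion of 0.20625):

# Scale the baseline funnel down to a 5000-pageview sample
pageviews <- 5000
ctp <- 0.08            # baseline click-through probability
gross_conv <- 0.20625  # baseline P(enroll | click)

clicks <- pageviews * ctp           # 400: denominator for gross and net conversion
enrollments <- clicks * gross_conv  # 82.5: denominator for retention
c(clicks = clicks, enrollments = enrollments)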

The formula used here to estimate the standard error is based on the CLT: the sampling distribution of each proportion metric is approximately normal, so SE = sqrt(p*(1-p)/n).

# Gross conversion: baseline P(enroll | click), n = 5000 * 0.08 = 400 clicks
p_enroll_click <- 0.20625
n <- 400
SE_gross_conversion <- sqrt(p_enroll_click*(1-p_enroll_click)/n)
SE_gross_conversion
## [1] 0.0202306
# Retention: baseline P(payment | enroll), n = 400 * 0.20625 = 82.5 enrollments
p_payment_enroll <- 0.53
n <- 82.5 
SE_retention <- sqrt(p_payment_enroll*(1-p_payment_enroll)/n)
SE_retention
## [1] 0.05494901
# Net conversion: baseline P(payment | click), n = 400 clicks
p_payment_click <- 0.1093125
n <- 400
SE_net_conversion <- sqrt(p_payment_click*(1-p_payment_click)/n)
SE_net_conversion
## [1] 0.01560154

Still, there are questions. If we have these ratios from historical data, i.e. retrospective studies, shouldn't we also have the standard deviation of each metric in the underlying population? Moreover, the theoretical distribution may not hold, so it would be better to gather data and use bootstrapping to estimate these standard deviations, or better said, standard errors.
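As a minimal sketch of what such an empirical estimate could look like, assuming we had row-level click/enrollment data (here simulated at the baseline gross-conversion rate, so the numbers are purely illustrative):

# Hypothetical empirical SE via bootstrap, with data simulated at the baseline rate
set.seed(1)
sim_clicks <- rbinom(n = 400, size = 1, prob = 0.20625)  # 1 = enrolled, 0 = not enrolled

boot_means <- replicate(10000, mean(sample(sim_clicks, replace = TRUE)))
sd(boot_means)  # empirical estimate of the SE of gross conversion at n = 400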

Finally, why do we need these values if we are not supposed to use them?

4.Sizing

This is the more interesting part of the report. In any experiment design, determining the sample size is a fundamental step. Based on the distribution of the metric, the sample size is a function of parameters such as alpha (the significance level), beta (the false negative or type II error probability), the baseline conversion, the practical significance, and so on.

If we use this sample-size calculator, then the sample size for each metric would be as follows:

Metric              Sample Size
Net Conversion      27,413
Retention           39,115
Gross Conversion    25,835

These are the sample sizes for one group of the experiment, so for the whole experiment we need to double each figure.
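As a rough cross-check on the calculator, base R's power.prop.test gives per-group sizes in the same ballpark; a sketch for gross conversion, using its baseline rate and practical significance:

# Per-group number of clicks needed to detect a 0.01 drop in gross conversion
# (baseline 0.20625) at alpha = 0.05 with 80% power
power.prop.test(p1 = 0.20625, p2 = 0.20625 - 0.01,
                sig.level = 0.05, power = 0.8)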

There are two points here. First, I did not adjust the significance level for the fact that we have multiple metrics. Bonferroni is one approach to such an adjustment, and there are others. I did not adjust because these metrics are closely related to each other, and Bonferroni would be too conservative, causing inflation of the sample sizes. Even now these sample sizes are very high, as we will see, once we consider the number of unique cookies visiting the overview page. It is still possible to reduce alpha in order to reduce the false positive probability of the whole test.
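For reference, a Bonferroni correction would simply split alpha evenly across the three evaluation metrics:

# Bonferroni-adjusted per-metric alpha (not applied in this report)
alpha <- 0.05
alpha / 3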

The second point is about the previous step, where the variability of the metrics was calculated. Did we use those variabilities? No! Using the sample-size calculator here is a simplification of the problem. It is possible instead to use the functions below, which incorporate the standard error directly.

## Strategy: For a bunch of Ns, compute the z_star by achieving desired alpha, then
## compute what beta would be for that N using the acquired z_star. 
## Pick the smallest N at which beta crosses the desired value

# Inputs:
#   The desired alpha for a two-tailed test
# Returns: The z-critical value
get_z_star = function(alpha) {
    return(-qnorm(alpha / 2))
}

alpha <- 0.05
retention_z_star <- get_z_star(alpha = alpha)
net_conversion_z_star <- get_z_star(alpha = alpha)
gross_conversion_z_star <- get_z_star(alpha = alpha)

# Inputs:
#   z-star: The z-critical value
#   s: The standard error of the metric at N=1
#   d_min: The practical significance level
#   N: The sample size of each group of the experiment
# Returns: The beta value of the two-tailed test
get_beta = function(z_star, s, d_min, N) {
    SE = s /  sqrt(N)
    return(pnorm(z_star * SE, mean=d_min, sd=SE))
}



# Inputs:
#   s: The standard error of the metric with N=1 in each group
#   d_min: The practical significance level
#   Ns: The sample sizes to try
#   alpha: The desired alpha level of the test
#   beta: The desired beta level of the test
# Returns: The smallest N out of the given Ns that will achieve the desired
#          beta. There should be at least N samples in each group of the experiment.
#          If none of the given Ns will work, returns -1. N is the number of
#          samples in each group.

required_size = function(s, d_min, Ns=1:500000, alpha=0.05, beta=0.2) {
    for (N in Ns) {
        if (get_beta(get_z_star(alpha), s, d_min, N) <= beta) {
            return(N)
        }
    }
    
    return(-1)
}

#for retention 
d_min_retention <- 0.01
p_payment_enroll <- 0.53
n <- 1
SE_retention <- sqrt(p_payment_enroll*(1-p_payment_enroll)/n)
retention_req_size <- required_size(s = SE_retention, 
                                    d_min = d_min_retention, 
                                    alpha = alpha,
                                    beta = 0.2)
retention_req_size
## [1] 19552
# For net conversion
d_min_net_conversion <- 0.0075
p_payment_click <- 0.1093125
n <- 1
SE_net_conversion <- sqrt(p_payment_click*(1-p_payment_click)/n)
net_conversion_req_size <- required_size(s = SE_net_conversion, 
                                    d_min = d_min_net_conversion, 
                                    alpha = alpha,
                                    beta = 0.2)
net_conversion_req_size
## [1] 13586
# For gross conversion
d_min_gross_conversion <- 0.01 
p_enroll_click <- 0.20625
n <- 1 
SE_gross_conversion <- sqrt(p_enroll_click*(1-p_enroll_click)/n)
gross_conversion_req_size <- required_size(s = SE_gross_conversion, 
                                    d_min = d_min_gross_conversion, 
                                    alpha = alpha,
                                    beta = 0.2)
gross_conversion_req_size
## [1] 12850
# Pageviews needed if retention is kept but its practical significance is raised
# to 0.02 (roughly 9,780 enrollments per group from the calculator): double the
# enrollments, divide by gross conversion to get clicks, then by CTP to get pageviews
(2* 9780 / 0.20625) / 0.08
## [1] 1185455
# Days needed for the net-conversion pageview requirement with 60% of the traffic
685326/(40000*0.6)
## [1] 28.55525

As we can see, the sample sizes shrink! Which ones should we use?

If I use the former calculations, the number of unique cookies needed for the whole experiment (both groups) is as follows:

Metric              Sample Size    Required Overview Cookies
Net Conversion      27,413         685,325
Retention           39,115         4,741,212
Gross Conversion    25,835         645,876

To me, it seems that using the functions with the SE yields more reasonable sample sizes. Based on the above table, over 4 million unique cookies are needed for the experiment. Considering 40,000 unique cookies per day for Udacity, as stated in Baseline Values, we need almost 119 days for this experiment. Too long!
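A quick check of those durations, using the pageview totals from the table above and the 40,000 daily pageviews from the Baseline Values:

# Days required for each metric at 100% of the traffic
pageviews_needed <- c(net_conversion = 685325,
                      retention = 4741212,
                      gross_conversion = 645876)
ceiling(pageviews_needed / 40000)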

We need to reduce this duration, bearing in mind that we do not want to divert 100% of the traffic to the experiment. The reason is that the experiment may cause unexpected side effects, so it is better not to expose all traffic to it.

We could loosen the power of the test and increase the alpha for retention, since this metric is our bottleneck. Or we could increase its practical significance to 2% instead of the default 1%. Then the total number of unique cookies required would be 1,185,455, and the minimum number of days would drop to 30.

Increasing the practical significance means that a change in the metric which is statistically significant, and practically significant under the old criterion, would now be deemed insignificant. Is that a good decision?

To me it would be worth trying, but the available traffic is insufficient to keep retention. So we drop retention and continue with the net and gross conversion metrics.

If we drop the retention metric, then net conversion rules the sizing, and the experiment needs at least 18 days. With 60% of the traffic, it would be 29 days. (By the way, the platform of this project at Udacity is disappointingly weak: it does not accept 685,326 as the required number of overview pageviews, but it accepts 685,325.)
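The arithmetic behind those durations, using the 685,325 overview pageviews required for net conversion:

# Days needed once net conversion rules the sizing
ceiling(685325 / 40000)          # 100% of the traffic -> 18 days
ceiling(685325 / (40000 * 0.6))  # 60% of the traffic  -> 29 days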

5.Sanity Check

The first step in validating the experiment is assessing the invariant metrics: the metrics that are expected to have more or less identical values in both the experiment and control groups.

From the metrics that we have, I have chosen the number of cookies, the number of clicks, and the click-through probability as invariant metrics. Now, using this data, I check whether the values of these metrics differ significantly between the experiment and control groups.

control <- read.csv("/Users/Shaahin/Downloads/Final Project Results - Control.csv")

experiment <- read.csv("/Users/Shaahin/Downloads/Final Project Results - Experiment.csv")

For the count metrics, we assumed that 50% of the experiment traffic goes to the experiment group and 50% goes to the control group. If we label these two groups success and failure, the model is a Bernoulli distribution. So I check whether the observed counts of the two groups could have come from a population with a 0.5 chance of success or failure, and I do this using bootstrapping to build a confidence interval.

Null hypothesis: status quo; any difference between the metric values of the two groups is due to chance. Alternative hypothesis: the difference between the metric values of the two groups is meaningful and significant, and cannot be attributed to random chance.

It is said in the course that we should work with the fraction of the total that falls in the control group. One could also use the difference between the counts of the two groups, or the size of one group relative to the other; there are several equivalent ways anyway.

Here I calculate the fraction of total pageviews that landed in the control group:

alpha <- 0.05

experiment_pageviews <- experiment$Pageviews
total_exp_pview <- sum(experiment_pageviews)
total_exp_pview
## [1] 344660
control_pageviews <- control$Pageviews
total_cntl_pview <- sum(control_pageviews)
total_cntl_pview
## [1] 345543
observed_ratio<- total_cntl_pview/
        (total_exp_pview+total_cntl_pview)
observed_ratio
## [1] 0.5006397
pool <- c(rep(x= 1,total_cntl_pview),rep(0,total_exp_pview))
ratio_vector <- vector(length=10000)

#sum(pool)/length(pool)

# Calculation Using for-loop
#----------------------------
# for (i in 1:10000){
#         pool_resample <- sample(pool,
#                                 size = length(pool),
#                                 replace = TRUE)
#         
#         ratio_vector[i] <- sum(pool_resample)/length(pool)
# }
# 
# hist(ratio_vector)
# abline(v = observed_ratio)
# --------------------------
# Calculation Using lapply()
#----------------------------
# t<- lapply(1:10000 , function(i){
#         pool_resample <- sample(pool,
#                                 size = length(pool),
#                                 replace = TRUE)
#         
#         ratio_vector[i] <- sum(pool_resample)/length(pool)
#         #ratio_vector[i]
# })
# 
# t <- unlist(t)
# hist(t)
# -----------------------------
#Calculation using parallel computing 
#----------------------------

# Calculate the number of cores to use
no_cores <- detectCores() - 1

# mclapply() forks the current R session, so no explicit cluster object is needed;
# the value of the assignment below is what each iteration returns
t_parallel<- mclapply(X = 1:10000 , FUN = function(i){
        pool_resample <- sample(pool,
                                size = length(pool),
                                replace = TRUE)
        
        # proportion of resampled pageviews that belong to the control group
        ratio_vector[i] <- sum(pool_resample)/length(pool)
}, mc.cores = no_cores )

t_parallel <- unlist(t_parallel)
t_parallel <- sort(t_parallel,decreasing = FALSE)
lower_bound <- t_parallel[10000*alpha/2]
upper_bound <- t_parallel[10000 - (10000*alpha/2)]

hist(t_parallel, main = "Proportion of #cookies" )
abline(v =0.5 , col = "blue" )
abline(v = lower_bound, col = "red")
abline(v = upper_bound, col = "red")

As can be seen, the split of cookies between the two groups passes the sanity check. In other words, the difference between the number of cookies in the control and experiment groups is due to random chance and is not significant.
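As a cross-check on the bootstrap, the same conclusion follows analytically from the binomial model, using the pageview totals computed above:

# Under H0 each pageview lands in the control group with probability 0.5
total_pviews <- total_cntl_pview + total_exp_pview
se_pool <- sqrt(0.5 * 0.5 / total_pviews)
0.5 + c(-1, 1) * qnorm(1 - alpha/2) * se_pool  # 95% CI around 0.5
observed_ratio                                 # 0.5006, which falls inside the CI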

Now we can check the significance of the number of clicks in exactly the same manner.

alpha <- 0.05

experiment_clicks <- experiment$Clicks
total_exp_clicks <- sum(experiment_clicks)
total_exp_clicks
## [1] 28325
control_clicks <- control$Clicks
total_cntl_clicks <- sum(control_clicks)
total_cntl_clicks
## [1] 28378
observed_ratio_clicks <- total_cntl_clicks/
       ( total_cntl_clicks+total_exp_clicks)

observed_ratio_clicks
## [1] 0.5004673

Here again we can use bootstrapping to check whether the above value is significantly different from what is expected to be seen, i.e. 0.5.

pool_clicks <- c(rep(x= 1,total_cntl_clicks),
                 rep(0,total_exp_clicks))
ratio_vector <- vector(length=10000)


t_parallel<- mclapply(X = 1:10000 , FUN = function(i){
        pool_resample <- sample(pool_clicks,
                                size = length(pool_clicks),
                                replace = TRUE)
        
        ratio_vector[i] <- sum(pool_resample)/length(pool_clicks)
        #ratio_vector[i]
} )

t_parallel <- unlist(t_parallel)
t_parallel <- sort(t_parallel,decreasing = FALSE)
lower_bound <- t_parallel[10000*alpha/2]
upper_bound <- t_parallel[10000 - (10000*alpha/2)]

hist(t_parallel, xlab = "Proportion of #clicks")
abline(v =0.5 , col = "blue" )
abline(v = lower_bound, col = "red")
abline(v = upper_bound, col = "red")

Again it is clear that clicks are split between the experiment and control groups in a way consistent with random assignment.

The last sanity check is the evaluation of the click-through probability. Bootstrapping helps in this case too. What we want to measure is the significance of the difference between the click-through probabilities of the experiment and control groups.

One approach is to calculate the day-by-day difference vector and then build a confidence interval around its average using resampling.

experiment_ctp <- experiment$Clicks / experiment$Pageviews
control_ctp <- control$Clicks / control$Pageviews

diff_ctp <- round(experiment_ctp,4) - round(control_ctp,4)
obs_diff_avg <- mean(diff_ctp)
obs_diff_avg
## [1] 6.486486e-05

Now the bootstrapping.

diff_avg_vector <- vector(length=10000)


ctp_parallel<- mclapply(X = 1:10000 , FUN = function(i){
        pool_resample <- sample(diff_ctp,
                                size = length(diff_ctp),
                                replace = TRUE)
        
        diff_avg_vector[i] <- mean(pool_resample)
 
} )

ctp_parallel <- unlist(ctp_parallel)
ctp_parallel <- sort(ctp_parallel,decreasing = FALSE)
lower_bound <- ctp_parallel[10000*alpha/2]
upper_bound <- ctp_parallel[10000 - (10000*alpha/2)]

hist(ctp_parallel, xlab = "CTP diff between two groups")
abline(v =0 , col = "blue" )
abline(v = lower_bound, col = "red")
abline(v = upper_bound, col = "red")

As can be seen, 0 lies inside the CI, so the difference between the two groups is insignificant.

6.Effect Size Tests

Since all sanity checks passed and the experiment data is validated to some extent, we can move on and analyze the evaluation metrics.

What we want to do is assess whether the differences in the evaluation metrics between the control and experiment groups are significant or not. Previously, we chose gross conversion and net conversion as the two evaluation metrics. We expect the gross conversion to be lower in the experiment group, since the number of enrollments should drop, and in contrast we expect no significant difference between the net conversion rates of the experiment and control groups.

experiment_net_c <- experiment$Payments/experiment$Clicks
experiment_net_c <- experiment_net_c[!is.na(experiment_net_c)]

control_net_c <- control$Payments/control$Clicks
control_net_c <- control_net_c[!is.na(control_net_c)]

diff_net_c <- experiment_net_c - control_net_c
avg_net_c <- mean(diff_net_c)
avg_net_c
## [1] -0.004896857

Interestingly, the observed difference is negative, i.e. the net conversion rate in the experiment group is slightly lower than in the control group.

Now we can use bootstrapping to check whether the observed difference is compatible with zero at the given alpha, by building a confidence interval around it.

net_c_avg_vector <- vector(length=10000)

pool <- c(experiment_net_c, control_net_c)

net_c_parallel<- mclapply(X = 1:10000 , FUN = function(i){

        # exp_resample <- sample(pool,
        #                         size = length(experiment_net_c),
        #                         replace = TRUE)
        # cntl_resample <- sample(pool,
        #                         size = length(experiment_net_c),
        #                         replace = TRUE)
        # net_c_avg_vector[i] <- mean(exp_resample - cntl_resample)
        
        
        resample <- sample(diff_net_c,
                           size = length(diff_net_c),
                           replace = TRUE)
        net_c_avg_vector[i] <- mean(resample)
        
        
        
 
} )

net_c_parallel <- unlist(net_c_parallel)
net_c_parallel <- sort(net_c_parallel,decreasing = FALSE)
lower_bound <- net_c_parallel[10000*alpha/2]
upper_bound <- net_c_parallel[10000 - (10000*alpha/2)]

hist(net_c_parallel, xlab = "Net C diff between two groups")
abline(v =0 , col = "blue" )
abline(v = lower_bound, col = "red")
abline(v = upper_bound, col = "red")

As can be seen, the difference between the net conversion rates of the experiment and control groups is insignificant. This is exactly what we expected.

The second metric is the gross conversion rate. The analysis is exactly the same as above, but here we expect a significant result: specifically, that this metric is significantly lower in the experiment group compared to the control group.

experiment_gross_c <- experiment$Enrollments/experiment$Clicks
experiment_gross_c <- experiment_gross_c[!is.na(experiment_gross_c)]

control_gross_c <- control$Enrollments/control$Clicks
control_gross_c <- control_gross_c[!is.na(control_gross_c)]

diff_gross_c <- experiment_gross_c - control_gross_c
avg_gross_c <- mean(diff_gross_c)
avg_gross_c
## [1] -0.02078458

The observed difference is negative, showing that the metric is lower in the experiment group than in the control group.

Now the bootstrap confidence interval around the observed difference.

gross_c_avg_vector <- vector(length = 10000)

gross_c_parallel<- mclapply(X = 1:10000 , FUN = function(i){

        resample <- sample(diff_gross_c,
                                size = length(diff_gross_c),
                                replace = TRUE)
        
        
        gross_c_avg_vector[i] <- mean(resample)
 
} )

gross_c_parallel <- unlist(gross_c_parallel)
gross_c_parallel <- sort(gross_c_parallel,decreasing = FALSE)
lower_bound <- gross_c_parallel[10000*alpha/2]
upper_bound <- gross_c_parallel[10000 - (10000*alpha/2)]

hist(gross_c_parallel, xlab = "Gross Conv diff between two groups")
abline(v =0 , col = "blue" )
abline(v = lower_bound, col = "red")
abline(v = upper_bound, col = "red")

As can be seen, 0 lies outside the CI, which shows that the observed result is statistically significant. Also, since the whole interval lies to the left of 0, our expectation is met: the gross conversion rate of the experiment group is lower than that of the control group.
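As a further cross-check on the bootstrap CI (a sketch, not part of the main analysis above), a pooled-SE confidence interval of the kind used in the course's analytic approach, computed over the days that have enrollment data, points the same way:

# Pooled analytic CI for the gross-conversion difference (experiment - control)
complete <- !is.na(control$Enrollments) & !is.na(experiment$Enrollments)
x_cont <- sum(control$Enrollments[complete]);    n_cont <- sum(control$Clicks[complete])
x_exp  <- sum(experiment$Enrollments[complete]); n_exp  <- sum(experiment$Clicks[complete])

p_pool <- (x_cont + x_exp) / (n_cont + n_exp)
se_pool <- sqrt(p_pool * (1 - p_pool) * (1/n_cont + 1/n_exp))
d_hat <- x_exp/n_exp - x_cont/n_cont
d_hat + c(-1, 1) * qnorm(1 - alpha/2) * se_pool  # should exclude 0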

Sign Test

The sign test basically checks whether the signs (positive or negative) of the day-by-day differences in a metric between the two groups are distributed in a meaningful way over the days of the experiment. For instance, if on every single day of the experiment the gross conversion rate is lower in the experiment group than in the control group, that further assures us that this metric really was reduced by the experiment.

To check the significance of the signs, it is possible to use bootstrapping as well. The null hypothesis is that the proportion of days with a negative sign is not different from 0.5, i.e. any deviation from 0.5 is random and insignificant. Thus, this is a one-sample proportion test.
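Base R also provides an exact version of this test. As a quick cross-check (a sketch, using the day-level difference vectors computed earlier), stats::binom.test compares the number of negative days against a fair-coin null:

# Exact binomial sign tests against p = 0.5
binom.test(x = sum(diff_net_c < 0),   n = length(diff_net_c),   p = 0.5)
binom.test(x = sum(diff_gross_c < 0), n = length(diff_gross_c), p = 0.5)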

For the net conversion rate sign test, I take a negative sign as the event of interest and check whether the observed number of negative days is statistically significant or not.

net_c_sign <- diff_net_c < 0 

#the observed proportion of negative signs
net_c_prop<- sum(net_c_sign)/length(net_c_sign)
net_c_prop
## [1] 0.5652174
net_c_sign_vector <- vector(length = 10000)
net_c_sign_parallel<- mclapply(X = 1:10000 , FUN = function(i){

        resample <- sample(net_c_sign,
                                size = length(net_c_sign),
                                replace = TRUE)
        
        
        net_c_sign_vector[i] <- sum(resample)/length(resample)
 
} )

net_c_sign_parallel <- unlist(net_c_sign_parallel)
net_c_sign_parallel <- sort(net_c_sign_parallel,decreasing = FALSE)
lower_bound <- net_c_sign_parallel[10000*alpha/2]
upper_bound <- net_c_sign_parallel[10000 - (10000*alpha/2)]

hist(x = net_c_sign_parallel, xlab = "Net conversion sign CI")
abline(v =0.5 , col = "blue" )
abline(v = lower_bound, col = "red")
abline(v = upper_bound, col = "red")

So the above graph shows that the observed sign data for the net conversion is entirely unremarkable; there is nothing exceptional at the chosen significance level. The CI built from the observed data includes the 0.5 proportion, so there is no reason to doubt that this data comes from a population with a proportion of 0.5.

We do the same for the gross conversion rate.

gross_c_sign <- diff_gross_c < 0 

#the observed number of negative signs
gross_c_prop <- sum(gross_c_sign)/length(gross_c_sign)
gross_c_prop
## [1] 0.826087
gross_c_sign_vector <- vector(length = 10000)
gross_c_sign_parallel<- mclapply(X = 1:10000 , FUN = function(i){

        resample <- sample(gross_c_sign,
                                size = length(gross_c_sign),
                                replace = TRUE)
        
        
        gross_c_sign_vector[i] <- sum(resample)/length(resample)
 
} )

gross_c_sign_parallel <- unlist(gross_c_sign_parallel)
gross_c_sign_parallel <- sort(gross_c_sign_parallel,
                              decreasing = FALSE)
lower_bound <- gross_c_sign_parallel[10000*alpha/2]
upper_bound <- gross_c_sign_parallel[10000 - (10000*alpha/2)]

hist(gross_c_sign_parallel, xlab = "Gross conversion sign CI")
abline(v =0.5 , col = "blue" )
abline(v = lower_bound, col = "red")
abline(v = upper_bound, col = "red")

The gross conversion rate tells a different story from the net conversion rate. The above graph shows that the signs of the gross conversion differences are not coming from a population with proportion 0.5, so we reject this null hypothesis in favor of the alternative.

The proportion of negative signs in the gross conversion data is significantly higher than 0.5, and this further assures us that the gross conversion rate of the experiment group is significantly lower than that of the control group.

7.Recommendation

I would run the experiment. If possible, I would run it twice to be confident about the results. Either way, after running the experiment we can monitor the results, and since this change is not risky, there is not much harm if its effect turns out to be less significant than we expect.