Rationale

The aim of this work was to conduct A/B testing on a clickstream dataset from a website where, for one month, a single word on the webpage was changed from "tools" to "tips". The dataset records like and share actions, along with the time spent on the homepage, for each observation.

Downloading and importing data file (in the working directory)

# download.file("https://assets.datacamp.com/production/repositories/2292/datasets/b502094e5de478105cccea959d4f915a7c0afe35/data_viz_website_2018_04.csv",
#                'A_B_test.csv',
#                quiet = TRUE)

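# Packages used below (assumed to have been loaded in a hidden setup chunk)
library(tidyverse)  # read_csv(), gather()/spread(), dplyr verbs, ggplot2
library(lubridate)  # week()
library(plotly)     # plot_ly(), ggplotly()
library(scales)     # percent labels

data_file_path<-file.path(getwd(),'A_B_test.csv')

data_file<-read_csv(data_file_path)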
## Parsed with column specification:
## cols(
##   visit_date = col_date(format = ""),
##   condition = col_character(),
##   time_spent_homepage_sec = col_double(),
##   clicked_article = col_double(),
##   clicked_like = col_double(),
##   clicked_share = col_double()
## )
DT::datatable(data_file)

Exploring and visualizing the data

Time duration of the dataset

range(data_file$visit_date)
## [1] "2018-04-01" "2018-04-30"

Average conversion rate of likes and shares for condition: tools and tips

Data summary

data_file%>%
    gather(click_types,
           click_values,
           clicked_like:clicked_share)%>%
    group_by(click_types,
             condition)%>%
    summarise(conversion_rate=mean(click_values))
click_types condition conversion_rate
clicked_like tips 0.1662667
clicked_like tools 0.0690667
clicked_share tips 0.0328667
clicked_share tools 0.0300000
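
gather() and spread() still work, but tidyr now marks them as superseded by pivot_longer() and pivot_wider(). A minimal sketch of the same summary with the newer verbs follows. It runs on a small made-up data frame (toy, with hypothetical values) that only borrows the column names of data_file, so the rates it prints are not results from the experiment.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; only the column names mirror data_file
toy <- tibble(
  condition     = c("tips", "tips", "tools", "tools"),
  clicked_like  = c(1, 0, 0, 0),
  clicked_share = c(0, 0, 1, 0)
)

# Same reshape-then-summarise pattern as above, using pivot_longer()
toy_rates <- toy %>%
  pivot_longer(clicked_like:clicked_share,
               names_to  = "click_types",
               values_to = "click_values") %>%
  group_by(click_types, condition) %>%
  summarise(conversion_rate = mean(click_values), .groups = "drop")

toy_rates
```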

Data Visualization

data_file%>%
    gather(click_types,
           click_values,
           clicked_like:clicked_share)%>%
    group_by(click_types,
             condition)%>%
    summarise(conversion_rate=mean(click_values))%>%
    spread(condition,conversion_rate)%>%
    plot_ly(x = ~click_types,
            y=~tips,
            type = "bar",
            name = 'tips')%>%
    add_trace(y = ~tools, name = 'tools')%>%
    layout(barmode = 'group',
           title = 'Barchart of conversion rates of likes and shares',
           xaxis = list(title = "Click type"),
           yaxis = list(title = "Conversion rates"))

Mean weekly conversion rates by click type and condition

Obtaining the mean weekly conversion rate for each click type and condition

data_file%>%
    gather(click_types,
           click_values,
           clicked_like:clicked_share)%>%
    group_by(week(visit_date),
             condition,
             click_types)%>%
    summarise(conversion_rate=mean(click_values))%>%
    arrange(`week(visit_date)`)%>%
    DT::datatable(.)

Visualization of the above data

mean_clicks_viz<-data_file%>%
    gather(click_types,
           click_values,
           clicked_like:clicked_share)%>%
    group_by(week(visit_date),
             condition,
             click_types)%>%
    summarise(conversion_rate=mean(click_values))%>%
    ggplot(aes(x=`week(visit_date)`,
               y=conversion_rate,
               col=condition,
               group=condition))+
    geom_point(size=3)+
    geom_line(lwd=0.9)+
    scale_y_continuous(limits = c(0, 1),
                       labels = percent)+
    facet_grid(~click_types,
               scales = 'free')+
    theme_bw(base_size = 18)+
    scale_color_manual(values = c("steelblue","forestgreen"))+
    ylab("conversion rates")+
    xlab("week")


mean_clicks_viz

The plot and data summary above show that the average conversion rate for the like action differs by condition: the word tips appears to have a higher like conversion rate than the word tools. The share conversion rates are much closer between the two conditions.

Logistic regression for testing significance

We can check whether the differences in likes and shares between the two variants are statistically significant. A binary logistic regression is fitted for each action, with the click indicator (clicked_like or clicked_share) as the dependent variable and the condition (tips or tools) as the independent variable.

require(broom)
## Loading required package: broom

clicked like

logistic_reg_model1 <- glm(clicked_like ~ condition,
                           family = "binomial",
                           data = data_file) %>%
    tidy()


logistic_reg_model1
term estimate std.error statistic p.value
(Intercept) -1.6123207 0.0219300 -73.52131 0
conditiontools -0.9887948 0.0389587 -25.38057 0

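The model indicates a statistically significant difference in like conversion rates between the two variants. The coefficient for tools is negative (-0.989, p < 0.001), so the tools variant has lower odds of a like than the tips variant.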
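
Because the coefficients are on the log-odds scale, it can help to convert them back to an odds ratio and to probabilities. The sketch below does this in base R using only the estimates printed above; the variable names are illustrative and not part of the original analysis.

```r
# Estimates copied from the logistic_reg_model1 output above (log-odds scale)
intercept  <- -1.6123207   # log-odds of a like under the reference level (tips)
beta_tools <- -0.9887948   # change in log-odds when the condition is tools

# Odds ratio of tools relative to tips
odds_ratio <- exp(beta_tools)

# Predicted like probabilities for each condition (inverse logit)
p_tips  <- plogis(intercept)
p_tools <- plogis(intercept + beta_tools)

round(c(odds_ratio = odds_ratio, p_tips = p_tips, p_tools = p_tools), 4)
```

Since the model has a single binary predictor, the two predicted probabilities should match the like conversion rates in the data summary (about 0.166 for tips and 0.069 for tools).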

clicked share

logistic_reg_model2 <- glm(clicked_share ~ condition,
                           family = "binomial",
                           data = data_file) %>%
    tidy()


logistic_reg_model2
term estimate std.error statistic p.value
(Intercept) -3.3818774 0.0457966 -73.845632 0.0000000
conditiontools -0.0942213 0.0662440 -1.422337 0.1549285

The model shows no statistically significant difference in share conversion rates between the two variants (p = 0.155).

From Figure ‘a’ in subsection 2 and the subsequent logistic regression analysis (logistic_reg_model2), it can be seen that the difference in clicked_share conversion rates between the two variants (tips and tools) was not statistically significant.
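
One way to see the size of the uncertainty is a 95% Wald confidence interval for the odds ratio, built from the estimate and standard error reported for conditiontools. This is a rough sketch using only those two printed values; the variable names are illustrative.

```r
# Values copied from the conditiontools row of logistic_reg_model2
estimate  <- -0.0942213
std_error <-  0.0662440

# 95% Wald interval on the log-odds scale, then exponentiated to an odds ratio
z      <- qnorm(0.975)
ci_log <- estimate + c(-1, 1) * z * std_error
ci_or  <- exp(ci_log)

round(c(odds_ratio = exp(estimate), lower = ci_or[1], upper = ci_or[2]), 3)
```

An interval that contains 1 is consistent with the non-significant p-value reported above.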

A hypothetical follow-up experiment

Hypothetically, if the baseline conversion rate of clicked share is to be improved by, say, 5 percentage points (from about 3.2% to 8.2%), the sample size required for a follow-up experiment needs to be determined.

require(powerMediation)
## Loading required package: powerMediation

Computing the sample size

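total_sample_size <- SSizeLogisticBin(p1 = 0.032,    # baseline share rate
                                      p2 = 0.082,    # target share rate
                                      B = 0.5,       # proportion of sample in the second group
                                      alpha = 0.05,  # significance level
                                      power = 0.8)   # desired power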


total_sample_size
## [1] 673

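Therefore, a follow-up experiment would need a total of 673 observations, split roughly evenly between the two variants (B = 0.5), to detect an increase in the share conversion rate from 3.2% to 8.2% at a 5% significance level with 80% power.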
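
As a cross-check, the calculation can be reproduced by hand in base R. The sketch below assumes that SSizeLogisticBin implements the sample size formula for a single binary covariate from Hsieh, Bloch and Larsen (1998); the helper name hsieh_n is hypothetical and not part of powerMediation.

```r
# Hypothetical helper: total sample size for a logistic regression with one
# binary covariate (formula assumed to follow Hsieh, Bloch and Larsen, 1998)
hsieh_n <- function(p1, p2, B, alpha = 0.05, power = 0.8) {
  p_bar <- (1 - B) * p1 + B * p2            # overall event rate
  z_a   <- qnorm(1 - alpha / 2)             # two-sided critical value
  z_b   <- qnorm(power)
  num   <- (z_a * sqrt(p_bar * (1 - p_bar) / B) +
            z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2) * (1 - B) / B))^2
  den   <- (p1 - p2)^2 * (1 - B)
  ceiling(num / den)
}

n_total <- hsieh_n(p1 = 0.032, p2 = 0.082, B = 0.5)
n_total
```

With the inputs used above, this evaluates to 673, in agreement with the value returned by SSizeLogisticBin.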

Exploring and assessing the difference in the time spent on the website between the two variants (tips and tools)

Obtaining the mean time spent on the homepage for each condition

# Reshaping with gather() is not needed here: the mean per condition is the same
data_file%>%
    group_by(condition)%>%
    summarise(time_spent=mean(time_spent_homepage_sec,na.rm = TRUE))%>%
    arrange(desc(time_spent))
condition time_spent
tips 49.99909
tools 49.99489

Visualization of the above data

# Weekly mean time spent per condition (no reshaping needed)
time_data_week_viz<-data_file%>%
    group_by(week(visit_date),
             condition)%>%
    summarise(time_spent=mean(time_spent_homepage_sec))%>%
    ggplot(aes(x=condition,
               y=time_spent,
               fill=condition))+
    geom_boxplot(col='black',
                 lwd=0.7)+
    theme_bw(base_size = 18)+
    ylab("avg time spent (seconds)")+
    xlab("condition")+
    scale_fill_manual(values = c("steelblue",'grey50'))+
    theme(legend.position = "none")


ggplotly(time_data_week_viz)

The visualization shows the distribution of the weekly mean time spent on the homepage for each condition (‘tips’ or ‘tools’). Based on the boxplot, the medians of the weekly means differ slightly between the two conditions, while the data summary shows that the overall means are nearly identical (about 50 seconds each).

We use a t-test to check whether the difference in the time spent on the homepage between the two variants is statistically significant.

ab_experiment_results <- t.test(time_spent_homepage_sec ~ condition,
                                data = data_file)

#Results

ab_experiment_results
## 
##  Welch Two Sample t-test
## 
## data:  time_spent_homepage_sec by condition
## t = 0.36288, df = 29997, p-value = 0.7167
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.01850573  0.02691480
## sample estimates:
##  mean in group tips mean in group tools 
##            49.99909            49.99489

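From the analysis, there is no statistically significant difference in the time spent on the homepage between the two variants (p = 0.72). The 95% confidence interval for the difference in means, from -0.019 to 0.027 seconds, includes zero.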
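
The reported t statistic and p-value can be recovered from the confidence interval and degrees of freedom alone, which is a handy sanity check when only printed output is available. The sketch below uses the numbers from the t.test output above; the variable names are illustrative.

```r
# Numbers copied from the Welch t-test output above
ci_lower <- -0.01850573
ci_upper <-  0.02691480
deg_free <- 29997

# The interval is centred on the estimated difference in means, with
# half-width equal to qt(0.975, deg_free) times the standard error
diff_means <- (ci_lower + ci_upper) / 2
std_error  <- (ci_upper - ci_lower) / (2 * qt(0.975, deg_free))

t_stat  <- diff_means / std_error
p_value <- 2 * pt(-abs(t_stat), deg_free)

round(c(diff = diff_means, se = std_error, t = t_stat, p = p_value), 4)
```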