1 Business question

Challenge Description

Company XYZ is a worldwide e-commerce site with localized versions of the site. A data scientist at XYZ noticed that Spain-based users have a much higher conversion rate than users from any other Spanish-speaking country. She therefore talked to the international team in charge of Spain and LatAm to see if they had any ideas about why that was happening. The Spain and LatAm country manager suggested that one reason could be translation. All Spanish-speaking countries had the same translation of the site, which was written by a Spaniard. They agreed to try a test where each country would have its own translation written by a local. That is, Argentinian users would see a translation written by an Argentinian, Mexican users by a Mexican, and so on. Obviously, nothing would change for users from Spain. After running the test, however, they were really surprised because the test was negative, i.e., it appeared that the non-localized translation was doing better! We are asked to:

Confirm that the test is actually negative. That is, it appears that the old version of the site, with just one translation across Spain and LatAm, performs better.
Explain why that might be happening. Are the localized translations really worse?
If you identified what was wrong, design an algorithm that returns FALSE if the same problem happens in the future and TRUE if everything is good and the results can be trusted.

2 Loading libraries and data

rm(list=ls())
library(readr) # Efficient reading of CSV data.
library(dplyr) # Data wrangling
library(magrittr) # Pipes %>% , %<>% 
library(ggplot2) # Visualise data
library(caret) # Build models
library(pROC) # Calculate AUC
library(lubridate) # Dates and time.
library(gridExtra) # Combine plots
library(forcats) # Factor reverse fct_rev()
library(ROSE) # Sampling imbalanced data
library(tidyr) #gather
library(rpart)

setwd("E:/DS Project/data science take home test/TakeHomeDataChallenges-master/02.TranslateABTest _data_done")
test <- read_csv("test_table.csv")
user <- read_csv("user_table.csv")

3 Data processing

3.1 Data structure

test_table

The dataset contains 453,321 observations with 9 variables. Some notable descriptions of the test table are:

source: marketing channel: Ads, SEO, Direct. Direct means everything except for ads and SEO.
ads_channel: if the marketing channel is ads, this is the site where the ad was displayed.
test: users are randomly split into test (1) and control (0). Test users see the new translation and control users the old one. For Spain-based users, this is obviously always 0 since there is no change there.
Other columns include browser_language, browser and conversion.

glimpse(test)

## Observations: 453,321
## Variables: 9
## $ user_id          <int> 315281, 497851, 848402, 290051, 548435, 54067...
## $ date             <date> 2015-12-03, 2015-12-04, 2015-12-04, 2015-12-...
## $ source           <chr> "Direct", "Ads", "Ads", "Ads", "Ads", "Direct...
## $ device           <chr> "Web", "Web", "Web", "Mobile", "Web", "Mobile...
## $ browser_language <chr> "ES", "ES", "ES", "Other", "ES", "ES", "Other...
## $ ads_channel      <chr> NA, "Google", "Facebook", "Facebook", "Google...
## $ browser          <chr> "IE", "IE", "Chrome", "Android_App", "FireFox...
## $ conversion       <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, ...
## $ test             <int> 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, ...

user_table

It contains 452,867 observations with information about user_id, sex, age and country. The user_id column is the key used to match rows with the test table.

glimpse(user)

## Observations: 452,867
## Variables: 4
## $ user_id <int> 765821, 343561, 118744, 987753, 554597, 62371, 987967,...
## $ sex     <chr> "M", "F", "M", "F", "F", "M", "M", "F", "F", "M", "F",...
## $ age     <int> 20, 27, 23, 27, 20, 29, 31, 29, 29, 21, 49, 29, 24, 20...
## $ country <chr> "Mexico", "Nicaragua", "Colombia", "Venezuela", "Spain...

Now I will join the two tables to obtain the full information. The test table has 454 more unique users than the user table, so the demographic information of those users is missing. Since the test table holds the result of the test, I keep every user_id from it with left_join().

After that, I remove all users from Spain because nothing changed for them. Note that filter(country != "Spain") also drops rows where country is NA, so the users without a match in the user table are removed at this step as well.

length(unique(test$user_id)) - length(unique(user$user_id))

## [1] 454

ds <-left_join(test,user,by = "user_id")
ds %<>% filter(country != "Spain")
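
A small sketch of the join and filter step on made-up data (the tables test_demo and user_demo are hypothetical, not the challenge data). It shows how anti_join() lists the users that have no match, and how filter() silently drops rows whose country is NA unless they are kept explicitly:

```r
library(dplyr)

# Hypothetical toy tables: user 4 is in the test table but not the user table.
test_demo <- tibble(user_id = 1:4, conversion = c(0L, 1L, 0L, 0L))
user_demo <- tibble(user_id = 1:3, country = c("Spain", "Mexico", "Chile"))

# Users present in the test table without a match in the user table.
unmatched <- anti_join(test_demo, user_demo, by = "user_id")

joined <- left_join(test_demo, user_demo, by = "user_id")

# country != "Spain" evaluates to NA for user 4, and filter() drops NA rows.
kept_default <- filter(joined, country != "Spain")

# Keep the unmatched user explicitly if that is what is wanted.
kept_with_na <- filter(joined, is.na(country) | country != "Spain")
```

Whether the unmatched users should be kept is a judgment call; without a country they cannot be used in any per-country comparison.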


3.2 Splitting ratio

The split between control and test users is about 46% to 54%, which is not far from a 50/50 split ratio.

table(ds$test)/nrow(ds)

## 
##         0         1 
## 0.4620243 0.5379757
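
With this many users, even a split that looks close to even can be checked formally. The sketch below rebuilds the group counts from the figures printed in this report (401085 users after removing Spain, test share 0.5379757) and runs a chi-squared goodness-of-fit test. The 50/50 target allocation is an assumption; the challenge only says users are randomly split.

```r
# Figures taken from the output above; the 50/50 target is an assumption.
n_total    <- 401085
share_test <- 0.5379757
n_test     <- round(n_total * share_test)
n_control  <- n_total - n_test

# Chi-squared goodness-of-fit test against an even split.
srm <- chisq.test(c(n_control, n_test), p = c(0.5, 0.5))
srm$p.value
```

A very small p-value here means the overall split is unlikely under a fair 50/50 assignment, which is a first hint that something may be off with the randomization.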

4 Data visualization

If users are assigned at random, the split ratio should be about the same across all segments and close to the overall ratio above. In this part, we will see that this does not hold for country and age. In detail, for Argentina and Uruguay, 80% of the users were part of the test group. In the age distribution, the split ratios are far from even for users older than 50 years.
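
The same per-segment share can also be computed as a plain table before plotting. The helper below is a minimal base R sketch; test_share_by is a hypothetical name and the demo data frame is made up for illustration.

```r
# Hypothetical helper: share of users assigned to the test group
# within each level of one covariate (column name passed as a string).
test_share_by <- function(data, var) {
  shares <- tapply(data$test, data[[var]], mean)
  data.frame(level = names(shares),
             test_share = as.numeric(shares),
             stringsAsFactors = FALSE)
}

# Made-up example: country A is 75% test, country B is 40% test.
demo <- data.frame(
  country = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
  test    = c(1, 1, 1, 0, 1, 0, 1, 0, 0)
)
res <- test_share_by(demo, "country")
```

On the real data this would be called as test_share_by(ds, "country"), and likewise for each other covariate; levels whose share is far from the overall test share are the ones worth inspecting.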

4.1 Source distribution

ds %>%  
  group_by(test,source) %>% 
  summarize(count = n()) %>% 
  spread(test,count) %>% 
  rename(Test = '1',Control = '0') %>%  
  mutate(Test = Test/(Test + Control),
         Control = 1- Test) %>% 
  gather(Group,count,-source) %>% 
  ggplot(aes(source,count,fill = Group))+geom_bar(stat= "identity",position = 'dodge')+
  labs(title="Source distribution",
       x = NULL,
       y = NULL)+
  theme(legend.position = "bottom")

4.2 Device distribution

ds %>%  
  group_by(test,device) %>% 
  summarize(count = n()) %>% 
  spread(test,count) %>% 
  rename(Test = '1',Control = '0') %>%  
  mutate(Test = Test/(Test + Control),
         Control = 1- Test) %>% 
  gather(Group,count,-device) %>% 
  ggplot(aes(device,count,fill = Group))+geom_bar(stat= "identity",position = 'dodge') + 
  labs(title="Device distribution",
       x = NULL,
       y = NULL)+
  theme(legend.position = "bottom")

4.3 Browser language distribution

ds %>%  
  group_by(test,browser_language) %>% 
  summarize(count = n()) %>% 
  spread(test,count) %>% 
  rename(Test = '1',Control = '0') %>%  
  mutate(Test = Test/(Test + Control),
         Control = 1- Test) %>% 
  gather(Group,count,-browser_language) %>% 
  ggplot(aes(browser_language,count,fill = Group))+geom_bar(stat= "identity",position = 'dodge')+
    labs(title="Browser language distribution",
       x = NULL,
       y = NULL)+
  theme(legend.position = "bottom")

4.4 Ads channel distribution

ds %>%  
  group_by(test,ads_channel) %>% 
  summarize(count = n()) %>% 
  spread(test,count) %>% 
  rename(Test = '1',Control = '0') %>%  
  mutate(Test = Test/(Test + Control),
         Control = 1- Test) %>% 
  gather(Group,count,-ads_channel) %>% 
  ggplot(aes(ads_channel,count,fill = Group))+geom_bar(stat= "identity",position = 'dodge')+
  labs(title="Ads channel distribution",
       x = NULL,
       y = NULL)+
  theme(legend.position = "bottom")

4.5 Browser distribution

ds %>%  
  group_by(test,browser) %>% 
  summarize(count = n()) %>% 
  spread(test,count) %>% 
  rename(Test = '1',Control = '0') %>%  
  mutate(Test = Test/(Test + Control),
         Control = 1- Test) %>% 
  gather(Group,count,-browser) %>% 
  ggplot(aes(browser,count,fill = Group))+geom_bar(stat= "identity",position = 'dodge')+
  labs(title="Browser distribution",
       x = NULL,
       y = NULL)+
  theme(legend.position = "bottom")

4.6 Gender distribution

ds %>%  
  group_by(test,sex) %>% 
  summarize(count = n()) %>% 
  spread(test,count) %>% 
  rename(Test = '1',Control = '0') %>%  
  mutate(Test = Test/(Test + Control),
         Control = 1- Test) %>% 
  gather(Group,count,-sex) %>% 
  ggplot(aes(sex,count,fill = Group))+geom_bar(stat= "identity",position = 'dodge')+
    labs(title="Gender distribution",
       x = NULL,
       y = NULL)+
  theme(legend.position = "bottom")

4.7 Age distribution

ds %>%  
  group_by(test,age) %>% 
  summarize(count = n()) %>% 
  spread(test,count) %>% 
  rename(Test = '1',Control = '0') %>%  
  mutate(Test = Test/(Test + Control),
         Control = 1- Test) %>% 
  gather(Group,count,-age) %>% 
  ggplot(aes(age,count,fill = Group))+geom_bar(stat= "identity",position = 'dodge')+
  labs(title="Age distribution",
       x = NULL,
       y = NULL)+
  theme(legend.position = "bottom")

## Warning: Removed 8 rows containing missing values (geom_bar).

4.8 Country distribution

ds %>%  
  group_by(test,country) %>% 
  summarize(count = n()) %>% 
  spread(test,count) %>% 
  rename(Test = '1',Control = '0') %>%  
  mutate(Test = Test/(Test + Control),
         Control = 1- Test) %>% 
  gather(Group,count,-country) %>% 
  ggplot(aes(country,count,fill = Group))+geom_bar(stat= "identity",position = 'dodge')+
  labs(title="Country distribution",
       x = NULL,
       y = NULL)+
  theme(legend.position = "bottom")+
  coord_flip()

4.9 Day distribution

ds %>%  
  group_by(test,date) %>% 
  summarize(count = n()) %>% 
  spread(test,count) %>% 
  rename(Test = '1',Control = '0') %>%  
  mutate(Test = Test/(Test + Control),
         Control = 1- Test) %>% 
  gather(Group,count,-date) %>% 
  ggplot(aes(date,count,fill = Group))+geom_bar(stat= "identity",position = 'dodge')+
  labs(title="Day distribution",
       x = NULL,
       y = NULL)+
  theme(legend.position = "bottom")+
  coord_flip()

4.10 Mean conversion rate by country by test/control group

ds %>%  
  group_by(test,country) %>% 
  summarize(count = mean(conversion)) %>% 
  spread(test,count) %>% 
  rename(Test = '1',Control = '0') %>%  
  # Keep the raw mean conversion rates; rescaling them to shares would not match the plot title
  gather(Group,count,-country) %>% 
  ggplot(aes(country,count,fill = Group))+geom_bar(stat= "identity",position = 'dodge')+
  labs(title="Mean conversion rate by country by test/control group",
       x = NULL,
       y = NULL)+
  theme(legend.position = "bottom")+
  coord_flip()

5 Confirm that the test is actually negative ?

First, I calculate the average conversion rate for the test and control groups.

ds %>% 
  group_by(test) %>% 
  summarize(mean_conversion = mean(conversion))

## # A tibble: 2 x 2
##    test mean_conversion
##   <int>           <dbl>
## 1     0      0.04829179
## 2     1      0.04341116
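
From these two rates, the absolute and relative drops can be computed directly. This is only the arithmetic on the figures printed above:

```r
# Conversion rates copied from the table above.
control_rate <- 0.04829179
test_rate    <- 0.04341116

absolute_drop <- control_rate - test_rate      # as a proportion (about 0.5 percentage points)
relative_drop <- absolute_drop / control_rate  # as a share of the control rate
round(c(absolute = absolute_drop, relative = relative_drop), 4)
```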

The result shows that the conversion rate drops from 4.8% to 4.3%, a relative drop of about 10%. I will run a t-test to check whether this difference is statistically significant.

t.test(ds$conversion[ds$test==1],ds$conversion[ds$test==0],alternative = "less")

## 
##  Welch Two Sample t-test
## 
## data:  ds$conversion[ds$test == 1] and ds$conversion[ds$test == 0]
## t = -7.3539, df = 385260, p-value = 9.645e-14
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##          -Inf -0.003788971
## sample estimates:
##  mean of x  mean of y 
## 0.04341116 0.04829179

Since the p-value is below the 0.05 threshold, we can reject the null hypothesis and say that the test is actually negative, unless the experiment design is biased or there is not enough data.

It is important to note that, in the data visualization part, the test/control split is not the same across countries and ages. This is a sign of a wrongly designed experiment, so let’s analyze further.

6 Explain why that might be happening. Are the localized translations really worse?

6.1 Enough data ?

#p1: mean conversion rate for control group
#p2: mean conversion rate for test group
power.prop.test(p1=0.04829179,p2=0.04341116, sig.level = 0.05,power=0.8)

## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 28829.52
##              p1 = 0.04829179
##              p2 = 0.04341116
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

This means that we need about 28,830 observations in each group to detect a difference between these two conversion rates with 80% power at the 5% significance level. For a test with one variation and one original, this means about 28,830 * 2 = 57,660 observations in total. Our dataset contains 401,085 observations, so the sample size is large enough.
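
The calculation can also be turned around: given the group size we actually have, what power does the test reach? The sketch below derives the control group size from figures printed in this report and, as a conservative assumption, uses that smaller size for both groups.

```r
# Group size derived from figures in the text: 401085 users, 46.20243% control.
n_control <- round(401085 * 0.4620243)

# Power of a two-sided test at the observed rates; using the smaller
# (control) group size for both groups is a conservative assumption.
achieved <- power.prop.test(n = n_control,
                            p1 = 0.04829179, p2 = 0.04341116,
                            sig.level = 0.05)
achieved$power
```

A power very close to 1 supports the statement that sample size is not the limiting factor here.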

ds %>% 
  group_by(date) %>% 
  summarize(ratio_convertion = mean (conversion[test==1])/mean (conversion[test==0])) %>% 
  ggplot(aes(date,ratio_convertion))+
  geom_line(col = "black")+
  geom_point(col = "red")+
  labs(title = "Ratio of mean conversion rate between test and control group",
       x = NULL,
       y = NULL)

Moreover, although the ratio of mean conversion rates between the two groups fluctuates from day to day, the amount of data is not the problem. However, it is important to note that the test ran for only 5 days, so it may not capture weekly or seasonal patterns, which should be accounted for in the test.

6.2 Selection bias ?

In an ideal world, the distribution of people in test and control should be the same for each segment. However, from the visualization part, we see that the split is far from even for Argentina and Uruguay. So there appears to be a selection bias by country.

We can verify the bias by building a decision tree to classify test group. If the data was split evenly between test and control groups then we shouldn’t see any splits on the tree or they should be at least close to even.

tree <- rpart(test~., ds, control = rpart.control(maxdepth = 2))
tree

## n= 401085 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 401085 99692.820 0.5379757  
##   2) country=Bolivia,Chile,Colombia,Costa Rica,Ecuador,El Salvador,Guatemala,Honduras,Mexico,Nicaragua,Panama,Paraguay,Peru,Venezuela 350218 87553.970 0.4987693 *
##   3) country=Argentina,Uruguay 50867  7894.097 0.8079108 *

We see that for Argentina and Uruguay, 80% of the users were part of the test group. Thus, there is a clear selection bias here, which led the team to incorrect conclusions. Now, let’s check the test results after controlling for country, to correct for the bias caused by countries being over- or under-represented in each of the experiment groups.

ds %>% 
  group_by(country) %>% 
  summarize(test_mean_conversion = mean (conversion[test==1]),
            control_mean_conversion = mean (conversion[test==0]),
            p_value = t.test (conversion[test==1],conversion[test==0])$p.value)

## # A tibble: 16 x 4
##        country test_mean_conversion control_mean_conversion   p_value
##          <chr>                <dbl>                   <dbl>     <dbl>
##  1   Argentina           0.01372502              0.01507054 0.3351465
##  2     Bolivia           0.04790097              0.04936937 0.7188852
##  3       Chile           0.05129502              0.04810718 0.3028476
##  4    Colombia           0.05057096              0.05208949 0.4237191
##  5  Costa Rica           0.05473764              0.05225564 0.6878764
##  6     Ecuador           0.04898842              0.04915381 0.9615117
##  7 El Salvador           0.04794689              0.05355404 0.2481267
##  8   Guatemala           0.04864721              0.05064288 0.5721072
##  9    Honduras           0.04753981              0.05090576 0.4714629
## 10      Mexico           0.05118631              0.04949462 0.1655437
## 11   Nicaragua           0.05417676              0.05264697 0.7804004
## 12      Panama           0.04937028              0.04679552 0.7053268
## 13    Paraguay           0.04922910              0.04849315 0.8836965
## 14        Peru           0.05060427              0.04991404 0.7719530
## 15     Uruguay           0.01290670              0.01204819 0.8797640
## 16   Venezuela           0.04897831              0.05034367 0.5737015

After controlling for country, the test is not significant in any country: every p-value is above the 0.05 threshold.
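
To see how an uneven split by country alone can produce a negative pooled result, here is a small illustration on made-up aggregated counts (not the challenge data). It uses logistic regression with country as a covariate, which is a different technique from the per-country t-tests above. Within each country the test and control rates are identical, but the low-converting country is over-represented in the test group.

```r
# Made-up aggregated counts (NOT the challenge data): within each country the
# conversion rate is identical in test and control, but the low-converting
# country is over-represented in the test group.
agg <- data.frame(
  country   = c("Argentina", "Argentina", "Mexico", "Mexico"),
  test      = c(1, 0, 1, 0),
  users     = c(16000, 4000, 30000, 30000),
  converted = c(240, 60, 1500, 1500)
)
agg$not_converted <- agg$users - agg$converted

# Pooled model: ignores country, so the test coefficient absorbs the imbalance.
pooled <- glm(cbind(converted, not_converted) ~ test,
              family = binomial, data = agg)

# Adjusted model: country enters as a covariate.
adjusted <- glm(cbind(converted, not_converted) ~ test + country,
                family = binomial, data = agg)

coef(pooled)["test"]    # negative: the test looks worse when country is ignored
coef(adjusted)["test"]  # essentially zero once country is controlled for
```

On the real data, a comparable model could be fitted as glm(conversion ~ test + country, family = binomial, data = ds); this is a suggestion and not part of the original analysis.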

In conclusion, because of the selection bias, we cannot draw a conclusion about the difference between the two versions. In particular, the data do not show that a localized translation made things worse, as the team had concluded.

To run a better experiment, the company should find out why users in Argentina and Uruguay are split differently between test and control, then run the experiment again. The company should make sure the translation is the only difference between the test and control groups. After that, the company can run the t-test again to see whether the conversion rate is significantly improved.
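
For the third task, one possible check reuses the decision tree idea from section 6.2: fit a shallow tree that predicts group assignment and report TRUE only if it finds no split. This is a minimal sketch under stated assumptions. The function name randomization_ok, the choice to exclude user_id and conversion, and the simulated data are all hypothetical; the tree uses the default rpart complexity parameter (cp = 0.01), so only imbalances that explain at least 1% of the variance in assignment are flagged.

```r
library(rpart)

# Hypothetical helper: returns TRUE when a shallow tree finds no covariate
# that predicts group assignment, FALSE otherwise.
randomization_ok <- function(data, group_col = "test",
                             drop_cols = c("user_id", "conversion")) {
  covariates <- setdiff(names(data), c(group_col, drop_cols))
  tree_formula <- reformulate(covariates, response = group_col)
  fit <- rpart(tree_formula, data = data,
               control = rpart.control(maxdepth = 2))
  # A tree with a single row in `frame` is just the root: no split was found.
  nrow(fit$frame) == 1
}

# Simulated data (not the challenge data).
set.seed(1)
n <- 20000
sim <- data.frame(
  country = factor(sample(c("Argentina", "Chile", "Mexico", "Peru"), n,
                          replace = TRUE)),
  age = sample(18:60, n, replace = TRUE)
)

# Fair assignment: every user has a 50% chance of being in the test group.
fair <- transform(sim, test = rbinom(n, 1, 0.5))

# Biased assignment: Argentina gets an 80% chance, as in the problem found above.
biased <- transform(sim, test = rbinom(n, 1, ifelse(country == "Argentina", 0.8, 0.5)))

randomization_ok(fair)    # should be TRUE: no split
randomization_ok(biased)  # should be FALSE: the tree splits on country
```

On the real data the call would be randomization_ok(ds). A smaller cp would make the check more sensitive, at the cost of more false alarms.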

Spanish user AB testing

Vu Huong

31 December 2017