Project Context

Aimed to identify the average order value (“purchase_amount” in the data set) determined factors and propose the sales boosting recommendatinos


Data Pre-processing

require("tidyverse")
## Loading required package: tidyverse
## ─ Attaching packages ──────────────────── tidyverse 1.2.1 ─
## ✔ ggplot2 3.1.0     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.5
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ─ Conflicts ────────────────────── tidyverse_conflicts() ─
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
test.table<-read_csv("test_table.csv")
## Parsed with column specification:
## cols(
##   user_id = col_integer(),
##   date = col_character(),
##   device = col_character(),
##   test = col_integer(),
##   purchase_amount = col_double()
## )
user.table<-read_csv("user_table.csv")
## Parsed with column specification:
## cols(
##   user_id = col_integer(),
##   gender = col_character(),
##   service = col_character(),
##   age = col_integer(),
##   country = col_character()
## )
head(test.table)
## # A tibble: 6 x 5
##   user_id date     device       test purchase_amount
##     <int> <chr>    <chr>       <int>           <dbl>
## 1    1983 2017/1/1 pc_web          0            33.7
## 2    4704 2017/1/1 android_app     1            30.2
## 3   14995 2017/1/1 ios_app         0            37.2
## 4   15089 2017/1/1 pc_web          0            49.8
## 5   17522 2017/1/1 pc_web          0            19.7
## 6     186 2017/1/1 pc_web          1            33.6
head(user.table)
## # A tibble: 6 x 5
##   user_id gender service   age country
##     <int> <chr>  <chr>   <int> <chr>  
## 1       1 male   golden     20 US     
## 2       2 male   premium    53 US     
## 3       3 female premium    44 TW     
## 4       4 male   golden     44 UK     
## 5       5 male   normal     20 US     
## 6       6 female golden     26 UK
print(paste(" The non-repetitive users which both appear in the test.table and the user.table",
            nrow(unique(inner_join(select(test.table,user_id),
                                   select(user.table, user_id), by="user_id")))))
## [1] " The non-repetitive users which both appear in the test.table and the user.table 19871"
test.data<-left_join(test.table,
                     user.table,
                     "user_id")
test.data$date<-as.Date(test.table$date, format = "%Y/%m/%d")
for (i in c(3,4,6,7,9)) {
  test.data[, i] <-as.factor(test.data[[i]])
}
test.data.jp <- test.data[test.data$country=="JP",]
head(test.data.jp)
## # A tibble: 6 x 9
##   user_id date       device     test  purchase_amount gender service   age
##     <int> <date>     <fct>      <fct>           <dbl> <fct>  <fct>   <int>
## 1    4858 2017-01-01 android_a… 1                49.9 male   golden     14
## 2   12858 2017-01-01 pc_web     1                40.2 male   premium    40
## 3   16588 2017-01-01 pc_web     0                42.2 male   premium    18
## 4    5997 2017-01-01 android_a… 0                41.3 male   premium    32
## 5    4196 2017-01-01 pc_web     1                44.8 female premium    50
## 6     680 2017-01-01 ios_app    1                28.5 male   premium    39
## # ... with 1 more variable: country <fct>

Data Analysis

Key Message 1

Recommend to adopt the old button design because the purchase amount is higher

  • Conduct the 2 sample t test to test the hypothesis : the purchase amount of test group > control group, and we accept the hypothesis
t.test(test.data.jp[test.data.jp$test==0,]$purchase_amount,
       test.data.jp[test.data.jp$test==1,]$purchase_amount,
       alternative = "greater")
## 
##  Welch Two Sample t-test
## 
## data:  test.data.jp[test.data.jp$test == 0, ]$purchase_amount and test.data.jp[test.data.jp$test == 1, ]$purchase_amount
## t = 63.113, df = 9953.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  8.825337      Inf
## sample estimates:
## mean of x mean of y 
##  43.56543  34.50391

Key Message 2

Discover 3 factors which are statistically significant to influence the purchase amount : button design, customer device and membership class

  • Conduct ANOVA to understand the if there are statistically significant differences between the following 4 factors : button design (test), customer device, customer gender and membership class (service)
aov.model<-aov(
  purchase_amount~test+device+gender+service, 
  test.data.jp
)
summary(aov.model)
##               Df Sum Sq Mean Sq  F value Pr(>F)    
## test           1 204387  204387 5434.724 <2e-16 ***
## device         2  43561   21780  579.146 <2e-16 ***
## gender         1      3       3    0.082  0.774    
## service        2  93222   46611 1239.405 <2e-16 ***
## Residuals   9950 374196      38                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Key Message 2-1 - Membership program

Recommend to reconsider the premium membership program design because the average order value is quite close to the normal group

  • Decide whether to optimize or cancel the premium membership program to reach the expected profits by designing the membership program
ggplot(test.data.jp,
       aes(x=service,y=purchase_amount, color=service))+
         geom_boxplot()+
         xlab("Membership Level") + ylab("Purchase Amount")+
         ggtitle("The dofference in avg. order value isn't prominent between the premium and the normal membership group")+
         theme_bw()

TukeyHSD(aov.model, "service")
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = purchase_amount ~ test + device + gender + service, data = test.data.jp)
## 
## $service
##                     diff       lwr       upr p adj
## normal-golden  -7.272220 -7.671011 -6.873428     0
## premium-golden -5.568421 -5.889350 -5.247493     0
## premium-normal  1.703798  1.306328  2.101268     0
plot(TukeyHSD(aov.model,"service"))

Key Message 2-2 - Button design

Recommend to adopt the old button design because the purchase amount is higher with the old button design

  • The purchase amount is ~9.06 higher with the old button design(control) than the new button(test) design
ggplot(test.data.jp,
       aes(x=test,y=purchase_amount, color=test))+
  geom_boxplot()+
  xlab("Old v.s New button design")+ylab("Purchase Amount")+
  ggtitle("Average order value is higher when adopting old button design (0:control group)")+
  theme_bw()

Also confirm the difference isn’t stem from the random error because the statiscally significance exists

  • Conduct the TukeyHSD post-hoc test to confirm there is statistical siginficance between old (control) and new button (test) design
TukeyHSD(aov.model, "test")
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = purchase_amount ~ test + device + gender + service, data = test.data.jp)
## 
## $test
##          diff       lwr       upr p adj
## 1-0 -9.061521 -9.302464 -8.820579     0
plot(TukeyHSD(aov.model, "test"))

Key Message 2-3 - Device

Recommend to more focus on the ios users aquisition and retention becuase is has the highest average order value than other devices

ggplot(test.data.jp,
       aes(x=device, y=purchase_amount, color=device))+
  geom_boxplot()+
  xlab("Device")+ylab("Purchase Amount")+
  ggtitle("Average order value is higher with ios users than pc-web and android")+
  theme_bw()

There is 95% confidence level to state the purchase amount of ios device users are higher than the android and pc web device users

  • Purchase amount of ios user is..
  • ~5.68 higher than android users
  • ~2.00 higher than pc_web users
  • Purchase amount of pc & web user is ~3.68 higher than android
TukeyHSD(aov.model, "device")
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = purchase_amount ~ test + device + gender + service, data = test.data.jp)
## 
## $device
##                          diff       lwr       upr p adj
## ios_app-android_app  5.681998  5.290501  6.073495     0
## pc_web-android_app   3.681216  3.286400  4.076033     0
## pc_web-ios_app      -2.000782 -2.323508 -1.678056     0
plot(TukeyHSD(aov.model, "device"))