Introduction

Problem Statement

Should a change to the recommendation engine algorithm be rolled out to all WNW subscribers?

Data

Summary of Dataset Variables:

Date Gender Age Social Metric Time_signup Demographic Group Hrs watched
Date Factor Integer Integer Numeric Integer Factor Numeric
2021/07/01 F/M 18-55 1-10 19.3 1-4 A/B 4.08
  • Dataset contains 1000 rows with 8 variables.

  • Hours watched is the response/dependent variable.

  • Demographics:

    1 – Females (18-35)

    2 – Males (18-35)

    3 – Females (36-55)

    4 – Males (36-55)

# Import dataset, check for NA and amend variable formats

z1 <- read.csv("C:/Users/verno/OneDrive/Documents/RMIT/MATH2406_02/Data/A3/streaming_data.csv")

# Check for any NA in dataframe
#is.na(z1)

# Amend variables

z1 %<>%  mutate(gender = factor(gender, 
                                  levels = c("F", "M")))
#st01 %<>%  mutate(social_metric = factor(social_metric, 
#                                  levels = c(0, 1,2,3,4,5,6,7,8,9,10)))
z1 %<>% mutate(date = as.Date(date, format = "%d/%m/%Y"))
#st01 %<>% mutate(demographic = factor(demographic, 
#                                         levels = c(1,2,3,4)))
z1 %<>% mutate(group = factor(group, 
                                      levels = c("A","B")))

z1 %<>% mutate(change = case_when((date < "2021-07-18" ~ "Pre"),
                           (date >= "2021-07-18" ~ "Post")))

z_cor <- z1
z_cor %<>% select(hours_watched, age, social_metric, time_since_signup, demographic)

Correlation determination

From an investigation of the dataset, hours_watched represents the response(dependent) variable, whilst the following represent explanatory(independent) numeric variables:

  • Age
  • Social_metric
  • Time_since_signup
  • Demographic
  • Group

Each of these will be checked using the cor() function individually

options(width=80)
# Determine correlation of variables using the cor() function

print('Correlation breakdown:')
## [1] "Correlation breakdown:"
print(paste('Hours_watched & Age:',(cor(z1$hours_watched, z1$age, method = 'pearson'))))
## [1] "Hours_watched & Age: -0.572844234008351"
print(paste('Hours_watched & Social metric:',(cor(z1$hours_watched, z1$social_metric, 
                                                  method = 'pearson'))))
## [1] "Hours_watched & Social metric: 0.233024362586037"
print(paste('Hours_watched & Time since signup:',(cor(z1$hours_watched, z1$time_since_signup, 
                                                      method = 'pearson'))))
## [1] "Hours_watched & Time since signup: -0.00574651927901223"
print(paste('Hours_watched & Demographic:',(cor(z1$hours_watched, z1$demographic, 
                                                method = 'pearson'))))
## [1] "Hours_watched & Demographic: -0.432452910479834"
# Generate correlation plot between varaibles using corrplot()

colnames(z_cor) <-c("hrs_watch","age","soc_metric","time_signup","d-graphic")
corrplot(cor(z_cor), title = "Correlation Plot",line= -0.65)
## Warning in text.default(pos.xlabel[, 1], pos.xlabel[, 2], newcolnames, srt =
## tl.srt, : "line" is not a graphical parameter
## Warning in text.default(pos.ylabel[, 1], pos.ylabel[, 2], newrownames, col =
## tl.col, : "line" is not a graphical parameter

As shown above, are the correlations between hours_watched and age, social metric, time_since_signup and demographic. The size and colour of the circles designate the strength and direction of the correlation between the variables.

# Generate Pair Plot using ggpairs()

ggpairs(z_cor, title = "Pairs Plot")

It can be seen from the graphic above that there is a negative correlation between hours watched and age and demographic. The correlation between hours watched and time_signup is virtually nil. There is a slight positive correlation between hours watched and soc_metric.

Regression Model

In order to determine the optimal model for the purpose of multi-variable regression, all of the variables were input and their p-values assessed. Only those with a p-value lower than 0.05 were considered. All values from the dataset(Groups A and B) were included, as the values from both groups are used as a basis of prediction, and are independent of the recommendation algorithm.

# Generate linear model for all the variables

lm1 <- lm(hours_watched ~ gender + social_metric + time_since_signup + age + demographic, 
          data = z1)

summary(lm1)
## 
## Call:
## lm(formula = hours_watched ~ gender + social_metric + time_since_signup + 
##     age + demographic, data = z1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6799 -0.6550 -0.0196  0.7196  2.7463 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        6.350731   0.160463  39.578   <2e-16 ***
## genderM            0.151677   0.096553   1.571    0.117    
## social_metric      0.097870   0.011164   8.767   <2e-16 ***
## time_since_signup  0.002681   0.004627   0.579    0.562    
## age               -0.063859   0.006126 -10.425   <2e-16 ***
## demographic       -0.087208   0.065597  -1.329    0.184    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.054 on 994 degrees of freedom
## Multiple R-squared:  0.3779, Adjusted R-squared:  0.3748 
## F-statistic: 120.8 on 5 and 994 DF,  p-value: < 2.2e-16

As can be seen above, the only variables of significance as a result of their very small p-values are:

  • social_metric
  • age

The lm() function was run again, but this time just for social_metric and age.

# Execute linear model only for social metric and age

lm2 <- lm(hours_watched ~ social_metric + age, data = z1)

summary(lm2)
## 
## Call:
## lm(formula = hours_watched ~ social_metric + age, data = z1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7039 -0.6636 -0.0157  0.7398  2.8434 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.496610   0.131812   49.29   <2e-16 ***
## social_metric  0.097573   0.011139    8.76   <2e-16 ***
## age           -0.070786   0.003121  -22.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.054 on 997 degrees of freedom
## Multiple R-squared:  0.3762, Adjusted R-squared:  0.3749 
## F-statistic: 300.6 on 2 and 997 DF,  p-value: < 2.2e-16
lm2
## 
## Call:
## lm(formula = hours_watched ~ social_metric + age, data = z1)
## 
## Coefficients:
##   (Intercept)  social_metric            age  
##       6.49661        0.09757       -0.07079

As shown above are the coefficients for social_metric and age.

3D Scatterplot

# Generate 3D scatterplot using scatter3D()

scatter3D(x = z1$age, y = z1$hours_watched, z = z1$social_metric, colvar = z1$hours_watched,  
          pch = 19, cex = 1.5,  
          xlab = "Age", ylab = "Hours watched",  zlab = "Social Metric", theta = 60, d = 4,
          clab = c("Hours watched"),colkey = list(length = 0.5, width = 0.5, 
          cex.clab = 0.75,dist = -.08, side.clab = 3)  ,
          main = "3D Scatterplot of Hours watched , Age and Social metric")

As can be seen from the 3D scatterplot above, the highest hours watched was by the youngest subscribers with the highest social metric. The opposite applies for the older subscribers.

Sample Bias determination

  • Groups A & B were checked for the existence of any bias in the sampled data.

  • Proportions for Control(A) and Treatment(B) groups for each demographic were compared as follows:

## Check Groups for existence of any bias
# Count the numbers in each demographic category based on the A/B group
check_a_z1 <- z1 %>% 
  filter(group == 'A') %>%
  ungroup() %>%
  select(demographic) %>% 
  group_by(demographic) %>% 
  mutate(n_a = n()) %>% 
  distinct()

check_b_z1 <- z1 %>% 
  filter(group == 'B') %>%
  ungroup() %>%
  select(demographic) %>% 
  group_by(demographic) %>% 
  mutate(n_b = n()) %>% 
  distinct()

# total numbers in each group
n_tot_a <- sum(z1$group == 'A')
n_tot_b <- sum(z1$group == 'B')

# proportions in each demographic
check_a_z1$p_a <- check_a_z1$n_a / n_tot_a
check_b_z1$p_b <- check_b_z1$n_b / n_tot_b

# join on demo categories
check_df <- inner_join(check_a_z1, check_b_z1)
## Joining, by = "demographic"
# calculate the difference in proportions
check_df$diff <- check_df$p_a - check_df$p_b
check_df
## # A tibble: 4 x 6
## # Groups:   demographic [4]
##   demographic   n_a   p_a   n_b   p_b     diff
##         <int> <int> <dbl> <int> <dbl>    <dbl>
## 1           1   203 0.231    13 0.108  0.122  
## 2           3   197 0.224    16 0.133  0.0905 
## 3           4   244 0.277    59 0.492 -0.214  
## 4           2   236 0.268    32 0.267  0.00152

As indicated above, demographic groups 1, 3 & 4 have significant differences in proportions between Control and Treament groups. This indicates that the samples are biased. Consequently, significance testing can not apply to the entire population. As a result A/B testing was performed for each individual Demographic Group.

Demographic Group 1 Analysis

# *****D1 >>>>>>>>>>>>>>>>>>>>>>
print('Effect calculation:')
## [1] "Effect calculation:"
cond_1A <- z1$group == 'A' & z1$demographic == 1
print(paste('1A:', sum(z1$hours_watched[cond_1A]) / sum(cond_1A)))
## [1] "1A: 5.0243842364532"
cond_1B <- z1$group == 'B' & z1$demographic == 1
print(paste('1B:', sum(z1$hours_watched[cond_1B]) / sum(cond_1B)))
## [1] "1B: 5.74538461538461"
E1  <- (sum(z1$hours_watched[cond_1B]) / sum(cond_1B)) -    # Effect - Demographic Group 1
  (sum(z1$hours_watched[cond_1A]) / sum(cond_1A))
E1
## [1] 0.7210004
# D1 Minimum sample size calculation

d1 <- z1
d1 %<>%
  select(demographic, hours_watched) %>%
  filter(demographic == 1)
d1_sd <- sd(d1$hours_watched)

d1a <- z1
d1a %<>%
  select(demographic, group, hours_watched) %>%
  filter(demographic == 1, group == "A")

d1b <- z1
d1b %<>%
  select(demographic, group, hours_watched) %>%
  filter(demographic == 1, group == "B")

d1a_mu <- mean(d1a$hours_watched)
d1a_sd <- sd(d1a$hours_watched)

d1b_mu <- mean(d1b$hours_watched)
d1b_sd <- sd(d1b$hours_watched)

d1b_mu - d1a_mu
## [1] 0.7210004
z_alpha1 <- 1.96
effect_est1 <- E1
sd_est1 <- d1_sd
n_ss1 <- ceiling((z_alpha1 * sd_est1 / effect_est1)^2)
print(paste('Min sample size D1 =', n_ss1))
## [1] "Min sample size D1 = 9"
#+++++++++++++++++++++++++++++++++++
##*******************************t-test Demographic Group 1
# Sample from t1a

t1a <- z1
t1a %<>%
  filter(group =="A")

set.seed(111)

samp_t1a <- sample_n(t1a, 9)


mu_t1a <-mean(samp_t1a$hours_watched)

# Sample from t1b


t1b <- z1
t1b %<>%
  filter(group =="B")

set.seed(111)

samp_t1b <- sample_n(t1b, 9)


mu_t1b <-mean(samp_t1b$hours_watched)

# t-test 1

# Ho: muTreatment - muControl <= 0
# Ha: muTreatment - muControl > 0
t.test(samp_t1b$hours_watched, samp_t1a$hours_watched, mu=0, alt="greater", 
       conf=0.95, var.eq=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  samp_t1b$hours_watched and samp_t1a$hours_watched
## t = 2.8047, df = 13.604, p-value = 0.007184
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.5167349       Inf
## sample estimates:
## mean of x mean of y 
##  4.968333  3.574444
#create box plot 1
#jpeg(file="Boxplot1.jpeg")
boxplot(samp_t1a$hours_watched, samp_t1b$hours_watched, names = c("A", "B"), 
        ylab = "Hours Watched(Hrs)", xlab = "Group", 
        main = "Boxplots of Demographic Group 1", col=c("grey","green"))

#dev.off()

Demographic Group 2 Analysis

# *****D2
print('Effect calculation:')
## [1] "Effect calculation:"
cond_2A <- z1$group == 'A' & z1$demographic == 2
print(paste('2A:', sum(z1$hours_watched[cond_2A]) / sum(cond_2A)))
## [1] "2A: 5.01440677966102"
cond_2B <- z1$group == 'B' & z1$demographic == 2
print(paste('2B:', sum(z1$hours_watched[cond_2B]) / sum(cond_2B)))
## [1] "2B: 5.70625"
E2  <- (sum(z1$hours_watched[cond_2B]) / sum(cond_2B)) -  # Effect - Demographic Group 2
  (sum(z1$hours_watched[cond_2A]) / sum(cond_2A))
E2
## [1] 0.6918432
# D2 Minimum sample size calculation

d2 <- z1
d2 %<>%
  select(demographic, hours_watched) %>%
  filter(demographic == 2)
d2_sd <- sd(d2$hours_watched)

d2a <- z1
d2a %<>%
  select(demographic, group, hours_watched) %>%
  filter(demographic == 2, group == "A")

d2b <- z1
d2b %<>%
  select(demographic, group, hours_watched) %>%
  filter(demographic == 2, group == "B")

d2a_mu <- mean(d2a$hours_watched)
d2a_sd <- sd(d2a$hours_watched)

d2b_mu <- mean(d2b$hours_watched)
d2b_sd <- sd(d2b$hours_watched)

d2b_mu - d2a_mu
## [1] 0.6918432
z_alpha2 <- 1.96
effect_est2 <- E2
sd_est2 <- d2_sd
n_ss2 <- ceiling((z_alpha2 * sd_est2 / effect_est2)^2)
print(paste('Min sample size D2 = ', n_ss2))
## [1] "Min sample size D2 =  11"
#+++++++++++++++++++++++++++++++++

##*******************************t-test Demographic Group 2
# Sample from t2a

t2a <- z1
t2a %<>%
  filter(group =="A")

set.seed(111)

samp_t2a <- sample_n(t2a, 11)


mu_t2a <-mean(samp_t2a$hours_watched)

# Sample from t2b


t2b <- z1
t2b %<>%
  filter(group =="B")

set.seed(111)

samp_t2b <- sample_n(t2b, 11)


mu_t2b <-mean(samp_t2b$hours_watched)

# t-test 2

# Ho: muTreatment - muControl <= 0
# Ha: muTreatment - muControl > 0
t.test(samp_t2b$hours_watched, samp_t2a$hours_watched, mu=0, alt="greater", 
       conf=0.95, var.eq=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  samp_t2b$hours_watched and samp_t2a$hours_watched
## t = 1.964, df = 18.321, p-value = 0.03244
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.1153374       Inf
## sample estimates:
## mean of x mean of y 
##  4.677273  3.699091
#create box plot 2
#jpeg(file="Boxplot2.jpeg")
boxplot(samp_t1a$hours_watched, samp_t1b$hours_watched, names = c("A", "B"), 
        ylab = "Hours Watched(Hrs)", xlab = "Group", 
        main = "Boxplots of Demographic Group 2", col=c("grey","green"))

#dev.off()

Demographic Group 3 Analysis

# *****D3
print('Effect calculation:')
## [1] "Effect calculation:"
cond_3A <- z1$group == 'A' & z1$demographic == 3
print(paste('3A:', sum(z1$hours_watched[cond_3A]) / sum(cond_3A)))
## [1] "3A: 3.57847715736041"
cond_3B <- z1$group == 'B' & z1$demographic == 3
print(paste('3B:', sum(z1$hours_watched[cond_3B]) / sum(cond_3B)))
## [1] "3B: 4.220625"
E3  <- (sum(z1$hours_watched[cond_3B]) / sum(cond_3B)) - # Effect - Demographic Group 3
  (sum(z1$hours_watched[cond_3A]) / sum(cond_3A))
E3
## [1] 0.6421478
# D3 Minimum sample size calculation

d3 <- z1
d3 %<>%
  select(demographic, hours_watched) %>%
  filter(demographic == 3)
d3_sd <- sd(d3$hours_watched)

d3a <- z1
d3a %<>%
  select(demographic, group, hours_watched) %>%
  filter(demographic == 3, group == "A")

d3b <- z1
d3b %<>%
  select(demographic, group, hours_watched) %>%
  filter(demographic == 3, group == "B")

d3a_mu <- mean(d3a$hours_watched)
d3a_sd <- sd(d3a$hours_watched)

d3b_mu <- mean(d3b$hours_watched)
d3b_sd <- sd(d3b$hours_watched)

d3b_mu - d3a_mu
## [1] 0.6421478
z_alpha3 <- 1.96
effect_est3 <- E3
sd_est3 <- d3_sd
n_ss3 <- ceiling((z_alpha3 * sd_est3 / effect_est3)^2)
print(paste('Min sample size D3 = ', n_ss3))
## [1] "Min sample size D3 =  15"
#+++++++++++++++++++++++++++++++++
##*******************************t-test Demographic Group 3
# Sample from t3a

t3a <- z1
t3a %<>%
  filter(group =="A")

set.seed(111)

samp_t3a <- sample_n(t3a, 15)


mu_t3a <-mean(samp_t3a$hours_watched)

# Sample from t3b


t3b <- z1
t3b %<>%
  filter(group =="B")

set.seed(111)

samp_t3b <- sample_n(t3b, 15)


mu_t3b <-mean(samp_t3b$hours_watched)

# t-test 3

# Ho: muTreatment - muControl <= 0
# Ha: muTreatment - muControl > 0
t.test(samp_t3b$hours_watched, samp_t3a$hours_watched, mu=0, alt="greater", 
       conf=0.95, var.eq=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  samp_t3b$hours_watched and samp_t3a$hours_watched
## t = 0.20101, df = 25.022, p-value = 0.4212
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -0.7647342        Inf
## sample estimates:
## mean of x mean of y 
##     4.342     4.240
#create box plot 3

#jpeg(file="Boxplot3.jpeg")

boxplot(samp_t3a$hours_watched, samp_t3b$hours_watched, names = c("A", "B"), 
        ylab = "Hours Watched(Hrs)", xlab = "Group", 
        main = "Boxplots of Demographic Group 3", col=c("grey","green"))

#dev.off()

Demographic Group 4 Analysis

# *****D4
print('Effect calculation:')
## [1] "Effect calculation:"
cond_4A <- z1$group == 'A' & z1$demographic == 4
print(paste('4A:', sum(z1$hours_watched[cond_4A]) / sum(cond_4A)))
## [1] "4A: 3.71918032786885"
cond_4B <- z1$group == 'B' & z1$demographic == 4
print(paste('4B:', sum(z1$hours_watched[cond_4B]) / sum(cond_4B)))
## [1] "4B: 4.27940677966102"
E4  <- (sum(z1$hours_watched[cond_4B]) / sum(cond_4B)) - # Effect - Demographic Group 4
  (sum(z1$hours_watched[cond_4A]) / sum(cond_4A))
E4
## [1] 0.5602265
# D4 Minimum sample size calculation

d4 <- z1
d4 %<>%
  select(demographic, hours_watched) %>%
  filter(demographic == 4)
d4_sd <- sd(d4$hours_watched)

d4a <- z1
d4a %<>%
  select(demographic, group, hours_watched) %>%
  filter(demographic == 4, group == "A")

d4b <- z1
d4b %<>%
  select(demographic, group, hours_watched) %>%
  filter(demographic == 4, group == "B")

d4a_mu <- mean(d4a$hours_watched)
d4a_sd <- sd(d4a$hours_watched)

d4b_mu <- mean(d4b$hours_watched)
d4b_sd <- sd(d4b$hours_watched)

d4b_mu - d4a_mu
## [1] 0.5602265
z_alpha4 <- 1.96
effect_est4 <- E4
sd_est4 <- d4_sd
n_ss4 <- ceiling((z_alpha4 * sd_est4 / effect_est4)^2)
print(paste('Min sample size D4 is', n_ss4))
## [1] "Min sample size D4 is 17"
#+++++++++++++++++++++++++++++++++
##*******************************t-test Demographic Group 4
# Sample from t4a

t4a <- z1
t4a %<>%
  filter(group =="A")

set.seed(111)

samp_t4a <- sample_n(t4a, 17)


mu_t4a <-mean(samp_t4a$hours_watched)

# Sample from t4b


t4b <- z1
t4b %<>%
  filter(group =="B")

set.seed(111)

samp_t4b <- sample_n(t4b, 15)


mu_t4b <-mean(samp_t4b$hours_watched)

# t-test 4

# Ho: muTreatment - muControl <= 0
# Ha: muTreatment - muControl > 0
t.test(samp_t4b$hours_watched, samp_t4a$hours_watched, mu=0, alt="greater", 
       conf=0.95, var.eq=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  samp_t4b$hours_watched and samp_t4a$hours_watched
## t = 0.33033, df = 28.713, p-value = 0.3718
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -0.6618088        Inf
## sample estimates:
## mean of x mean of y 
##  4.342000  4.182353
#create box plot 4

#jpeg(file="Boxplot4.jpeg")

boxplot(samp_t4a$hours_watched, samp_t4b$hours_watched, names = c("A", "B"), 
        ylab = "Hours Watched(Hrs)", xlab = "Group", 
        main = "Boxplots of Demographic Group 4", col=c("grey","green"))

#dev.off()

Discussion

Recommendations & Conclusion

References