Introduction

Streaming services in Australia have experienced unprecedented growth in 2020 because of bushfires and COVID-19.
Given the level of interest in the streaming market, competition is fierce and ever increasing.
Consequently, innovation is key to survival in this fast-moving market.
Project FieldStream was a pilot project to improve the recommendation engine algorithm with the goal of increasing viewership and ultimately advertising revenue for WhyNotWatch(WNW)
Change made to recommendation engine went live on 18 July 2021 at 00:00 to select group of subscribers.

Problem Statement

Should a change to the recommendation engine algorithm be rolled out to all WNW subscribers?

In order to answer the above, statistics played a key role in the analysis of Project data as follows:
- To determine key predictor variables to provide a multiple regression model to predict hours watched.
- In A/B testing to ascertain whether the proposed change was effective or not with a 95% level of confidence for each demographic group.

Data

Summary of Dataset Variables:

Date	Gender	Age	Social Metric	Time_signup	Demographic	Group	Hrs watched
Date	Factor	Integer	Integer	Numeric	Integer	Factor	Numeric
2021/07/01	F/M	18-55	1-10	19.3	1-4	A/B	4.08

Dataset contains 1000 rows with 8 variables.
Hours watched is the response/dependent variable.
Demographics:

1 – Females (18-35)

2 – Males (18-35)

3 – Females (36-55)

4 – Males (36-55)

# Import dataset, check for NA and amend variable formats

z1 <- read.csv("C:/Users/verno/OneDrive/Documents/RMIT/MATH2406_02/Data/A3/streaming_data.csv")

# Check for any NA in dataframe
#is.na(z1)

# Amend variables

z1 %<>%  mutate(gender = factor(gender, 
                                  levels = c("F", "M")))
#st01 %<>%  mutate(social_metric = factor(social_metric, 
#                                  levels = c(0, 1,2,3,4,5,6,7,8,9,10)))
z1 %<>% mutate(date = as.Date(date, format = "%d/%m/%Y"))
#st01 %<>% mutate(demographic = factor(demographic, 
#                                         levels = c(1,2,3,4)))
z1 %<>% mutate(group = factor(group, 
                                      levels = c("A","B")))

z1 %<>% mutate(change = case_when((date < "2021-07-18" ~ "Pre"),
                           (date >= "2021-07-18" ~ "Post")))

z_cor <- z1
z_cor %<>% select(hours_watched, age, social_metric, time_since_signup, demographic)

Correlation determination

From an investigation of the dataset, hours_watched represents the response(dependent) variable, whilst the following represent explanatory(independent) numeric variables:

Age
Social_metric
Time_since_signup
Demographic
Group

Each of these will be checked using the cor() function individually

options(width=80)
# Determine correlation of variables using the cor() function

print('Correlation breakdown:')

## [1] "Correlation breakdown:"

print(paste('Hours_watched & Age:',(cor(z1$hours_watched, z1$age, method = 'pearson'))))

## [1] "Hours_watched & Age: -0.572844234008351"

print(paste('Hours_watched & Social metric:',(cor(z1$hours_watched, z1$social_metric, 
                                                  method = 'pearson'))))

## [1] "Hours_watched & Social metric: 0.233024362586037"

print(paste('Hours_watched & Time since signup:',(cor(z1$hours_watched, z1$time_since_signup, 
                                                      method = 'pearson'))))

## [1] "Hours_watched & Time since signup: -0.00574651927901223"

print(paste('Hours_watched & Demographic:',(cor(z1$hours_watched, z1$demographic, 
                                                method = 'pearson'))))

## [1] "Hours_watched & Demographic: -0.432452910479834"

# Generate correlation plot between varaibles using corrplot()

colnames(z_cor) <-c("hrs_watch","age","soc_metric","time_signup","d-graphic")
corrplot(cor(z_cor), title = "Correlation Plot",line= -0.65)

## Warning in text.default(pos.xlabel[, 1], pos.xlabel[, 2], newcolnames, srt =
## tl.srt, : "line" is not a graphical parameter

## Warning in text.default(pos.ylabel[, 1], pos.ylabel[, 2], newrownames, col =
## tl.col, : "line" is not a graphical parameter

As shown above, are the correlations between hours_watched and age, social metric, time_since_signup and demographic. The size and colour of the circles designate the strength and direction of the correlation between the variables.

# Generate Pair Plot using ggpairs()

ggpairs(z_cor, title = "Pairs Plot")

It can be seen from the graphic above that there is a negative correlation between hours watched and age and demographic. The correlation between hours watched and time_signup is virtually nil. There is a slight positive correlation between hours watched and soc_metric.

Regression Model

In order to determine the optimal model for the purpose of multi-variable regression, all of the variables were input and their p-values assessed. Only those with a p-value lower than 0.05 were considered. All values from the dataset(Groups A and B) were included, as the values from both groups are used as a basis of prediction, and are independent of the recommendation algorithm.

# Generate linear model for all the variables

lm1 <- lm(hours_watched ~ gender + social_metric + time_since_signup + age + demographic, 
          data = z1)

summary(lm1)

## 
## Call:
## lm(formula = hours_watched ~ gender + social_metric + time_since_signup + 
##     age + demographic, data = z1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6799 -0.6550 -0.0196  0.7196  2.7463 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        6.350731   0.160463  39.578   <2e-16 ***
## genderM            0.151677   0.096553   1.571    0.117    
## social_metric      0.097870   0.011164   8.767   <2e-16 ***
## time_since_signup  0.002681   0.004627   0.579    0.562    
## age               -0.063859   0.006126 -10.425   <2e-16 ***
## demographic       -0.087208   0.065597  -1.329    0.184    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.054 on 994 degrees of freedom
## Multiple R-squared:  0.3779, Adjusted R-squared:  0.3748 
## F-statistic: 120.8 on 5 and 994 DF,  p-value: < 2.2e-16

As can be seen above, the only variables of significance as a result of their very small p-values are:

social_metric
age

The lm() function was run again, but this time just for social_metric and age.

# Execute linear model only for social metric and age

lm2 <- lm(hours_watched ~ social_metric + age, data = z1)

summary(lm2)

## 
## Call:
## lm(formula = hours_watched ~ social_metric + age, data = z1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7039 -0.6636 -0.0157  0.7398  2.8434 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.496610   0.131812   49.29   <2e-16 ***
## social_metric  0.097573   0.011139    8.76   <2e-16 ***
## age           -0.070786   0.003121  -22.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.054 on 997 degrees of freedom
## Multiple R-squared:  0.3762, Adjusted R-squared:  0.3749 
## F-statistic: 300.6 on 2 and 997 DF,  p-value: < 2.2e-16

lm2

## 
## Call:
## lm(formula = hours_watched ~ social_metric + age, data = z1)
## 
## Coefficients:
##   (Intercept)  social_metric            age  
##       6.49661        0.09757       -0.07079

As shown above are the coefficients for social_metric and age.

3D Scatterplot

# Generate 3D scatterplot using scatter3D()

scatter3D(x = z1$age, y = z1$hours_watched, z = z1$social_metric, colvar = z1$hours_watched,  
          pch = 19, cex = 1.5,  
          xlab = "Age", ylab = "Hours watched",  zlab = "Social Metric", theta = 60, d = 4,
          clab = c("Hours watched"),colkey = list(length = 0.5, width = 0.5, 
          cex.clab = 0.75,dist = -.08, side.clab = 3)  ,
          main = "3D Scatterplot of Hours watched , Age and Social metric")

As can be seen from the 3D scatterplot above, the highest hours watched was by the youngest subscribers with the highest social metric. The opposite applies for the older subscribers.

Sample Bias determination

Groups A & B were checked for the existence of any bias in the sampled data.
Proportions for Control(A) and Treatment(B) groups for each demographic were compared as follows:

## Check Groups for existence of any bias
# Count the numbers in each demographic category based on the A/B group
check_a_z1 <- z1 %>% 
  filter(group == 'A') %>%
  ungroup() %>%
  select(demographic) %>% 
  group_by(demographic) %>% 
  mutate(n_a = n()) %>% 
  distinct()

check_b_z1 <- z1 %>% 
  filter(group == 'B') %>%
  ungroup() %>%
  select(demographic) %>% 
  group_by(demographic) %>% 
  mutate(n_b = n()) %>% 
  distinct()

# total numbers in each group
n_tot_a <- sum(z1$group == 'A')
n_tot_b <- sum(z1$group == 'B')

# proportions in each demographic
check_a_z1$p_a <- check_a_z1$n_a / n_tot_a
check_b_z1$p_b <- check_b_z1$n_b / n_tot_b

# join on demo categories
check_df <- inner_join(check_a_z1, check_b_z1)

## Joining, by = "demographic"

# calculate the difference in proportions
check_df$diff <- check_df$p_a - check_df$p_b
check_df

## # A tibble: 4 x 6
## # Groups:   demographic [4]
##   demographic   n_a   p_a   n_b   p_b     diff
##         <int> <int> <dbl> <int> <dbl>    <dbl>
## 1           1   203 0.231    13 0.108  0.122  
## 2           3   197 0.224    16 0.133  0.0905 
## 3           4   244 0.277    59 0.492 -0.214  
## 4           2   236 0.268    32 0.267  0.00152

As indicated above, demographic groups 1, 3 & 4 have significant differences in proportions between Control and Treament groups. This indicates that the samples are biased. Consequently, significance testing can not apply to the entire population. As a result A/B testing was performed for each individual Demographic Group.

Demographic Group 1 Analysis

# *****D1 >>>>>>>>>>>>>>>>>>>>>>
print('Effect calculation:')

## [1] "Effect calculation:"

cond_1A <- z1$group == 'A' & z1$demographic == 1
print(paste('1A:', sum(z1$hours_watched[cond_1A]) / sum(cond_1A)))

## [1] "1A: 5.0243842364532"

cond_1B <- z1$group == 'B' & z1$demographic == 1
print(paste('1B:', sum(z1$hours_watched[cond_1B]) / sum(cond_1B)))

## [1] "1B: 5.74538461538461"

E1  <- (sum(z1$hours_watched[cond_1B]) / sum(cond_1B)) -    # Effect - Demographic Group 1
  (sum(z1$hours_watched[cond_1A]) / sum(cond_1A))
E1

## [1] 0.7210004

# D1 Minimum sample size calculation

d1 <- z1
d1 %<>%
  select(demographic, hours_watched) %>%
  filter(demographic == 1)
d1_sd <- sd(d1$hours_watched)

d1a <- z1
d1a %<>%
  select(demographic, group, hours_watched) %>%
  filter(demographic == 1, group == "A")

d1b <- z1
d1b %<>%
  select(demographic, group, hours_watched) %>%
  filter(demographic == 1, group == "B")

d1a_mu <- mean(d1a$hours_watched)
d1a_sd <- sd(d1a$hours_watched)

d1b_mu <- mean(d1b$hours_watched)
d1b_sd <- sd(d1b$hours_watched)

d1b_mu - d1a_mu

## [1] 0.7210004

z_alpha1 <- 1.96
effect_est1 <- E1
sd_est1 <- d1_sd
n_ss1 <- ceiling((z_alpha1 * sd_est1 / effect_est1)^2)
print(paste('Min sample size D1 =', n_ss1))

## [1] "Min sample size D1 = 9"

#+++++++++++++++++++++++++++++++++++
##*******************************t-test Demographic Group 1
# Sample from t1a

t1a <- z1
t1a %<>%
  filter(group =="A")

set.seed(111)

samp_t1a <- sample_n(t1a, 9)


mu_t1a <-mean(samp_t1a$hours_watched)

# Sample from t1b


t1b <- z1
t1b %<>%
  filter(group =="B")

set.seed(111)

samp_t1b <- sample_n(t1b, 9)


mu_t1b <-mean(samp_t1b$hours_watched)

# t-test 1

# Ho: muTreatment - muControl <= 0
# Ha: muTreatment - muControl > 0
t.test(samp_t1b$hours_watched, samp_t1a$hours_watched, mu=0, alt="greater", 
       conf=0.95, var.eq=FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  samp_t1b$hours_watched and samp_t1a$hours_watched
## t = 2.8047, df = 13.604, p-value = 0.007184
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.5167349       Inf
## sample estimates:
## mean of x mean of y 
##  4.968333  3.574444

#create box plot 1
#jpeg(file="Boxplot1.jpeg")
boxplot(samp_t1a$hours_watched, samp_t1b$hours_watched, names = c("A", "B"), 
        ylab = "Hours Watched(Hrs)", xlab = "Group", 
        main = "Boxplots of Demographic Group 1", col=c("grey","green"))

#dev.off()

Demographic Group 2 Analysis

# *****D2
print('Effect calculation:')

## [1] "Effect calculation:"

cond_2A <- z1$group == 'A' & z1$demographic == 2
print(paste('2A:', sum(z1$hours_watched[cond_2A]) / sum(cond_2A)))

## [1] "2A: 5.01440677966102"

cond_2B <- z1$group == 'B' & z1$demographic == 2
print(paste('2B:', sum(z1$hours_watched[cond_2B]) / sum(cond_2B)))

## [1] "2B: 5.70625"

E2  <- (sum(z1$hours_watched[cond_2B]) / sum(cond_2B)) -  # Effect - Demographic Group 2
  (sum(z1$hours_watched[cond_2A]) / sum(cond_2A))
E2

## [1] 0.6918432

# D2 Minimum sample size calculation

d2 <- z1
d2 %<>%
  select(demographic, hours_watched) %>%
  filter(demographic == 2)
d2_sd <- sd(d2$hours_watched)

d2a <- z1
d2a %<>%
  select(demographic, group, hours_watched) %>%
  filter(demographic == 2, group == "A")

d2b <- z1
d2b %<>%
  select(demographic, group, hours_watched) %>%
  filter(demographic == 2, group == "B")

d2a_mu <- mean(d2a$hours_watched)
d2a_sd <- sd(d2a$hours_watched)

d2b_mu <- mean(d2b$hours_watched)
d2b_sd <- sd(d2b$hours_watched)

d2b_mu - d2a_mu

## [1] 0.6918432

z_alpha2 <- 1.96
effect_est2 <- E2
sd_est2 <- d2_sd
n_ss2 <- ceiling((z_alpha2 * sd_est2 / effect_est2)^2)
print(paste('Min sample size D2 = ', n_ss2))

## [1] "Min sample size D2 =  11"

#+++++++++++++++++++++++++++++++++

##*******************************t-test Demographic Group 2
# Sample from t2a

t2a <- z1
t2a %<>%
  filter(group =="A")

set.seed(111)

samp_t2a <- sample_n(t2a, 11)


mu_t2a <-mean(samp_t2a$hours_watched)

# Sample from t2b


t2b <- z1
t2b %<>%
  filter(group =="B")

set.seed(111)

samp_t2b <- sample_n(t2b, 11)


mu_t2b <-mean(samp_t2b$hours_watched)

# t-test 2

# Ho: muTreatment - muControl <= 0
# Ha: muTreatment - muControl > 0
t.test(samp_t2b$hours_watched, samp_t2a$hours_watched, mu=0, alt="greater", 
       conf=0.95, var.eq=FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  samp_t2b$hours_watched and samp_t2a$hours_watched
## t = 1.964, df = 18.321, p-value = 0.03244
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.1153374       Inf
## sample estimates:
## mean of x mean of y 
##  4.677273  3.699091

#create box plot 2
#jpeg(file="Boxplot2.jpeg")
boxplot(samp_t1a$hours_watched, samp_t1b$hours_watched, names = c("A", "B"), 
        ylab = "Hours Watched(Hrs)", xlab = "Group", 
        main = "Boxplots of Demographic Group 2", col=c("grey","green"))

#dev.off()

Demographic Group 3 Analysis

# *****D3
print('Effect calculation:')

## [1] "Effect calculation:"

cond_3A <- z1$group == 'A' & z1$demographic == 3
print(paste('3A:', sum(z1$hours_watched[cond_3A]) / sum(cond_3A)))

## [1] "3A: 3.57847715736041"

cond_3B <- z1$group == 'B' & z1$demographic == 3
print(paste('3B:', sum(z1$hours_watched[cond_3B]) / sum(cond_3B)))

## [1] "3B: 4.220625"

E3  <- (sum(z1$hours_watched[cond_3B]) / sum(cond_3B)) - # Effect - Demographic Group 3
  (sum(z1$hours_watched[cond_3A]) / sum(cond_3A))
E3

## [1] 0.6421478

# D3 Minimum sample size calculation

d3 <- z1
d3 %<>%
  select(demographic, hours_watched) %>%
  filter(demographic == 3)
d3_sd <- sd(d3$hours_watched)

d3a <- z1
d3a %<>%
  select(demographic, group, hours_watched) %>%
  filter(demographic == 3, group == "A")

d3b <- z1
d3b %<>%
  select(demographic, group, hours_watched) %>%
  filter(demographic == 3, group == "B")

d3a_mu <- mean(d3a$hours_watched)
d3a_sd <- sd(d3a$hours_watched)

d3b_mu <- mean(d3b$hours_watched)
d3b_sd <- sd(d3b$hours_watched)

d3b_mu - d3a_mu

## [1] 0.6421478

z_alpha3 <- 1.96
effect_est3 <- E3
sd_est3 <- d3_sd
n_ss3 <- ceiling((z_alpha3 * sd_est3 / effect_est3)^2)
print(paste('Min sample size D3 = ', n_ss3))

## [1] "Min sample size D3 =  15"

#+++++++++++++++++++++++++++++++++
##*******************************t-test Demographic Group 3
# Sample from t3a

t3a <- z1
t3a %<>%
  filter(group =="A")

set.seed(111)

samp_t3a <- sample_n(t3a, 15)


mu_t3a <-mean(samp_t3a$hours_watched)

# Sample from t3b


t3b <- z1
t3b %<>%
  filter(group =="B")

set.seed(111)

samp_t3b <- sample_n(t3b, 15)


mu_t3b <-mean(samp_t3b$hours_watched)

# t-test 3

# Ho: muTreatment - muControl <= 0
# Ha: muTreatment - muControl > 0
t.test(samp_t3b$hours_watched, samp_t3a$hours_watched, mu=0, alt="greater", 
       conf=0.95, var.eq=FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  samp_t3b$hours_watched and samp_t3a$hours_watched
## t = 0.20101, df = 25.022, p-value = 0.4212
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -0.7647342        Inf
## sample estimates:
## mean of x mean of y 
##     4.342     4.240

#create box plot 3

#jpeg(file="Boxplot3.jpeg")

boxplot(samp_t3a$hours_watched, samp_t3b$hours_watched, names = c("A", "B"), 
        ylab = "Hours Watched(Hrs)", xlab = "Group", 
        main = "Boxplots of Demographic Group 3", col=c("grey","green"))

#dev.off()

Demographic Group 4 Analysis

# *****D4
print('Effect calculation:')

## [1] "Effect calculation:"

cond_4A <- z1$group == 'A' & z1$demographic == 4
print(paste('4A:', sum(z1$hours_watched[cond_4A]) / sum(cond_4A)))

## [1] "4A: 3.71918032786885"

cond_4B <- z1$group == 'B' & z1$demographic == 4
print(paste('4B:', sum(z1$hours_watched[cond_4B]) / sum(cond_4B)))

## [1] "4B: 4.27940677966102"

E4  <- (sum(z1$hours_watched[cond_4B]) / sum(cond_4B)) - # Effect - Demographic Group 4
  (sum(z1$hours_watched[cond_4A]) / sum(cond_4A))
E4

## [1] 0.5602265

# D4 Minimum sample size calculation

d4 <- z1
d4 %<>%
  select(demographic, hours_watched) %>%
  filter(demographic == 4)
d4_sd <- sd(d4$hours_watched)

d4a <- z1
d4a %<>%
  select(demographic, group, hours_watched) %>%
  filter(demographic == 4, group == "A")

d4b <- z1
d4b %<>%
  select(demographic, group, hours_watched) %>%
  filter(demographic == 4, group == "B")

d4a_mu <- mean(d4a$hours_watched)
d4a_sd <- sd(d4a$hours_watched)

d4b_mu <- mean(d4b$hours_watched)
d4b_sd <- sd(d4b$hours_watched)

d4b_mu - d4a_mu

## [1] 0.5602265

z_alpha4 <- 1.96
effect_est4 <- E4
sd_est4 <- d4_sd
n_ss4 <- ceiling((z_alpha4 * sd_est4 / effect_est4)^2)
print(paste('Min sample size D4 is', n_ss4))

## [1] "Min sample size D4 is 17"

#+++++++++++++++++++++++++++++++++
##*******************************t-test Demographic Group 4
# Sample from t4a

t4a <- z1
t4a %<>%
  filter(group =="A")

set.seed(111)

samp_t4a <- sample_n(t4a, 17)


mu_t4a <-mean(samp_t4a$hours_watched)

# Sample from t4b


t4b <- z1
t4b %<>%
  filter(group =="B")

set.seed(111)

samp_t4b <- sample_n(t4b, 15)


mu_t4b <-mean(samp_t4b$hours_watched)

# t-test 4

# Ho: muTreatment - muControl <= 0
# Ha: muTreatment - muControl > 0
t.test(samp_t4b$hours_watched, samp_t4a$hours_watched, mu=0, alt="greater", 
       conf=0.95, var.eq=FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  samp_t4b$hours_watched and samp_t4a$hours_watched
## t = 0.33033, df = 28.713, p-value = 0.3718
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -0.6618088        Inf
## sample estimates:
## mean of x mean of y 
##  4.342000  4.182353

#create box plot 4

#jpeg(file="Boxplot4.jpeg")

boxplot(samp_t4a$hours_watched, samp_t4b$hours_watched, names = c("A", "B"), 
        ylab = "Hours Watched(Hrs)", xlab = "Group", 
        main = "Boxplots of Demographic Group 4", col=c("grey","green"))

#dev.off()

Discussion

Hours watched:
- is inversely related to age – older demographics watch less.
- increases with social metric.
Time since signing up has no bearing on the number of hours watched, so it cannot be used as a predictor variable.
Project limitation - samples drawn were biased between Control & Treatment groups for Demographic Groups 1,2 & 3.
Recommendation System algorithm change only worked for younger demographics.

Recommendations & Conclusion

Greater attention to be focused on ensuring sampling fairness among the control and treatment groups for each demographic in future, otherwise population inferences will not be possible.
Future A/B Tests should include people above the age of 55 as part of the sample, as they represent a niche with great potential given their availability of time and disposable income. In addition those in the sub-18 age group should be included in future data collection, as they also have growth potential.
WNW should roll out the change for subscribers in the younger demographics (Females & Males 18-35).

References

Bruce, P., 2020, Practical Statistics for Data Scientists, 2nd Edition, O’Reilly Media, Incorporated, Sebastopol, CA.
Dalgaard, P., 2008, Introductory Statistics with R, 2nd Edition, Springer, New York
Long, J.D. & Teetor, P., 2019, R Cookbook, 2nd Edition, O’Reilly Media, Incorporated, Sebastopol, CA.
Roy Morgan, 2021, Subscription TV viewers soared to 17.3 million Australians during 2020: Netflix, Foxtel, Stan, Disney+ & Amazon Prime all increased viewership by at least 1.5 million, viewed 9 June 2021, http://www.roymorgan.com/findings/8606-subscription-pay-tv-services-september-2020-202101120113
Viswanathan, V., Viswanathan, S., Gohil, A. & Yu-Wei, C., 2016, R: Recipes for Analysis, Visualization and Machine Learning, Packt Publishing, Birmingham, UK

MATH2406 - Assessment 3 - Project FieldStream Findings

Vernon Chun (s3883131)

17 June 2021