Streaming services in Australia have experienced unprecedented growth in 2020 because of bushfires and COVID-19.
Given the level of interest in the streaming market, competition is fierce and ever increasing.
Consequently, innovation is key to survival in this fast-moving market.
Project FieldStream was a pilot project to improve the recommendation engine algorithm with the goal of increasing viewership and ultimately advertising revenue for WhyNotWatch(WNW)
Change made to recommendation engine went live on 18 July 2021 at 00:00 to select group of subscribers.
Should a change to the recommendation engine algorithm be rolled out to all WNW subscribers?
In order to answer the above, statistics played a key role in the analysis of Project data as follows:
To determine key predictor variables to provide a multiple regression model to predict hours watched.
In A/B testing to ascertain whether the proposed change was effective or not with a 95% level of confidence for each demographic group.
| Date | Gender | Age | Social Metric | Time_signup | Demographic | Group | Hrs watched |
|---|---|---|---|---|---|---|---|
| Date | Factor | Integer | Integer | Numeric | Integer | Factor | Numeric |
| 2021/07/01 | F/M | 18-55 | 1-10 | 19.3 | 1-4 | A/B | 4.08 |
Dataset contains 1000 rows with 8 variables.
Hours watched is the response/dependent variable.
Demographics:
1 – Females (18-35)
2 – Males (18-35)
3 – Females (36-55)
4 – Males (36-55)
# Import dataset, check for NA and amend variable formats
z1 <- read.csv("C:/Users/verno/OneDrive/Documents/RMIT/MATH2406_02/Data/A3/streaming_data.csv")
# Check for any NA in dataframe
#is.na(z1)
# Amend variables
z1 %<>% mutate(gender = factor(gender,
levels = c("F", "M")))
#st01 %<>% mutate(social_metric = factor(social_metric,
# levels = c(0, 1,2,3,4,5,6,7,8,9,10)))
z1 %<>% mutate(date = as.Date(date, format = "%d/%m/%Y"))
#st01 %<>% mutate(demographic = factor(demographic,
# levels = c(1,2,3,4)))
z1 %<>% mutate(group = factor(group,
levels = c("A","B")))
z1 %<>% mutate(change = case_when((date < "2021-07-18" ~ "Pre"),
(date >= "2021-07-18" ~ "Post")))
z_cor <- z1
z_cor %<>% select(hours_watched, age, social_metric, time_since_signup, demographic)
From an investigation of the dataset, hours_watched represents the response(dependent) variable, whilst the following represent explanatory(independent) numeric variables:
Each of these will be checked using the cor() function individually
options(width=80)
# Determine correlation of variables using the cor() function
print('Correlation breakdown:')
## [1] "Correlation breakdown:"
print(paste('Hours_watched & Age:',(cor(z1$hours_watched, z1$age, method = 'pearson'))))
## [1] "Hours_watched & Age: -0.572844234008351"
print(paste('Hours_watched & Social metric:',(cor(z1$hours_watched, z1$social_metric,
method = 'pearson'))))
## [1] "Hours_watched & Social metric: 0.233024362586037"
print(paste('Hours_watched & Time since signup:',(cor(z1$hours_watched, z1$time_since_signup,
method = 'pearson'))))
## [1] "Hours_watched & Time since signup: -0.00574651927901223"
print(paste('Hours_watched & Demographic:',(cor(z1$hours_watched, z1$demographic,
method = 'pearson'))))
## [1] "Hours_watched & Demographic: -0.432452910479834"
# Generate correlation plot between varaibles using corrplot()
colnames(z_cor) <-c("hrs_watch","age","soc_metric","time_signup","d-graphic")
corrplot(cor(z_cor), title = "Correlation Plot",line= -0.65)
## Warning in text.default(pos.xlabel[, 1], pos.xlabel[, 2], newcolnames, srt =
## tl.srt, : "line" is not a graphical parameter
## Warning in text.default(pos.ylabel[, 1], pos.ylabel[, 2], newrownames, col =
## tl.col, : "line" is not a graphical parameter
As shown above, are the correlations between hours_watched and age, social metric, time_since_signup and demographic. The size and colour of the circles designate the strength and direction of the correlation between the variables.
# Generate Pair Plot using ggpairs()
ggpairs(z_cor, title = "Pairs Plot")
It can be seen from the graphic above that there is a negative correlation between hours watched and age and demographic. The correlation between hours watched and time_signup is virtually nil. There is a slight positive correlation between hours watched and soc_metric.
In order to determine the optimal model for the purpose of multi-variable regression, all of the variables were input and their p-values assessed. Only those with a p-value lower than 0.05 were considered. All values from the dataset(Groups A and B) were included, as the values from both groups are used as a basis of prediction, and are independent of the recommendation algorithm.
# Generate linear model for all the variables
lm1 <- lm(hours_watched ~ gender + social_metric + time_since_signup + age + demographic,
data = z1)
summary(lm1)
##
## Call:
## lm(formula = hours_watched ~ gender + social_metric + time_since_signup +
## age + demographic, data = z1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6799 -0.6550 -0.0196 0.7196 2.7463
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.350731 0.160463 39.578 <2e-16 ***
## genderM 0.151677 0.096553 1.571 0.117
## social_metric 0.097870 0.011164 8.767 <2e-16 ***
## time_since_signup 0.002681 0.004627 0.579 0.562
## age -0.063859 0.006126 -10.425 <2e-16 ***
## demographic -0.087208 0.065597 -1.329 0.184
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.054 on 994 degrees of freedom
## Multiple R-squared: 0.3779, Adjusted R-squared: 0.3748
## F-statistic: 120.8 on 5 and 994 DF, p-value: < 2.2e-16
As can be seen above, the only variables of significance as a result of their very small p-values are:
The lm() function was run again, but this time just for social_metric and age.
# Execute linear model only for social metric and age
lm2 <- lm(hours_watched ~ social_metric + age, data = z1)
summary(lm2)
##
## Call:
## lm(formula = hours_watched ~ social_metric + age, data = z1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7039 -0.6636 -0.0157 0.7398 2.8434
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.496610 0.131812 49.29 <2e-16 ***
## social_metric 0.097573 0.011139 8.76 <2e-16 ***
## age -0.070786 0.003121 -22.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.054 on 997 degrees of freedom
## Multiple R-squared: 0.3762, Adjusted R-squared: 0.3749
## F-statistic: 300.6 on 2 and 997 DF, p-value: < 2.2e-16
lm2
##
## Call:
## lm(formula = hours_watched ~ social_metric + age, data = z1)
##
## Coefficients:
## (Intercept) social_metric age
## 6.49661 0.09757 -0.07079
As shown above are the coefficients for social_metric and age.
# Generate 3D scatterplot using scatter3D()
scatter3D(x = z1$age, y = z1$hours_watched, z = z1$social_metric, colvar = z1$hours_watched,
pch = 19, cex = 1.5,
xlab = "Age", ylab = "Hours watched", zlab = "Social Metric", theta = 60, d = 4,
clab = c("Hours watched"),colkey = list(length = 0.5, width = 0.5,
cex.clab = 0.75,dist = -.08, side.clab = 3) ,
main = "3D Scatterplot of Hours watched , Age and Social metric")
As can be seen from the 3D scatterplot above, the highest hours watched was by the youngest subscribers with the highest social metric. The opposite applies for the older subscribers.
Groups A & B were checked for the existence of any bias in the sampled data.
Proportions for Control(A) and Treatment(B) groups for each demographic were compared as follows:
## Check Groups for existence of any bias
# Count the numbers in each demographic category based on the A/B group
check_a_z1 <- z1 %>%
filter(group == 'A') %>%
ungroup() %>%
select(demographic) %>%
group_by(demographic) %>%
mutate(n_a = n()) %>%
distinct()
check_b_z1 <- z1 %>%
filter(group == 'B') %>%
ungroup() %>%
select(demographic) %>%
group_by(demographic) %>%
mutate(n_b = n()) %>%
distinct()
# total numbers in each group
n_tot_a <- sum(z1$group == 'A')
n_tot_b <- sum(z1$group == 'B')
# proportions in each demographic
check_a_z1$p_a <- check_a_z1$n_a / n_tot_a
check_b_z1$p_b <- check_b_z1$n_b / n_tot_b
# join on demo categories
check_df <- inner_join(check_a_z1, check_b_z1)
## Joining, by = "demographic"
# calculate the difference in proportions
check_df$diff <- check_df$p_a - check_df$p_b
check_df
## # A tibble: 4 x 6
## # Groups: demographic [4]
## demographic n_a p_a n_b p_b diff
## <int> <int> <dbl> <int> <dbl> <dbl>
## 1 1 203 0.231 13 0.108 0.122
## 2 3 197 0.224 16 0.133 0.0905
## 3 4 244 0.277 59 0.492 -0.214
## 4 2 236 0.268 32 0.267 0.00152
As indicated above, demographic groups 1, 3 & 4 have significant differences in proportions between Control and Treament groups. This indicates that the samples are biased. Consequently, significance testing can not apply to the entire population. As a result A/B testing was performed for each individual Demographic Group.
# *****D1 >>>>>>>>>>>>>>>>>>>>>>
print('Effect calculation:')
## [1] "Effect calculation:"
cond_1A <- z1$group == 'A' & z1$demographic == 1
print(paste('1A:', sum(z1$hours_watched[cond_1A]) / sum(cond_1A)))
## [1] "1A: 5.0243842364532"
cond_1B <- z1$group == 'B' & z1$demographic == 1
print(paste('1B:', sum(z1$hours_watched[cond_1B]) / sum(cond_1B)))
## [1] "1B: 5.74538461538461"
E1 <- (sum(z1$hours_watched[cond_1B]) / sum(cond_1B)) - # Effect - Demographic Group 1
(sum(z1$hours_watched[cond_1A]) / sum(cond_1A))
E1
## [1] 0.7210004
# D1 Minimum sample size calculation
d1 <- z1
d1 %<>%
select(demographic, hours_watched) %>%
filter(demographic == 1)
d1_sd <- sd(d1$hours_watched)
d1a <- z1
d1a %<>%
select(demographic, group, hours_watched) %>%
filter(demographic == 1, group == "A")
d1b <- z1
d1b %<>%
select(demographic, group, hours_watched) %>%
filter(demographic == 1, group == "B")
d1a_mu <- mean(d1a$hours_watched)
d1a_sd <- sd(d1a$hours_watched)
d1b_mu <- mean(d1b$hours_watched)
d1b_sd <- sd(d1b$hours_watched)
d1b_mu - d1a_mu
## [1] 0.7210004
z_alpha1 <- 1.96
effect_est1 <- E1
sd_est1 <- d1_sd
n_ss1 <- ceiling((z_alpha1 * sd_est1 / effect_est1)^2)
print(paste('Min sample size D1 =', n_ss1))
## [1] "Min sample size D1 = 9"
#+++++++++++++++++++++++++++++++++++
##*******************************t-test Demographic Group 1
# Sample from t1a
t1a <- z1
t1a %<>%
filter(group =="A")
set.seed(111)
samp_t1a <- sample_n(t1a, 9)
mu_t1a <-mean(samp_t1a$hours_watched)
# Sample from t1b
t1b <- z1
t1b %<>%
filter(group =="B")
set.seed(111)
samp_t1b <- sample_n(t1b, 9)
mu_t1b <-mean(samp_t1b$hours_watched)
# t-test 1
# Ho: muTreatment - muControl <= 0
# Ha: muTreatment - muControl > 0
t.test(samp_t1b$hours_watched, samp_t1a$hours_watched, mu=0, alt="greater",
conf=0.95, var.eq=FALSE)
##
## Welch Two Sample t-test
##
## data: samp_t1b$hours_watched and samp_t1a$hours_watched
## t = 2.8047, df = 13.604, p-value = 0.007184
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.5167349 Inf
## sample estimates:
## mean of x mean of y
## 4.968333 3.574444
#create box plot 1
#jpeg(file="Boxplot1.jpeg")
boxplot(samp_t1a$hours_watched, samp_t1b$hours_watched, names = c("A", "B"),
ylab = "Hours Watched(Hrs)", xlab = "Group",
main = "Boxplots of Demographic Group 1", col=c("grey","green"))
#dev.off()
# *****D2
print('Effect calculation:')
## [1] "Effect calculation:"
cond_2A <- z1$group == 'A' & z1$demographic == 2
print(paste('2A:', sum(z1$hours_watched[cond_2A]) / sum(cond_2A)))
## [1] "2A: 5.01440677966102"
cond_2B <- z1$group == 'B' & z1$demographic == 2
print(paste('2B:', sum(z1$hours_watched[cond_2B]) / sum(cond_2B)))
## [1] "2B: 5.70625"
E2 <- (sum(z1$hours_watched[cond_2B]) / sum(cond_2B)) - # Effect - Demographic Group 2
(sum(z1$hours_watched[cond_2A]) / sum(cond_2A))
E2
## [1] 0.6918432
# D2 Minimum sample size calculation
d2 <- z1
d2 %<>%
select(demographic, hours_watched) %>%
filter(demographic == 2)
d2_sd <- sd(d2$hours_watched)
d2a <- z1
d2a %<>%
select(demographic, group, hours_watched) %>%
filter(demographic == 2, group == "A")
d2b <- z1
d2b %<>%
select(demographic, group, hours_watched) %>%
filter(demographic == 2, group == "B")
d2a_mu <- mean(d2a$hours_watched)
d2a_sd <- sd(d2a$hours_watched)
d2b_mu <- mean(d2b$hours_watched)
d2b_sd <- sd(d2b$hours_watched)
d2b_mu - d2a_mu
## [1] 0.6918432
z_alpha2 <- 1.96
effect_est2 <- E2
sd_est2 <- d2_sd
n_ss2 <- ceiling((z_alpha2 * sd_est2 / effect_est2)^2)
print(paste('Min sample size D2 = ', n_ss2))
## [1] "Min sample size D2 = 11"
#+++++++++++++++++++++++++++++++++
##*******************************t-test Demographic Group 2
# Sample from t2a
t2a <- z1
t2a %<>%
filter(group =="A")
set.seed(111)
samp_t2a <- sample_n(t2a, 11)
mu_t2a <-mean(samp_t2a$hours_watched)
# Sample from t2b
t2b <- z1
t2b %<>%
filter(group =="B")
set.seed(111)
samp_t2b <- sample_n(t2b, 11)
mu_t2b <-mean(samp_t2b$hours_watched)
# t-test 2
# Ho: muTreatment - muControl <= 0
# Ha: muTreatment - muControl > 0
t.test(samp_t2b$hours_watched, samp_t2a$hours_watched, mu=0, alt="greater",
conf=0.95, var.eq=FALSE)
##
## Welch Two Sample t-test
##
## data: samp_t2b$hours_watched and samp_t2a$hours_watched
## t = 1.964, df = 18.321, p-value = 0.03244
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.1153374 Inf
## sample estimates:
## mean of x mean of y
## 4.677273 3.699091
#create box plot 2
#jpeg(file="Boxplot2.jpeg")
boxplot(samp_t1a$hours_watched, samp_t1b$hours_watched, names = c("A", "B"),
ylab = "Hours Watched(Hrs)", xlab = "Group",
main = "Boxplots of Demographic Group 2", col=c("grey","green"))
#dev.off()
# *****D3
print('Effect calculation:')
## [1] "Effect calculation:"
cond_3A <- z1$group == 'A' & z1$demographic == 3
print(paste('3A:', sum(z1$hours_watched[cond_3A]) / sum(cond_3A)))
## [1] "3A: 3.57847715736041"
cond_3B <- z1$group == 'B' & z1$demographic == 3
print(paste('3B:', sum(z1$hours_watched[cond_3B]) / sum(cond_3B)))
## [1] "3B: 4.220625"
E3 <- (sum(z1$hours_watched[cond_3B]) / sum(cond_3B)) - # Effect - Demographic Group 3
(sum(z1$hours_watched[cond_3A]) / sum(cond_3A))
E3
## [1] 0.6421478
# D3 Minimum sample size calculation
d3 <- z1
d3 %<>%
select(demographic, hours_watched) %>%
filter(demographic == 3)
d3_sd <- sd(d3$hours_watched)
d3a <- z1
d3a %<>%
select(demographic, group, hours_watched) %>%
filter(demographic == 3, group == "A")
d3b <- z1
d3b %<>%
select(demographic, group, hours_watched) %>%
filter(demographic == 3, group == "B")
d3a_mu <- mean(d3a$hours_watched)
d3a_sd <- sd(d3a$hours_watched)
d3b_mu <- mean(d3b$hours_watched)
d3b_sd <- sd(d3b$hours_watched)
d3b_mu - d3a_mu
## [1] 0.6421478
z_alpha3 <- 1.96
effect_est3 <- E3
sd_est3 <- d3_sd
n_ss3 <- ceiling((z_alpha3 * sd_est3 / effect_est3)^2)
print(paste('Min sample size D3 = ', n_ss3))
## [1] "Min sample size D3 = 15"
#+++++++++++++++++++++++++++++++++
##*******************************t-test Demographic Group 3
# Sample from t3a
t3a <- z1
t3a %<>%
filter(group =="A")
set.seed(111)
samp_t3a <- sample_n(t3a, 15)
mu_t3a <-mean(samp_t3a$hours_watched)
# Sample from t3b
t3b <- z1
t3b %<>%
filter(group =="B")
set.seed(111)
samp_t3b <- sample_n(t3b, 15)
mu_t3b <-mean(samp_t3b$hours_watched)
# t-test 3
# Ho: muTreatment - muControl <= 0
# Ha: muTreatment - muControl > 0
t.test(samp_t3b$hours_watched, samp_t3a$hours_watched, mu=0, alt="greater",
conf=0.95, var.eq=FALSE)
##
## Welch Two Sample t-test
##
## data: samp_t3b$hours_watched and samp_t3a$hours_watched
## t = 0.20101, df = 25.022, p-value = 0.4212
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -0.7647342 Inf
## sample estimates:
## mean of x mean of y
## 4.342 4.240
#create box plot 3
#jpeg(file="Boxplot3.jpeg")
boxplot(samp_t3a$hours_watched, samp_t3b$hours_watched, names = c("A", "B"),
ylab = "Hours Watched(Hrs)", xlab = "Group",
main = "Boxplots of Demographic Group 3", col=c("grey","green"))
#dev.off()
# *****D4
print('Effect calculation:')
## [1] "Effect calculation:"
cond_4A <- z1$group == 'A' & z1$demographic == 4
print(paste('4A:', sum(z1$hours_watched[cond_4A]) / sum(cond_4A)))
## [1] "4A: 3.71918032786885"
cond_4B <- z1$group == 'B' & z1$demographic == 4
print(paste('4B:', sum(z1$hours_watched[cond_4B]) / sum(cond_4B)))
## [1] "4B: 4.27940677966102"
E4 <- (sum(z1$hours_watched[cond_4B]) / sum(cond_4B)) - # Effect - Demographic Group 4
(sum(z1$hours_watched[cond_4A]) / sum(cond_4A))
E4
## [1] 0.5602265
# D4 Minimum sample size calculation
d4 <- z1
d4 %<>%
select(demographic, hours_watched) %>%
filter(demographic == 4)
d4_sd <- sd(d4$hours_watched)
d4a <- z1
d4a %<>%
select(demographic, group, hours_watched) %>%
filter(demographic == 4, group == "A")
d4b <- z1
d4b %<>%
select(demographic, group, hours_watched) %>%
filter(demographic == 4, group == "B")
d4a_mu <- mean(d4a$hours_watched)
d4a_sd <- sd(d4a$hours_watched)
d4b_mu <- mean(d4b$hours_watched)
d4b_sd <- sd(d4b$hours_watched)
d4b_mu - d4a_mu
## [1] 0.5602265
z_alpha4 <- 1.96
effect_est4 <- E4
sd_est4 <- d4_sd
n_ss4 <- ceiling((z_alpha4 * sd_est4 / effect_est4)^2)
print(paste('Min sample size D4 is', n_ss4))
## [1] "Min sample size D4 is 17"
#+++++++++++++++++++++++++++++++++
##*******************************t-test Demographic Group 4
# Sample from t4a
t4a <- z1
t4a %<>%
filter(group =="A")
set.seed(111)
samp_t4a <- sample_n(t4a, 17)
mu_t4a <-mean(samp_t4a$hours_watched)
# Sample from t4b
t4b <- z1
t4b %<>%
filter(group =="B")
set.seed(111)
samp_t4b <- sample_n(t4b, 15)
mu_t4b <-mean(samp_t4b$hours_watched)
# t-test 4
# Ho: muTreatment - muControl <= 0
# Ha: muTreatment - muControl > 0
t.test(samp_t4b$hours_watched, samp_t4a$hours_watched, mu=0, alt="greater",
conf=0.95, var.eq=FALSE)
##
## Welch Two Sample t-test
##
## data: samp_t4b$hours_watched and samp_t4a$hours_watched
## t = 0.33033, df = 28.713, p-value = 0.3718
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -0.6618088 Inf
## sample estimates:
## mean of x mean of y
## 4.342000 4.182353
#create box plot 4
#jpeg(file="Boxplot4.jpeg")
boxplot(samp_t4a$hours_watched, samp_t4b$hours_watched, names = c("A", "B"),
ylab = "Hours Watched(Hrs)", xlab = "Group",
main = "Boxplots of Demographic Group 4", col=c("grey","green"))
#dev.off()
Hours watched:
Time since signing up has no bearing on the number of hours watched, so it cannot be used as a predictor variable.
Project limitation - samples drawn were biased between Control & Treatment groups for Demographic Groups 1,2 & 3.
Recommendation System algorithm change only worked for younger demographics.
Greater attention to be focused on ensuring sampling fairness among the control and treatment groups for each demographic in future, otherwise population inferences will not be possible.
Future A/B Tests should include people above the age of 55 as part of the sample, as they represent a niche with great potential given their availability of time and disposable income. In addition those in the sub-18 age group should be included in future data collection, as they also have growth potential.
WNW should roll out the change for subscribers in the younger demographics (Females & Males 18-35).
Bruce, P., 2020, Practical Statistics for Data Scientists, 2nd Edition, O’Reilly Media, Incorporated, Sebastopol, CA.
Dalgaard, P., 2008, Introductory Statistics with R, 2nd Edition, Springer, New York
Long, J.D. & Teetor, P., 2019, R Cookbook, 2nd Edition, O’Reilly Media, Incorporated, Sebastopol, CA.
Roy Morgan, 2021, Subscription TV viewers soared to 17.3 million Australians during 2020: Netflix, Foxtel, Stan, Disney+ & Amazon Prime all increased viewership by at least 1.5 million, viewed 9 June 2021, http://www.roymorgan.com/findings/8606-subscription-pay-tv-services-september-2020-202101120113
Viswanathan, V., Viswanathan, S., Gohil, A. & Yu-Wei, C., 2016, R: Recipes for Analysis, Visualization and Machine Learning, Packt Publishing, Birmingham, UK