Why not watch is keen to remain sustainable in this competitive streaming market industry by experimenting new methodologies to increase viewer engagement (hours) in order to maximize advertising revenue and expand the customer base.
One of the company’s strategies involves regularly changing the algorithm in the recommendation engine so it can dynamically suggest shows to viewers based on their viewing habits.
The executives in WNW has handed over this project to its analytical team to find out if the new algorithm is worth rolling to their subscribers.
This report details the findings of that analysis.
Analytical team has performed various statistical tests on the full dataset, the control group and treatment group respectively to distinguish if any positive impact has triggered for the treatment group with the new algorithm.
I will be using below techniques to resolved the question
Collecting the data from customers and checking if any data manipulation is required.
Cleaning and testing for Outliers.
Sub-setting the data to 3 groups as below
** Pre-campaign data to be used for establishing a population mean and standard deviation ** Group B data for determining if the recommendation engine has in fact worked ** Group A campaign data as the comparative baseline to measure
Identifying the distribution and key trends in the data via descriptive statistic and relevant plots or multivariate graphs
Ascertain the sample sizes for each demographic groups
Then based on the sample sizes which is if the sample size is > minimum sample size, conducting hypothesis testing to determine the recommendation engine changes impact
Linear and multi-linear regression to determine key predictors that may also boost customer engagement and viewing
Loading the data
streaming_dfnew<-read.csv("C:/data/streaming_data.csv")
head(streaming_dfnew,3)
In this area I will be just describing all the variables and further below an investigation is done to identify the most important variables of this dataset.
streaming_dfnew$gender <- as.factor(streaming_dfnew$gender)
streaming_dfnew$gender <- factor (c(streaming_dfnew$gender),levels = c("M","F"),labels=c("M","F"))
streaming_dfnew$demographic <- as.factor(streaming_dfnew$demographic)
streaming_dfnew$demographic <- factor (c(streaming_dfnew$demographic),levels = c("1","2","3","4"),labels=c("1","2","3","4"))
streaming_dfnew$group <- as.factor(streaming_dfnew$group)
streaming_dfnew$group <- factor (c(streaming_dfnew$group),levels = c("A","B"),labels=c("A","B"))
*Create Year only column.
vec <- rep(c(2022),1000)
streaming_dfclean <-streaming_dfnew %>%
mutate(newcol = vec) %>%
rename (date2 = "date")
*Concat the date column
streaming_dfcon <- streaming_dfclean
streaming_dfcon$date3 <- paste(streaming_dfcon$date2, streaming_dfcon$newcol,sep = "-")
Numerical Variables
streaming_dfcon$newdate<- dmy(streaming_dfcon$date3)
streaming_df <- streaming_dfcon %>%
select(date ='newdate',gender,age,social_metric,time_since_signup,demographic,group,hours_watched)
str(streaming_df)
## 'data.frame': 1000 obs. of 8 variables:
## $ date : Date, format: "2022-07-01" "2022-07-01" ...
## $ gender : Factor w/ 2 levels "M","F": 2 2 2 1 1 1 2 1 1 1 ...
## $ age : int 28 32 39 52 25 51 53 42 41 20 ...
## $ social_metric : int 5 7 4 10 1 0 5 6 8 7 ...
## $ time_since_signup: num 19.3 11.5 4.3 9.5 19.5 22.6 4.2 8.5 16.9 23 ...
## $ demographic : Factor w/ 4 levels "1","2","3","4": 1 1 3 4 2 4 3 4 4 2 ...
## $ group : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
## $ hours_watched : num 4.08 2.99 5.74 4.13 4.68 3.4 3.07 2.77 2.24 5.39 ...
Categorical Variables
Group viewers in the dataset by age and gender
Demographic categories Demographic 1: Younger females aged 18 to 35 years Demographic 2: Younger males aged 18 to 35 years Demographic 3: Older females aged 36 to 55 years Demographic 4: Older males aged 36 to 55 years
Group (A/B) Group A is the control:880 viewers (includes group A viewers pre and post campaign launch) Group B is the treated group: 120 (includes viewers exposed to the recommendation engine changes during the campaign)
*Below plots are generated to find out the variables and its characteristics.
qplot1 <- qplot(streaming_df$hours_watched,main="Hours Watched",fill=streaming_df$gender,xlab
="Hours Watched",ylab="Count")+
labs(fill = 'Gender')
qplot2 <-qplot(streaming_df$gender,main="Gender", fill = streaming_df$gender,xlab="Gender",ylab="Count")+
labs(fill = 'Gender')
qplot3 <-qplot(streaming_df$time_since_signup,main="Time Since Signup Distribution",fill=streaming_df$gender,xlab="Time Since Sign-Up",ylab="Count")+
labs(fill = 'Gender')
qplot4 <-qplot(streaming_df$age, streaming_df$hours_watched,main="Age vs Gender vs Hours Watched",colour=
streaming_df$gender,xlab="Age",ylab="Hours Watched")+ labs(colour = 'Gender')
grid.arrange(qplot1,qplot2,qplot3,qplot4,nrow=2)
qplot5 <-qplot(streaming_df$social_metric, streaming_df$hours_watched,main="Social metric vs Hours Watched vs Gender",colour=
streaming_df$gender,xlab="Social Metric",ylab="Hours Watched")+ labs(colour = 'Gender')
qplot6 <-qplot(streaming_df$demographic, streaming_df$hours_watched,main="Demographic vs Hours Watched vs Gender",colour=
streaming_df$gender,xlab="Demographic",ylab="Hours Watched")+ labs(colour = 'Gender')
grid.arrange(qplot5 ,qplot6,nrow=1)
plot7 <- boxplot(streaming_df$hours_watched,xlab="Hours Watched)",
main="Hours Watched")
There are several key variables that are linked via the sampling process.
According to the above analysis, it is visible that there’s gender imbalance and the dataset is Gender bias.
This could skew our test results and need to look into how this issue could be minimized when generating sampling methodologies.
Sampling methodology:
##Segment the data
##population data - pre-launch campaign data
streaming_df_pop <- streaming_df %>% subset(date<="2022-07-17")
streaming_df_pop_d1 <- streaming_df_pop %>% subset(demographic=="1")
streaming_df_pop_d2 <- streaming_df_pop%>% subset(demographic=="2")
streaming_df_pop_d3 <- streaming_df_pop %>% subset(demographic=="3")
streaming_df_pop_d4 <- streaming_df_pop %>% subset(demographic=="4")
##split by group and demographic * Group A after the launch which is from 18th July.
streaming_df_A <- streaming_df %>% subset(group=="A" & date >"2022-07-17")
streaming_df_Ad1 <- streaming_df_A %>% subset(demographic=="1")
streaming_df_Ad2 <- streaming_df_A %>% subset(demographic=="2")
streaming_df_Ad3 <- streaming_df_A %>% subset(demographic=="3")
streaming_df_Ad4 <- streaming_df_A %>% subset(demographic=="4")
streaming_df_B <- streaming_df %>% subset(group=="B" & date>"2022-07-17")
streaming_df_Bd1 <- streaming_df_B %>% subset(demographic=="1")
streaming_df_Bd2 <- streaming_df_B %>% subset(demographic=="2")
streaming_df_Bd3 <- streaming_df_B %>% subset(demographic=="3")
streaming_df_Bd4 <- streaming_df_B %>% subset(demographic=="4")
##Visualize demographic groupings by gender
#examine full dataset groupings
int_breaks_rounded <- function(x, n = 10) pretty(x, n)[round(pretty(x, n),1) %% 1 == 0]
gg1 <- ggplot()
gg1 <- gg1 + geom_point(position = position_jitter(width = 0.45,
height = 0.45),
aes(x=streaming_df$demographic,
y=streaming_df$hours_watched,
colour=(streaming_df$gender)))
gg1 <- gg1 + scale_y_continuous(breaks= int_breaks_rounded )
gg1 <- gg1 + labs(x='Demographic', y='Hours Watched',
colour='Gender')
gg1 <- gg1 + ggtitle("Full dataset")
#examine control group A groupings
int_breaks_rounded <- function(x, n = 10) pretty(x, n)[round(pretty(x, n),1) %% 1 == 0]
gg1a <- ggplot()
gg1a <- gg1a + geom_point(position = position_jitter(width = 0.45,
height = 0.45),
aes(x=streaming_df_A$demographic,
y=streaming_df_A$hours_watched,
colour=(streaming_df_A$gender)))
gg1a <- gg1a + scale_y_continuous(breaks= int_breaks_rounded )
gg1a <- gg1a + labs(x='Demographic', y='Hours Watched',
colour='Gender')
gg1a <- gg1a + ggtitle("Control Group A")
#examine treatment group B groupings
int_breaks_rounded <- function(x, n = 10) pretty(x, n)[round(pretty(x, n),1) %% 1 == 0]
gg1b <- ggplot()
gg1b <- gg1b + geom_point(position = position_jitter(width = 0.45,
height = 0.45),
aes(x=streaming_df_B$demographic,
y=streaming_df_B$hours_watched,
colour=(streaming_df_B$gender)))
gg1b <- gg1b + scale_y_continuous(breaks= int_breaks_rounded )
gg1b <- gg1b + labs(x='Demographic', y='Hours Watched',
colour='Gender')
gg1b <- gg1b + ggtitle("Treatment Group B")
grid.arrange(gg1,gg1a,gg1b,nrow = 2)
*According to the below graphs, its evident about the negative relationship between the age and viewing hours.
##Visualize demographic groupings by age
#examine full dataset groupings
int_breaks_rounded <- function(x, n = 10) pretty(x, n)[round(pretty(x, n),1) %% 1 == 0]
gg2 <- ggplot()
gg2 <- gg2 + geom_point(position = position_jitter(width = 0.45,
height = 0.45),
aes(x=streaming_df$age,
y=streaming_df$hours_watched,
colour=(streaming_df$demographic)))
gg2 <- gg2 + scale_y_continuous(breaks= int_breaks_rounded )
gg2 <- gg2 + labs(x='Age', y='Hours Watched',
colour='Demographic')
gg2 <- gg2 + ggtitle("Full dataset")
#examine control group A groupings
int_breaks_rounded <- function(x, n = 10) pretty(x, n)[round(pretty(x, n),1) %% 1 == 0]
gg2a <- ggplot()
gg2a <- gg2a + geom_point(position = position_jitter(width = 0.45,
height = 0.45),
aes(x=streaming_df_A$age,
y=streaming_df_A$hours_watched,
colour=(streaming_df_A$demographic)))
gg2a <- gg2a + scale_y_continuous(breaks= int_breaks_rounded )
gg2a <- gg2a + labs(x='Age', y='Hours Watched',
colour='Demographic')
gg2a <- gg2a + ggtitle("Control Group A")
#examine treatment group B groupings
int_breaks_rounded <- function(x, n = 10) pretty(x, n)[round(pretty(x, n),1) %% 1 == 0]
gg2b <- ggplot()
gg2b <- gg2b + geom_point(position = position_jitter(width = 0.45,
height = 0.45),
aes(x=streaming_df_B$age,
y=streaming_df_B$hours_watched,
colour=(streaming_df_B$demographic)))
gg2b <- gg2b + scale_y_continuous(breaks= int_breaks_rounded )
gg2b <- gg2b + labs(x='Age', y='Hours Watched',
colour='Demographic')
gg2b <- gg2b + ggtitle("Treatment Group B")
grid.arrange(gg2,gg2a,gg2b,nrow = 2)
check_a_df <- streaming_df %>%
filter(group == "A" ) %>%
select(gender,demographic, age) %>%
group_by(gender,demographic, age) %>%
mutate(n_a = n()) %>%
distinct()
check_b_df <- streaming_df %>%
filter(group == "B" ) %>%
select(gender,demographic, age) %>%
group_by(gender,demographic, age) %>%
mutate(n_b = n()) %>%
distinct()
n_total_a <- sum(streaming_df$group=='A')
n_total_b <- sum(streaming_df$group=='B')
check_a_df$p_a <- check_a_df$n_a / n_total_a
check_b_df$p_b <- check_b_df$n_b / n_total_b
check_df <- inner_join(check_a_df, check_b_df)
check_df$diff <- check_df$p_a - check_df$p_b
qqnorm(y=check_df$diff)
##Examine the effect 1. Population estimate for mean and standard deviation
streaming_df_pop <- streaming_df %>% subset(date<="2022-07-17")
streaming_df_pop_mean <- mean(streaming_df_pop$hours_watched)
streaming_df_pop_sd <- sd(streaming_df_pop$hours_watched)
print(paste("Population Mean :", (round(streaming_df_pop_mean,digits=2))))
## [1] "Population Mean : 4.3"
print(paste("Population SD :", (round(streaming_df_pop_sd,digits=2))))
## [1] "Population SD : 1.29"
ga_mean <- mean(streaming_df_A$hours_watched)
ga_d1_mean <- mean(streaming_df_Ad1$hours_watched)
ga_d2_mean <- mean(streaming_df_Ad2$hours_watched)
ga_d3_mean <- mean(streaming_df_Ad3$hours_watched)
ga_d4_mean <- mean(streaming_df_Ad4$hours_watched)
gb_mean <- mean(streaming_df_B$hours_watched)
gb_d1_mean <- mean(streaming_df_Bd1$hours_watched)
gb_d2_mean <- mean(streaming_df_Bd2$hours_watched)
gb_d3_mean <- mean(streaming_df_Bd3$hours_watched)
gb_d4_mean <-mean(streaming_df_Bd4$hours_watched)
print(paste("Mean of Treatment Group B :", (round(gb_mean,digits=2))))
## [1] "Mean of Treatment Group B : 4.81"
print(paste("Mean of Treatment Group B - D1 :", (round(gb_d1_mean,digits=2))))
## [1] "Mean of Treatment Group B - D1 : 5.75"
print(paste("Mean of Treatment Group B - D2 :", (round(gb_d2_mean,digits=2))))
## [1] "Mean of Treatment Group B - D2 : 5.71"
print(paste("Mean of Treatment Group B - D3 :", (round(gb_d3_mean,digits=2))))
## [1] "Mean of Treatment Group B - D3 : 4.22"
print(paste("Mean of Treatment Group B - D4 :", (round(gb_d4_mean,digits=2))))
## [1] "Mean of Treatment Group B - D4 : 4.28"
gab_mean_diff <- ga_mean-gb_mean
gabd1_mean_diff <- ga_d1_mean-gb_d1_mean
gabd2_mean_diff <- ga_d2_mean-gb_d2_mean
gabd3_mean_diff <- ga_d3_mean-gb_d3_mean
gabd4_mean_diff <- ga_d4_mean-gb_d4_mean
print(paste("Mean difference for Treatment Group B :", (round(gab_mean_diff,digits=2))))
## [1] "Mean difference for Treatment Group B : -0.41"
print(paste("Mean difference for Treatment Group B - Demographic 1 :",(round(gabd1_mean_diff,digits=2))))
## [1] "Mean difference for Treatment Group B - Demographic 1 : -0.55"
print(paste("Mean difference for Treatment Group B - Demographic 2:",(round(gabd2_mean_diff,digits=2))))
## [1] "Mean difference for Treatment Group B - Demographic 2: -0.59"
print(paste("Mean difference for Treatment Group B - Demographic 3 :",(round(gabd3_mean_diff,digits=2))))
## [1] "Mean difference for Treatment Group B - Demographic 3 : -0.66"
print(paste("Mean difference for Treatment Group B - Demographic 4 :",(round(gabd4_mean_diff,digits=2))))
## [1] "Mean difference for Treatment Group B - Demographic 4 : -0.59"
gb_sd <- sd(streaming_df_B$hours_watched)
gb_d1_sd <- sd(streaming_df_Bd1$hours_watched)
gb_d2_sd <- sd(streaming_df_Bd2$hours_watched)
gb_d3_sd <- sd(streaming_df_Bd3$hours_watched)
gb_d4_sd <-sd(streaming_df_Bd4$hours_watched)
mss_gb <- (1.96 * gb_sd / gab_mean_diff)^2
mss_gb_d1 <- (1.96 * gb_d1_sd / gabd1_mean_diff)^2
mss_gb_d2 <- (1.96 * gb_d2_sd / gabd2_mean_diff)^2
mss_gb_d3 <- (1.96 * gb_d3_sd / gabd3_mean_diff)^2
mss_gb_d4 <- (1.96 * gb_d4_sd / gabd4_mean_diff)^2
mss_gb <- round(mss_gb,digits=0)
mss_gb_d1 <- round(mss_gb_d1,digits=0)
mss_gb_d2 <-round(mss_gb_d2,digits=0)
mss_gb_d3 <-round(mss_gb_d3,digits=0)
mss_gb_d4 <-round(mss_gb_d4,digits=0)
sample_gb <- length(streaming_df_B$hours_watched)
sample_gb_d1 <- length(streaming_df_Bd1$hours_watched)
sample_gb_d2 <- length(streaming_df_Bd2$hours_watched)
sample_gb_d3 <- length(streaming_df_Bd3$hours_watched)
sample_gb_d4 <- length(streaming_df_Bd4$hours_watched)
print(paste("Treatment Group B sample size",sample_gb,"> Minimum required sample", mss_gb,"So the test is valid"))
## [1] "Treatment Group B sample size 120 > Minimum required sample 41 So the test is valid"
print(paste("Treatment Group B Demographic 1 sample",sample_gb_d1,"> Minimum required sample", mss_gb_d1,
"the test is valid"))
## [1] "Treatment Group B Demographic 1 sample 13 > Minimum required sample 6 the test is valid"
print(paste("Treatment Group B Demographic 2 sample",sample_gb_d2,"> Minimum required sample", mss_gb_d2,
"the test is valid"))
## [1] "Treatment Group B Demographic 2 sample 32 > Minimum required sample 13 the test is valid"
print(paste("Treatment Group B Demographic 3 sample",sample_gb_d3,"< Minimum required sample", mss_gb_d3,
"the test is *NOT* valid"))
## [1] "Treatment Group B Demographic 3 sample 16 < Minimum required sample 16 the test is *NOT* valid"
print(paste("Treatment Group B Demographic 4 sample",sample_gb_d4,"> Minimum required sample", mss_gb_d4,
"the test is valid"))
## [1] "Treatment Group B Demographic 4 sample 59 > Minimum required sample 15 the test is valid"
z.test(streaming_df_B$hours_watched,mu=streaming_df_pop_mean,alternative = c("greater"),sigma.x =streaming_df_pop_sd, conf.level = 0.95)
##
## One-sample z-Test
##
## data: streaming_df_B$hours_watched
## z = 4.3696, p-value = 6.224e-06
## alternative hypothesis: true mean is greater than 4.296259
## 95 percent confidence interval:
## 4.617156 NA
## sample estimates:
## mean of x
## 4.810875
z.test(streaming_df_Bd1$hours_watched,mu=streaming_df_pop_mean,alternative = c("greater"),sigma.x =streaming_df_pop_sd, conf.level = 0.95)
##
## One-sample z-Test
##
## data: streaming_df_Bd1$hours_watched
## z = 4.0499, p-value = 2.562e-05
## alternative hypothesis: true mean is greater than 4.296259
## 95 percent confidence interval:
## 5.156825 NA
## sample estimates:
## mean of x
## 5.745385
z.test(streaming_df_Bd2$hours_watched,mu=streaming_df_pop_mean,alternative = c("greater"),sigma.x =streaming_df_pop_sd, conf.level = 0.95)
##
## One-sample z-Test
##
## data: streaming_df_Bd2$hours_watched
## z = 6.1824, p-value = 3.157e-10
## alternative hypothesis: true mean is greater than 4.296259
## 95 percent confidence interval:
## 5.331115 NA
## sample estimates:
## mean of x
## 5.70625
z.test(streaming_df_Bd3$hours_watched,mu=streaming_df_pop_mean,alternative = c("greater"),sigma.x =streaming_df_pop_sd, conf.level = 0.95)
##
## One-sample z-Test
##
## data: streaming_df_Bd3$hours_watched
## z = -0.2345, p-value = 0.5927
## alternative hypothesis: true mean is greater than 4.296259
## 95 percent confidence interval:
## 3.690105 NA
## sample estimates:
## mean of x
## 4.220625
z.test(streaming_df_Bd4$hours_watched,mu=streaming_df_pop_mean,alternative = c("greater"),sigma.x =streaming_df_pop_sd, conf.level = 0.95)
##
## One-sample z-Test
##
## data: streaming_df_Bd4$hours_watched
## z = -0.10033, p-value = 0.54
## alternative hypothesis: true mean is greater than 4.296259
## 95 percent confidence interval:
## 4.003135 NA
## sample estimates:
## mean of x
## 4.279407
streaming_df_B %>% group_by(demographic,gender) %>% summarise( n = n(),
Min_Age = min(age,na.rm = TRUE),
Max_Age = max(age,na.rm = TRUE),
Mean_hrs_watch =mean(hours_watched,na.rm = TRUE),
SD_hrs_watch = sd(hours_watched, na.rm = TRUE))
Demographic 1 : Females aged 18-34 years recorded a mean of 5.75 hours which is greater than the population mean of 4.3 hours. According to the z- test, p-value(0.001) < significance level (0.05) .Therefore the treatment group D1 mean hours are significantly higher than the population mean hours.
Demographic 2 : Males aged 18-35 years recorded a mean of 5.71 hours which is greater than the population mean of 4.3 hours. According to the z- test, p-value(0.001) < significance level (0.05) .Therefore the treatment group D2 mean hours are significantly higher than the population mean hours.
Demographic 3 : Females aged 38-55 years recorded a mean of 4.22 hours which is almost similar to the population mean of 4.3 hours. According to the z- test, p-value(0.5927) > significance level (0.05) .Therefore the treatment group D3 mean hours and the population mean hours do not have a significant difference..
Demographic 4 : Males aged 36-55 years recorded a mean of 4.28 hours which is almost similar to the population mean of 4.3 hours. According to the z- test, p-value( 0.54) > significance level (0.05) .Therefore the treatment group D4 mean hours and the population mean hours do not have a significant difference..
model1 <- lm(hours_watched~age,streaming_df)
model1
##
## Call:
## lm(formula = hours_watched ~ age, data = streaming_df)
##
## Coefficients:
## (Intercept) age
## 7.00012 -0.07145
summary(model1)
##
## Call:
## lm(formula = hours_watched ~ age, data = streaming_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5696 -0.7474 0.0021 0.7706 2.9391
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.000118 0.123038 56.89 <2e-16 ***
## age -0.071453 0.003236 -22.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.093 on 998 degrees of freedom
## Multiple R-squared: 0.3282, Adjusted R-squared: 0.3275
## F-statistic: 487.5 on 1 and 998 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model1)
par(mfrow =c(1,1))
Linearity - according to the above plot (Residual Vs Fitted), displays that the relationship between fitted values and residuals is flat (demonstrated by the flat red line). This indicates a linear relationship is being modeled.
Normality of Residulas - the Q- Q plot suggests there are no major deviations from normality. It would be safe to assume the residuals are approximately normally distributed.
Homoscedasticity - Residuals v Fitted Values plot indicates that the variance appears to remain the same especially given that red line looks quite straight, Homoscedasticity can therefore be assumed. The Scale-Location plot also indicates that homoscedasticity can be assumed given the red line appears relatively flat and the variance in the square root of the standardized residuals looks to be consistent across predicted (fitted values).
Outliers - Residuals vs Leverage plot isn’t showing any extreme outliers that are unduly influencing the model.
model2 <- lm(hours_watched~age+gender+social_metric+time_since_signup+demographic, streaming_df)
model2
##
## Call:
## lm(formula = hours_watched ~ age + gender + social_metric + time_since_signup +
## demographic, data = streaming_df)
##
## Coefficients:
## (Intercept) age genderF social_metric
## 6.173435 -0.063715 0.122741 0.097813
## time_since_signup demographic2 demographic3 demographic4
## 0.002633 0.122707 -0.249048 NA
summary(model2)
##
## Call:
## lm(formula = hours_watched ~ age + gender + social_metric + time_since_signup +
## demographic, data = streaming_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6505 -0.6608 -0.0184 0.7068 2.7745
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.173435 0.299196 20.633 <2e-16 ***
## age -0.063715 0.006128 -10.397 <2e-16 ***
## genderF 0.122741 0.145012 0.846 0.398
## social_metric 0.097813 0.011165 8.761 <2e-16 ***
## time_since_signup 0.002633 0.004628 0.569 0.570
## demographic2 0.122707 0.142342 0.862 0.389
## demographic3 -0.249048 0.153499 -1.622 0.105
## demographic4 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.054 on 993 degrees of freedom
## Multiple R-squared: 0.3785, Adjusted R-squared: 0.3747
## F-statistic: 100.8 on 6 and 993 DF, p-value: < 2.2e-16
model3 <- lm(hours_watched~age+social_metric, streaming_df)
model3
##
## Call:
## lm(formula = hours_watched ~ age + social_metric, data = streaming_df)
##
## Coefficients:
## (Intercept) age social_metric
## 6.49661 -0.07079 0.09757
summary(model3)
##
## Call:
## lm(formula = hours_watched ~ age + social_metric, data = streaming_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7039 -0.6636 -0.0157 0.7398 2.8434
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.496610 0.131812 49.29 <2e-16 ***
## age -0.070786 0.003121 -22.68 <2e-16 ***
## social_metric 0.097573 0.011139 8.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.054 on 997 degrees of freedom
## Multiple R-squared: 0.3762, Adjusted R-squared: 0.3749
## F-statistic: 300.6 on 2 and 997 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model3)
par(mfrow =c(1,1))
Linearity - according to the above plot (Residual Vs Fitted), displays that the relationship between fitted values and residuals is flat (demonstrated by the flat red line). This indicates a linear relationship is being modeled.
Normality of Residuals - the Q- Q plot suggests there are no major deviations from normality. It would be safe to assume the residuals are approximately normally distributed.
Homoscedasticity - Residuals v Fitted Values plot indicates that the variance appears to remain the same especially given that red line looks quite straight, Homoscedasticity can therefore be assumed. The Scale-Location plot also indicates that homoscedasticity can be assumed given the red line appears relatively flat and the variance in the square root of the standardized residuals looks to be consistent across predicted (fitted values).
Outliers - Residuals vs Leverage plot isn’t showing any extreme outliers that are unduly influencing the model. Found some outliers in the hours watched variable when I was processing the data but dint treat them and going to run the model III without any outliers and see if it makes any considerable difference.
summary (streaming_df$hours_watched)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.500 3.530 4.415 4.393 5.322 8.300
##boxplot_with_outliers <- boxplot (streaming_df$hours_watched,xlab="Hours Watched",main=" Hours Watched with Outliers")
streaming_no_outliers <- streaming_df %>% filter(hours_watched < 8 & hours_watched > 0.8)
boxplot_no_outliers <- boxplot(streaming_no_outliers$hours_watched,xlab="Hours Watched",main="Hours Watched without Outliers")
##### Running the model III as in model IV with the cleaned dataset to
see if there’s any significant change
model4 <- lm(hours_watched~age+social_metric, streaming_no_outliers)
model4
##
## Call:
## lm(formula = hours_watched ~ age + social_metric, data = streaming_no_outliers)
##
## Coefficients:
## (Intercept) age social_metric
## 6.45653 -0.06895 0.09303
summary(model4)
##
## Call:
## lm(formula = hours_watched ~ age + social_metric, data = streaming_no_outliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6687 -0.6661 -0.0296 0.7232 2.8339
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.456530 0.130228 49.58 <2e-16 ***
## age -0.068953 0.003096 -22.27 <2e-16 ***
## social_metric 0.093034 0.011023 8.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.04 on 991 degrees of freedom
## Multiple R-squared: 0.3667, Adjusted R-squared: 0.3654
## F-statistic: 286.8 on 2 and 991 DF, p-value: < 2.2e-16
The AB test results imply the fact that it is worth rolling out the new recommendation engine algorithm to WNW viewers given the statistically significant higher mean hours watched for Treatment Group B compared to Control Group A. This suggest that the algorithm appears to have increased viewer engagement measured by the increase in hours watched overall. Also we can see how the algorithm have influenced on viewing habits for different demographic groups based on age and gender.
Further we can emphasizes the fact that algorithm had the strongest impact on younger viewers aged 35 years and under especially for younger females (5.75 hours) whose viewing hours was slightly higher than younger males (5.71 hours). Further test should be conducted to determine if this relative difference between these two subgroups are valid.
The AB test results indicate that the algorithm did not influence on older male viewers aged over 36 years given that the mean hours was not statistically significantly higher than that of the control group.
A linear regression test conducted on age as the predictor variable and hours_watched as the response variable. Although there was statistically significant evidence that the data fitted the model, the R-squared indicated that age only accounted for around 32.8% of variabilty in hours watched. However, the slope of this test indicated statistically significant evidence that age is negatively related to hours watched and this matches with the negative relationship highlighted in the plots created to check the characteristics of the variables.All the assumptions were validated for the model.
Multiple regression analysis was conducted using age and social metrics as the predictor variables and hours watched as the response variable, This test validated the fact that that overall there was statistically significant evidence that the data fits a linear regression model for the two predictor variables of age and social metric scores. However, the R-squared score indicated that the combination of these two predictor variables only explain about 37.6% of the variability in hours watched. The slope coefficient for age indicated a statistically significant negative relationship with hours watched and there was statistically significant evidence for a positive relationship for most of the individual social metric scores.
This could depict that people with a higher the social metric score potentially watch more compared to people with lower scores. There is further investigations that would need to be performed for the linear regression test in terms of investigation the relative sample proportions of viewers in both the control and treatment groups.
Some of the major advantages of this study were the fairly large sample size of about 548 viewers in the control and treatment groups during the campaign which lends further weight to the validity of the results.
The subgroups were clearly identified in the data and present future opportunities for further investigations into hours watched for younger and older cohorts of viewers.
The foremost limitation to this investigation was the limited sample size especially for demographic group 3 (older females). Therefore, In future, a much larger sample size would give more precise results in terms of estimating the true change in hours watched for the entire viewing population and would also give more scope to test within subgroups as well given that minimum sample sizes are likely to be easily met.
The dataset was limited also to only the month of July which may have also skewed the data somewhat due to season factors such as winter and the start of semester 2 at university for students and also starting of schools term 3 after holidays. As such, running the campaign over a much longer period of time would perhaps give a more balanced result in terms of seasonal factors throughout the year. I suggest a six month period stretching from the middle of the winter to the middle of summer could capture more comprehensive dataset.
Another potential limitation of the investigation could be the gender imbalance that was reflected in the overall population dataset and then simulated in the samples. However employing a stratified sampling technique could minimize the gender imbalance.
.* In future increasing the sample size and running a longer campaign is high costly but considering the positive feedbacks received imply that allocating a budget for these enhancements are justifiable.
It is acceptable to conclude that, new recommendation engine algorithm is worth rolling out especially for the younger viewers under the age of 36.
We couldn’t come up with a proper test results for the older viewers as the female test was not valid due to insufficient sample size and the male viewers test was not statistically significant. Therefore it will be worthwhile to go through some further investigation regarding this criteria as there’s a negative relationship between the age and watched hours.
Further investigations should be made into the relationships between social metric scores, age and hours watched so that we might be able to better predict viewing habits and better target or increase potential advertising revenue from sponsors
It is very well advised to run another campaign with a larger sample size and over a longer period of time to actually test the viewing habits of the subgroups and to achieve more accurate results and benefits.
Higher longer time hours watched and better predicting viewing habits relate to a growing customer base and more advertising revenue from highly recognized sponsors but need to outweigh the higher cost of larger sample sizes and longer running campaigns.
Sheenal Srivastava (2020),Towards Data Science: Towards Data Science, accessed 10 October 2022.https://medium.com/@sheenalsrivastava
Phillip Cleave (2019),Smart Survey: Ways to avoid sampling bias in surveys, accessed 11 October 2022. https://www.smartsurvey.co.uk/blog/ways-to-avoid-sampling-bias-in-surveys