Recommendation Engine Algorithm Analysis

Introduction

Why not watch is keen to remain sustainable in this competitive streaming market industry by experimenting new methodologies to increase viewer engagement (hours) in order to maximize advertising revenue and expand the customer base.
One of the company’s strategies involves regularly changing the algorithm in the recommendation engine so it can dynamically suggest shows to viewers based on their viewing habits.
The executives in WNW has handed over this project to its analytical team to find out if the new algorithm is worth rolling to their subscribers.
This report details the findings of that analysis.
Analytical team has performed various statistical tests on the full dataset, the control group and treatment group respectively to distinguish if any positive impact has triggered for the treatment group with the new algorithm.

Problem Statement

State the overall problem/question driving the investigation

Should the WNW Streaming Company launch the new recommendation engine algorithm to all their subscribers.

Summaries how you will use statistics to solve the problem or answer your question

I will be using below techniques to resolved the question

Collecting the data from customers and checking if any data manipulation is required.
Cleaning and testing for Outliers.
Sub-setting the data to 3 groups as below

** Pre-campaign data to be used for establishing a population mean and standard deviation ** Group B data for determining if the recommendation engine has in fact worked ** Group A campaign data as the comparative baseline to measure
Identifying the distribution and key trends in the data via descriptive statistic and relevant plots or multivariate graphs
Ascertain the sample sizes for each demographic groups
Then based on the sample sizes which is if the sample size is > minimum sample size, conducting hypothesis testing to determine the recommendation engine changes impact
Linear and multi-linear regression to determine key predictors that may also boost customer engagement and viewing

Data

Loading the data

streaming_dfnew<-read.csv("C:/data/streaming_data.csv")
head(streaming_dfnew,3)

List and explain the important variables

In this area I will be just describing all the variables and further below an investigation is done to identify the most important variables of this dataset.

Loaded the entire dataset and noted there are 1000 viewers.
The key variable monitored was hours_watched to indicate if viewer engagement increased as a result of the recommendation engine changes.
Demographic and Group variables will assist with originating the samples for A/B Testing
Age and Gender will support analyzing the viewer’s demographic details.
Social metric can be utilized to assess the previous viewing habits of the viewers.

If you have factors, explain each level

Converting the gender variable to factor variable and assigning levels - M(Male) & F (Female).
Converting the demographic variable to factor variable and assigning levels - 1 ,2 ,3, 4
Converting the group variable to factor variable and assigning levels - A ,B

streaming_dfnew$gender <- as.factor(streaming_dfnew$gender)
streaming_dfnew$gender <- factor (c(streaming_dfnew$gender),levels = c("M","F"),labels=c("M","F"))
streaming_dfnew$demographic <- as.factor(streaming_dfnew$demographic)
streaming_dfnew$demographic <- factor (c(streaming_dfnew$demographic),levels = c("1","2","3","4"),labels=c("1","2","3","4"))
streaming_dfnew$group <- as.factor(streaming_dfnew$group)
streaming_dfnew$group <- factor (c(streaming_dfnew$group),levels = c("A","B"),labels=c("A","B"))

*Create Year only column.

vec <- rep(c(2022),1000)
streaming_dfclean <-streaming_dfnew %>% 
  mutate(newcol = vec) %>% 
  rename (date2 = "date")

*Concat the date column

streaming_dfcon <- streaming_dfclean
streaming_dfcon$date3 <- paste(streaming_dfcon$date2, streaming_dfcon$newcol,sep = "-")

Explain the scale of numeric variables

Numerical Variables

Date - Converted date variable to R date type

streaming_dfcon$newdate<- dmy(streaming_dfcon$date3)
streaming_df <- streaming_dfcon %>% 
  select(date ='newdate',gender,age,social_metric,time_since_signup,demographic,group,hours_watched)

str(streaming_df)

## 'data.frame':    1000 obs. of  8 variables:
##  $ date             : Date, format: "2022-07-01" "2022-07-01" ...
##  $ gender           : Factor w/ 2 levels "M","F": 2 2 2 1 1 1 2 1 1 1 ...
##  $ age              : int  28 32 39 52 25 51 53 42 41 20 ...
##  $ social_metric    : int  5 7 4 10 1 0 5 6 8 7 ...
##  $ time_since_signup: num  19.3 11.5 4.3 9.5 19.5 22.6 4.2 8.5 16.9 23 ...
##  $ demographic      : Factor w/ 4 levels "1","2","3","4": 1 1 3 4 2 4 3 4 4 2 ...
##  $ group            : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
##  $ hours_watched    : num  4.08 2.99 5.74 4.13 4.68 3.4 3.07 2.77 2.24 5.39 ...

Age of the customer (years)
Number of months since the customer sign up - data is in decimal months-
Number of hours watched that day

Categorical Variables

Group viewers in the dataset by age and gender
Demographic categories Demographic 1: Younger females aged 18 to 35 years Demographic 2: Younger males aged 18 to 35 years Demographic 3: Older females aged 36 to 55 years Demographic 4: Older males aged 36 to 55 years
Group (A/B) Group A is the control:880 viewers (includes group A viewers pre and post campaign launch) Group B is the treated group: 120 (includes viewers exposed to the recommendation engine changes during the campaign)

Analysis

Summarise the important variables in your investigation and Use visualization to highlight interesting features of the data and tell the overall story

Identifying the Key Variables and features

*Below plots are generated to find out the variables and its characteristics.

qplot1 <- qplot(streaming_df$hours_watched,main="Hours Watched",fill=streaming_df$gender,xlab
="Hours Watched",ylab="Count")+
 labs(fill = 'Gender')
qplot2 <-qplot(streaming_df$gender,main="Gender", fill = streaming_df$gender,xlab="Gender",ylab="Count")+
 labs(fill = 'Gender')
qplot3 <-qplot(streaming_df$time_since_signup,main="Time Since Signup Distribution",fill=streaming_df$gender,xlab="Time Since Sign-Up",ylab="Count")+
 labs(fill = 'Gender')
 qplot4 <-qplot(streaming_df$age, streaming_df$hours_watched,main="Age vs Gender vs Hours Watched",colour=
streaming_df$gender,xlab="Age",ylab="Hours Watched")+ labs(colour = 'Gender')
grid.arrange(qplot1,qplot2,qplot3,qplot4,nrow=2)

qplot5 <-qplot(streaming_df$social_metric, streaming_df$hours_watched,main="Social metric vs Hours Watched vs Gender",colour=
streaming_df$gender,xlab="Social Metric",ylab="Hours Watched")+ labs(colour = 'Gender') 
qplot6 <-qplot(streaming_df$demographic, streaming_df$hours_watched,main="Demographic vs Hours Watched vs Gender",colour=
streaming_df$gender,xlab="Demographic",ylab="Hours Watched")+ labs(colour = 'Gender') 
grid.arrange(qplot5 ,qplot6,nrow=1)

plot7 <- boxplot(streaming_df$hours_watched,xlab="Hours Watched)",
 main="Hours Watched")

There are several key variables that are linked via the sampling process.

According to the first graph,Hours Watched is the primary target variable and it looks like normally distributed across the whole dataset.
Second graph of Gender is another important variable and we can see that there’s a slight imbalanced across the dataset with males outnumbering females.
Age variable is plotted in the 4th graph and portrays that, Age is fairly distributed across the whole dataset. Further we can see a negative relationship between age and hours watched as the older customers watch lesser hours. However there’s no influence in the Gender with the hours viewing.
Social Metric variable in 5th graph displays that social metric is fairly distributed among the dataset.
According to the 6th graph, each demographic groups do not have mixed genders.
As per the boxplot graph,we can view that hours watched variable has some outliers, will check if that needs to be removed with the model.

Is there any bias in the data collection

According to the above analysis, it is visible that there’s gender imbalance and the dataset is Gender bias.
This could skew our test results and need to look into how this issue could be minimized when generating sampling methodologies.

Sampling methodology:

##Segment the data

##population data - pre-launch campaign data

streaming_df_pop <- streaming_df %>% subset(date<="2022-07-17")
streaming_df_pop_d1 <- streaming_df_pop %>% subset(demographic=="1")
streaming_df_pop_d2 <- streaming_df_pop%>% subset(demographic=="2")
streaming_df_pop_d3 <- streaming_df_pop %>% subset(demographic=="3")
streaming_df_pop_d4 <- streaming_df_pop %>% subset(demographic=="4")

##split by group and demographic * Group A after the launch which is from 18th July.

streaming_df_A <- streaming_df %>% subset(group=="A" & date >"2022-07-17")
streaming_df_Ad1 <- streaming_df_A %>% subset(demographic=="1")
streaming_df_Ad2 <- streaming_df_A %>% subset(demographic=="2")
streaming_df_Ad3 <- streaming_df_A %>% subset(demographic=="3")
streaming_df_Ad4 <- streaming_df_A %>% subset(demographic=="4")

Group B after the launch which is from 18th July.

streaming_df_B <- streaming_df %>% subset(group=="B" & date>"2022-07-17")
streaming_df_Bd1 <- streaming_df_B %>% subset(demographic=="1")
streaming_df_Bd2 <- streaming_df_B %>% subset(demographic=="2")
streaming_df_Bd3 <- streaming_df_B %>% subset(demographic=="3")
streaming_df_Bd4 <- streaming_df_B %>% subset(demographic=="4")

Further to the above sampling, employed a stratified sampling technique using gender and age.

How could any bias be corrected?

A stratified sampling technique has been used to minimize the gender bias in this dataset.
Which means stratified sampling helps retain the complete variety of the population in the sample.This has been displayed in the below graphs.

##Visualize demographic groupings by gender

#examine full dataset groupings

int_breaks_rounded <- function(x, n = 10) pretty(x, n)[round(pretty(x, n),1) %% 1 == 0]
gg1 <- ggplot()
gg1 <- gg1 + geom_point(position = position_jitter(width = 0.45,
 height = 0.45), 
 aes(x=streaming_df$demographic,
 y=streaming_df$hours_watched,
 colour=(streaming_df$gender)))
gg1 <- gg1 + scale_y_continuous(breaks= int_breaks_rounded )
gg1 <- gg1 + labs(x='Demographic', y='Hours Watched',
 colour='Gender')
gg1 <- gg1 + ggtitle("Full dataset")

#examine control group A groupings
int_breaks_rounded <- function(x, n = 10) pretty(x, n)[round(pretty(x, n),1) %% 1 == 0]
gg1a <- ggplot()
gg1a <- gg1a + geom_point(position = position_jitter(width = 0.45,
 height = 0.45), 
 aes(x=streaming_df_A$demographic,
 y=streaming_df_A$hours_watched,
 colour=(streaming_df_A$gender)))
gg1a <- gg1a + scale_y_continuous(breaks= int_breaks_rounded )
gg1a <- gg1a + labs(x='Demographic', y='Hours Watched',
 colour='Gender')
gg1a <- gg1a + ggtitle("Control Group A")

#examine treatment group B groupings
int_breaks_rounded <- function(x, n = 10) pretty(x, n)[round(pretty(x, n),1) %% 1 == 0]
gg1b <- ggplot()
gg1b <- gg1b + geom_point(position = position_jitter(width = 0.45,
 height = 0.45), 
 aes(x=streaming_df_B$demographic,
 y=streaming_df_B$hours_watched,
 colour=(streaming_df_B$gender)))
gg1b <- gg1b + scale_y_continuous(breaks= int_breaks_rounded )
gg1b <- gg1b + labs(x='Demographic', y='Hours Watched',
 colour='Gender')
gg1b <- gg1b + ggtitle("Treatment Group B")
grid.arrange(gg1,gg1a,gg1b,nrow = 2)

*According to the below graphs, its evident about the negative relationship between the age and viewing hours.

##Visualize demographic groupings by age
#examine full dataset groupings

int_breaks_rounded <- function(x, n = 10) pretty(x, n)[round(pretty(x, n),1) %% 1 == 0]
gg2 <- ggplot()
gg2 <- gg2 + geom_point(position = position_jitter(width = 0.45,
 height = 0.45), 
 aes(x=streaming_df$age,
 y=streaming_df$hours_watched,
 colour=(streaming_df$demographic)))
gg2 <- gg2 + scale_y_continuous(breaks= int_breaks_rounded )
gg2 <- gg2 + labs(x='Age', y='Hours Watched',
 colour='Demographic')
gg2 <- gg2 + ggtitle("Full dataset")

#examine control group A groupings

int_breaks_rounded <- function(x, n = 10) pretty(x, n)[round(pretty(x, n),1) %% 1 == 0]
gg2a <- ggplot()
gg2a <- gg2a + geom_point(position = position_jitter(width = 0.45,
 height = 0.45), 
 aes(x=streaming_df_A$age,
 y=streaming_df_A$hours_watched,
 colour=(streaming_df_A$demographic)))
gg2a <- gg2a + scale_y_continuous(breaks= int_breaks_rounded )
gg2a <- gg2a + labs(x='Age', y='Hours Watched',
 colour='Demographic')
gg2a <- gg2a + ggtitle("Control Group A")

#examine treatment group B groupings

int_breaks_rounded <- function(x, n = 10) pretty(x, n)[round(pretty(x, n),1) %% 1 == 0]
gg2b <- ggplot()
gg2b <- gg2b + geom_point(position = position_jitter(width = 0.45,
 height = 0.45), 
 aes(x=streaming_df_B$age,
 y=streaming_df_B$hours_watched,
 colour=(streaming_df_B$demographic)))
gg2b <- gg2b + scale_y_continuous(breaks= int_breaks_rounded )
gg2b <- gg2b + labs(x='Age', y='Hours Watched',
 colour='Demographic')
gg2b <- gg2b + ggtitle("Treatment Group B")
grid.arrange(gg2,gg2a,gg2b,nrow = 2)

Checking Group Balance

As we didn’t setup these groups ourselves it is always best to double check that they represent the total population.

count the numbers in each demographic category based on the A/B group

check_a_df <- streaming_df %>%
filter(group == "A"  ) %>%
select(gender,demographic, age)  %>%
group_by(gender,demographic, age) %>%
mutate(n_a = n()) %>%
distinct()

check_b_df <- streaming_df %>%
filter(group == "B"  ) %>%
select(gender,demographic, age)  %>%
group_by(gender,demographic, age) %>%
mutate(n_b = n()) %>%
distinct()

total numbers in each group

n_total_a <- sum(streaming_df$group=='A')
n_total_b <- sum(streaming_df$group=='B')

proportions in each demographic

check_a_df$p_a <- check_a_df$n_a / n_total_a
check_b_df$p_b <- check_b_df$n_b / n_total_b

join on demographic categories

check_df <- inner_join(check_a_df, check_b_df)

calculate the difference in proportions

check_df$diff <- check_df$p_a - check_df$p_b

if there is no bias aside from sampling noise then the difference should be small and normally distributed

qqnorm(y=check_df$diff)

##Examine the effect 1. Population estimate for mean and standard deviation

streaming_df_pop <- streaming_df %>% subset(date<="2022-07-17")
streaming_df_pop_mean <- mean(streaming_df_pop$hours_watched)
streaming_df_pop_sd <- sd(streaming_df_pop$hours_watched)
print(paste("Population Mean :", (round(streaming_df_pop_mean,digits=2))))

## [1] "Population Mean : 4.3"

print(paste("Population SD :", (round(streaming_df_pop_sd,digits=2))))

## [1] "Population SD : 1.29"

The effect between treatment and control group for each chosen demographic

ga_mean <- mean(streaming_df_A$hours_watched)
ga_d1_mean <- mean(streaming_df_Ad1$hours_watched)
ga_d2_mean <- mean(streaming_df_Ad2$hours_watched)
ga_d3_mean <- mean(streaming_df_Ad3$hours_watched)
ga_d4_mean <- mean(streaming_df_Ad4$hours_watched)
gb_mean <- mean(streaming_df_B$hours_watched)
gb_d1_mean <- mean(streaming_df_Bd1$hours_watched)
gb_d2_mean <- mean(streaming_df_Bd2$hours_watched)
gb_d3_mean <- mean(streaming_df_Bd3$hours_watched)
gb_d4_mean <-mean(streaming_df_Bd4$hours_watched)

print(paste("Mean of Treatment Group B :", (round(gb_mean,digits=2))))

## [1] "Mean of Treatment Group B : 4.81"

print(paste("Mean of Treatment Group B - D1 :", (round(gb_d1_mean,digits=2))))

## [1] "Mean of Treatment Group B - D1 : 5.75"

print(paste("Mean of Treatment Group B - D2 :", (round(gb_d2_mean,digits=2))))

## [1] "Mean of Treatment Group B - D2 : 5.71"

print(paste("Mean of Treatment Group B - D3 :", (round(gb_d3_mean,digits=2))))

## [1] "Mean of Treatment Group B - D3 : 4.22"

print(paste("Mean of Treatment Group B - D4 :", (round(gb_d4_mean,digits=2))))

## [1] "Mean of Treatment Group B - D4 : 4.28"

gab_mean_diff <- ga_mean-gb_mean
gabd1_mean_diff <- ga_d1_mean-gb_d1_mean
gabd2_mean_diff <- ga_d2_mean-gb_d2_mean
gabd3_mean_diff <- ga_d3_mean-gb_d3_mean
gabd4_mean_diff <- ga_d4_mean-gb_d4_mean

print(paste("Mean difference for Treatment Group B :", (round(gab_mean_diff,digits=2))))

## [1] "Mean difference for Treatment Group B : -0.41"

print(paste("Mean difference for Treatment Group B - Demographic 1 :",(round(gabd1_mean_diff,digits=2))))

## [1] "Mean difference for Treatment Group B - Demographic 1 : -0.55"

print(paste("Mean difference for Treatment Group B - Demographic 2:",(round(gabd2_mean_diff,digits=2))))

## [1] "Mean difference for Treatment Group B - Demographic 2: -0.59"

print(paste("Mean difference for Treatment Group B - Demographic 3 :",(round(gabd3_mean_diff,digits=2))))

## [1] "Mean difference for Treatment Group B - Demographic 3 : -0.66"

print(paste("Mean difference for Treatment Group B - Demographic 4 :",(round(gabd4_mean_diff,digits=2))))

## [1] "Mean difference for Treatment Group B - Demographic 4 : -0.59"

Calculate the minimum sample size (mss) for each demographic and check to see if the sample sizes for the demographics for treatment > mss

gb_sd <- sd(streaming_df_B$hours_watched)
gb_d1_sd <- sd(streaming_df_Bd1$hours_watched)
gb_d2_sd <- sd(streaming_df_Bd2$hours_watched)
gb_d3_sd <- sd(streaming_df_Bd3$hours_watched)
gb_d4_sd <-sd(streaming_df_Bd4$hours_watched)

mss_gb <- (1.96 * gb_sd / gab_mean_diff)^2
mss_gb_d1 <- (1.96 * gb_d1_sd / gabd1_mean_diff)^2
mss_gb_d2 <- (1.96 * gb_d2_sd / gabd2_mean_diff)^2
mss_gb_d3 <- (1.96 * gb_d3_sd / gabd3_mean_diff)^2
mss_gb_d4 <- (1.96 * gb_d4_sd / gabd4_mean_diff)^2

mss_gb <- round(mss_gb,digits=0)
mss_gb_d1 <- round(mss_gb_d1,digits=0)
mss_gb_d2 <-round(mss_gb_d2,digits=0)
mss_gb_d3 <-round(mss_gb_d3,digits=0)
mss_gb_d4 <-round(mss_gb_d4,digits=0)

sample_gb <- length(streaming_df_B$hours_watched)
sample_gb_d1 <- length(streaming_df_Bd1$hours_watched)
sample_gb_d2 <- length(streaming_df_Bd2$hours_watched)
sample_gb_d3 <- length(streaming_df_Bd3$hours_watched)
sample_gb_d4 <- length(streaming_df_Bd4$hours_watched)

print(paste("Treatment Group B sample size",sample_gb,"> Minimum required sample", mss_gb,"So the test is valid"))

## [1] "Treatment Group B sample size 120 > Minimum required sample 41 So the test is valid"

print(paste("Treatment Group B Demographic 1 sample",sample_gb_d1,"> Minimum required sample", mss_gb_d1,
"the test is valid"))

## [1] "Treatment Group B Demographic 1 sample 13 > Minimum required sample 6 the test is valid"

print(paste("Treatment Group B Demographic 2 sample",sample_gb_d2,"> Minimum required sample", mss_gb_d2,
"the test is valid"))

## [1] "Treatment Group B Demographic 2 sample 32 > Minimum required sample 13 the test is valid"

print(paste("Treatment Group B Demographic 3 sample",sample_gb_d3,"< Minimum required sample", mss_gb_d3,
"the test is *NOT* valid"))

## [1] "Treatment Group B Demographic 3 sample 16 < Minimum required sample 16 the test is *NOT* valid"

print(paste("Treatment Group B Demographic 4 sample",sample_gb_d4,"> Minimum required sample", mss_gb_d4,
"the test is valid"))

## [1] "Treatment Group B Demographic 4 sample 59 > Minimum required sample 15 the test is valid"

4. Perform statistical test on each demographic where you are testing the treatment mean against the population mean

z.test(streaming_df_B$hours_watched,mu=streaming_df_pop_mean,alternative = c("greater"),sigma.x =streaming_df_pop_sd, conf.level = 0.95)

## 
##  One-sample z-Test
## 
## data:  streaming_df_B$hours_watched
## z = 4.3696, p-value = 6.224e-06
## alternative hypothesis: true mean is greater than 4.296259
## 95 percent confidence interval:
##  4.617156       NA
## sample estimates:
## mean of x 
##  4.810875

z.test(streaming_df_Bd1$hours_watched,mu=streaming_df_pop_mean,alternative = c("greater"),sigma.x =streaming_df_pop_sd, conf.level = 0.95)

## 
##  One-sample z-Test
## 
## data:  streaming_df_Bd1$hours_watched
## z = 4.0499, p-value = 2.562e-05
## alternative hypothesis: true mean is greater than 4.296259
## 95 percent confidence interval:
##  5.156825       NA
## sample estimates:
## mean of x 
##  5.745385

z.test(streaming_df_Bd2$hours_watched,mu=streaming_df_pop_mean,alternative = c("greater"),sigma.x =streaming_df_pop_sd, conf.level = 0.95)

## 
##  One-sample z-Test
## 
## data:  streaming_df_Bd2$hours_watched
## z = 6.1824, p-value = 3.157e-10
## alternative hypothesis: true mean is greater than 4.296259
## 95 percent confidence interval:
##  5.331115       NA
## sample estimates:
## mean of x 
##   5.70625

z.test(streaming_df_Bd3$hours_watched,mu=streaming_df_pop_mean,alternative = c("greater"),sigma.x =streaming_df_pop_sd, conf.level = 0.95)

## 
##  One-sample z-Test
## 
## data:  streaming_df_Bd3$hours_watched
## z = -0.2345, p-value = 0.5927
## alternative hypothesis: true mean is greater than 4.296259
## 95 percent confidence interval:
##  3.690105       NA
## sample estimates:
## mean of x 
##  4.220625

z.test(streaming_df_Bd4$hours_watched,mu=streaming_df_pop_mean,alternative = c("greater"),sigma.x =streaming_df_pop_sd, conf.level = 0.95)

## 
##  One-sample z-Test
## 
## data:  streaming_df_Bd4$hours_watched
## z = -0.10033, p-value = 0.54
## alternative hypothesis: true mean is greater than 4.296259
## 95 percent confidence interval:
##  4.003135       NA
## sample estimates:
## mean of x 
##  4.279407

5.Interpret demographics if sample size > minimum sample size and if the effect was statistically significant

One sample z-test was conducted to determine if the Treatment group’s mean hours watched were significantly different from the population mean hours with the significance level 0.05
The result was p-value(0.001) < significance level (0.05) and therefore we can say that Treatment group means is significantly higher than the population mean.
Also performed this test to the sub groups within the treatment group to find out if there’s a statistically significant difference between the means.
Results are as below:

streaming_df_B %>% group_by(demographic,gender) %>% summarise( n = n(),
 Min_Age = min(age,na.rm = TRUE), 
 Max_Age = max(age,na.rm = TRUE),
 Mean_hrs_watch =mean(hours_watched,na.rm = TRUE),
 SD_hrs_watch = sd(hours_watched, na.rm = TRUE))

Demographic 1 : Females aged 18-34 years recorded a mean of 5.75 hours which is greater than the population mean of 4.3 hours. According to the z- test, p-value(0.001) < significance level (0.05) .Therefore the treatment group D1 mean hours are significantly higher than the population mean hours.
Demographic 2 : Males aged 18-35 years recorded a mean of 5.71 hours which is greater than the population mean of 4.3 hours. According to the z- test, p-value(0.001) < significance level (0.05) .Therefore the treatment group D2 mean hours are significantly higher than the population mean hours.
Demographic 3 : Females aged 38-55 years recorded a mean of 4.22 hours which is almost similar to the population mean of 4.3 hours. According to the z- test, p-value(0.5927) > significance level (0.05) .Therefore the treatment group D3 mean hours and the population mean hours do not have a significant difference..
Demographic 4 : Males aged 36-55 years recorded a mean of 4.28 hours which is almost similar to the population mean of 4.3 hours. According to the z- test, p-value( 0.54) > significance level (0.05) .Therefore the treatment group D4 mean hours and the population mean hours do not have a significant difference..

6.Use regression to determine best predictors for target variable

6.1 Model I - Building a linear regression model

According to the below results we can say that hours watched & age is coefficient.
R^2 indicates a moderate fit for the model and age only explains 32.8% of variability in hours watched.
p value (.001)< significance level (0.05). This indicates that, there’s statistically significant evidence that the data fits a linear regression model.
Keeping this aside and trying another model with more variables.

model1 <- lm(hours_watched~age,streaming_df)
model1

## 
## Call:
## lm(formula = hours_watched ~ age, data = streaming_df)
## 
## Coefficients:
## (Intercept)          age  
##     7.00012     -0.07145

summary(model1)

## 
## Call:
## lm(formula = hours_watched ~ age, data = streaming_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5696 -0.7474  0.0021  0.7706  2.9391 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.000118   0.123038   56.89   <2e-16 ***
## age         -0.071453   0.003236  -22.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.093 on 998 degrees of freedom
## Multiple R-squared:  0.3282, Adjusted R-squared:  0.3275 
## F-statistic: 487.5 on 1 and 998 DF,  p-value: < 2.2e-16

Checking the below assumptions for the modelI.

par(mfrow = c(2,2))
plot(model1)

par(mfrow =c(1,1))

Linearity - according to the above plot (Residual Vs Fitted), displays that the relationship between fitted values and residuals is flat (demonstrated by the flat red line). This indicates a linear relationship is being modeled.
Normality of Residulas - the Q- Q plot suggests there are no major deviations from normality. It would be safe to assume the residuals are approximately normally distributed.
Homoscedasticity - Residuals v Fitted Values plot indicates that the variance appears to remain the same especially given that red line looks quite straight, Homoscedasticity can therefore be assumed. The Scale-Location plot also indicates that homoscedasticity can be assumed given the red line appears relatively flat and the variance in the square root of the standardized residuals looks to be consistent across predicted (fitted values).
Outliers - Residuals vs Leverage plot isn’t showing any extreme outliers that are unduly influencing the model.

6.2 Model II - Building a Multiple linear regression model

According to the below result only age and social metric is coefficient.
R^2 could increase due to adding more variables and therefore building a model, only with coefficient variables.

model2 <- lm(hours_watched~age+gender+social_metric+time_since_signup+demographic, streaming_df)
model2

## 
## Call:
## lm(formula = hours_watched ~ age + gender + social_metric + time_since_signup + 
##     demographic, data = streaming_df)
## 
## Coefficients:
##       (Intercept)                age            genderF      social_metric  
##          6.173435          -0.063715           0.122741           0.097813  
## time_since_signup       demographic2       demographic3       demographic4  
##          0.002633           0.122707          -0.249048                 NA

summary(model2)

## 
## Call:
## lm(formula = hours_watched ~ age + gender + social_metric + time_since_signup + 
##     demographic, data = streaming_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6505 -0.6608 -0.0184  0.7068  2.7745 
## 
## Coefficients: (1 not defined because of singularities)
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        6.173435   0.299196  20.633   <2e-16 ***
## age               -0.063715   0.006128 -10.397   <2e-16 ***
## genderF            0.122741   0.145012   0.846    0.398    
## social_metric      0.097813   0.011165   8.761   <2e-16 ***
## time_since_signup  0.002633   0.004628   0.569    0.570    
## demographic2       0.122707   0.142342   0.862    0.389    
## demographic3      -0.249048   0.153499  -1.622    0.105    
## demographic4             NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.054 on 993 degrees of freedom
## Multiple R-squared:  0.3785, Adjusted R-squared:  0.3747 
## F-statistic: 100.8 on 6 and 993 DF,  p-value: < 2.2e-16

6.3 Model III - Building a Multiple linear regression model

According to the below results R^2 indicates a moderate fit for the model.
p value (.001)< significance level (0.05). This indicates that, there’s statistically significant evidence that the data fits a linear regression model.

model3 <- lm(hours_watched~age+social_metric, streaming_df)
model3

## 
## Call:
## lm(formula = hours_watched ~ age + social_metric, data = streaming_df)
## 
## Coefficients:
##   (Intercept)            age  social_metric  
##       6.49661       -0.07079        0.09757

summary(model3)

## 
## Call:
## lm(formula = hours_watched ~ age + social_metric, data = streaming_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7039 -0.6636 -0.0157  0.7398  2.8434 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.496610   0.131812   49.29   <2e-16 ***
## age           -0.070786   0.003121  -22.68   <2e-16 ***
## social_metric  0.097573   0.011139    8.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.054 on 997 degrees of freedom
## Multiple R-squared:  0.3762, Adjusted R-squared:  0.3749 
## F-statistic: 300.6 on 2 and 997 DF,  p-value: < 2.2e-16

Checking the below assumptions for the model III.

par(mfrow = c(2,2))
plot(model3)

par(mfrow =c(1,1))

Linearity - according to the above plot (Residual Vs Fitted), displays that the relationship between fitted values and residuals is flat (demonstrated by the flat red line). This indicates a linear relationship is being modeled.
Normality of Residuals - the Q- Q plot suggests there are no major deviations from normality. It would be safe to assume the residuals are approximately normally distributed.
Homoscedasticity - Residuals v Fitted Values plot indicates that the variance appears to remain the same especially given that red line looks quite straight, Homoscedasticity can therefore be assumed. The Scale-Location plot also indicates that homoscedasticity can be assumed given the red line appears relatively flat and the variance in the square root of the standardized residuals looks to be consistent across predicted (fitted values).
Outliers - Residuals vs Leverage plot isn’t showing any extreme outliers that are unduly influencing the model. Found some outliers in the hours watched variable when I was processing the data but dint treat them and going to run the model III without any outliers and see if it makes any considerable difference.

Removing the Outliers in hours watched variable

summary (streaming_df$hours_watched)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.500   3.530   4.415   4.393   5.322   8.300

##boxplot_with_outliers <- boxplot (streaming_df$hours_watched,xlab="Hours Watched",main=" Hours Watched with Outliers")

Removing the extremes above 8 and below 0.8

streaming_no_outliers <- streaming_df %>% filter(hours_watched < 8 & hours_watched > 0.8)
boxplot_no_outliers <- boxplot(streaming_no_outliers$hours_watched,xlab="Hours Watched",main="Hours Watched without Outliers")

##### Running the model III as in model IV with the cleaned dataset to see if there’s any significant change

According to the below results I cannot see any significant changes as p value is the same as model 3 and R^2 has increased by 0.01 and therefore I will be sticking with the model 3 as I have already verified the assumptions for the model 3.

model4 <- lm(hours_watched~age+social_metric, streaming_no_outliers)
model4

## 
## Call:
## lm(formula = hours_watched ~ age + social_metric, data = streaming_no_outliers)
## 
## Coefficients:
##   (Intercept)            age  social_metric  
##       6.45653       -0.06895        0.09303

summary(model4)

## 
## Call:
## lm(formula = hours_watched ~ age + social_metric, data = streaming_no_outliers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6687 -0.6661 -0.0296  0.7232  2.8339 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.456530   0.130228   49.58   <2e-16 ***
## age           -0.068953   0.003096  -22.27   <2e-16 ***
## social_metric  0.093034   0.011023    8.44   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.04 on 991 degrees of freedom
## Multiple R-squared:  0.3667, Adjusted R-squared:  0.3654 
## F-statistic: 286.8 on 2 and 991 DF,  p-value: < 2.2e-16

Major Findings on the investigation

Findings of the Dataset

There was a gender imbalance across the dataset which was indicated in the samples whereby males clearly outnumbered females
There was a clear negative relationship between age and hours watched
All these studies are vital as these could be used for future marketing opportunities for the company.

A/B Test Findings

It was possible to administer the A/B Test as the overall Treatment Group B sample size 120 was larger than the minimum sample size of 41.
Also it was possible to perform the test on 3 of the sub groups (Demographic group 1,2,4) as they all had the sample larger than the minimum sample size.
Demographic group 3 had the minimum sample size (16) which is not larger than the sample size (16) therefore the performing the test wasn’t possible.
According to the results of the one sample Z test, treatment group B mean hours are statistically significantly higher than the population mean hours for Demographic 1: females aged 18-to-34-years *Demographic 2: males aged 18-to-35-years
Demographic 4: males aged 36-to-55-years mean hours are statistically significantly not higher than the population mean hours according to the one sample z test.

Question: Is the new recommendation engine algorithm is worth rolling out to WWN subscribers?

The AB test results imply the fact that it is worth rolling out the new recommendation engine algorithm to WNW viewers given the statistically significant higher mean hours watched for Treatment Group B compared to Control Group A. This suggest that the algorithm appears to have increased viewer engagement measured by the increase in hours watched overall. Also we can see how the algorithm have influenced on viewing habits for different demographic groups based on age and gender.
Further we can emphasizes the fact that algorithm had the strongest impact on younger viewers aged 35 years and under especially for younger females (5.75 hours) whose viewing hours was slightly higher than younger males (5.71 hours). Further test should be conducted to determine if this relative difference between these two subgroups are valid.
The AB test results indicate that the algorithm did not influence on older male viewers aged over 36 years given that the mean hours was not statistically significantly higher than that of the control group.
A linear regression test conducted on age as the predictor variable and hours_watched as the response variable. Although there was statistically significant evidence that the data fitted the model, the R-squared indicated that age only accounted for around 32.8% of variabilty in hours watched. However, the slope of this test indicated statistically significant evidence that age is negatively related to hours watched and this matches with the negative relationship highlighted in the plots created to check the characteristics of the variables.All the assumptions were validated for the model.
Multiple regression analysis was conducted using age and social metrics as the predictor variables and hours watched as the response variable, This test validated the fact that that overall there was statistically significant evidence that the data fits a linear regression model for the two predictor variables of age and social metric scores. However, the R-squared score indicated that the combination of these two predictor variables only explain about 37.6% of the variability in hours watched. The slope coefficient for age indicated a statistically significant negative relationship with hours watched and there was statistically significant evidence for a positive relationship for most of the individual social metric scores.
This could depict that people with a higher the social metric score potentially watch more compared to people with lower scores. There is further investigations that would need to be performed for the linear regression test in terms of investigation the relative sample proportions of viewers in both the control and treatment groups.

Discuss any strengths ,limitations and future direction

Some of the major advantages of this study were the fairly large sample size of about 548 viewers in the control and treatment groups during the campaign which lends further weight to the validity of the results.
The subgroups were clearly identified in the data and present future opportunities for further investigations into hours watched for younger and older cohorts of viewers.
The foremost limitation to this investigation was the limited sample size especially for demographic group 3 (older females). Therefore, In future, a much larger sample size would give more precise results in terms of estimating the true change in hours watched for the entire viewing population and would also give more scope to test within subgroups as well given that minimum sample sizes are likely to be easily met.
The dataset was limited also to only the month of July which may have also skewed the data somewhat due to season factors such as winter and the start of semester 2 at university for students and also starting of schools term 3 after holidays. As such, running the campaign over a much longer period of time would perhaps give a more balanced result in terms of seasonal factors throughout the year. I suggest a six month period stretching from the middle of the winter to the middle of summer could capture more comprehensive dataset.
Another potential limitation of the investigation could be the gender imbalance that was reflected in the overall population dataset and then simulated in the samples. However employing a stratified sampling technique could minimize the gender imbalance.

.* In future increasing the sample size and running a longer campaign is high costly but considering the positive feedbacks received imply that allocating a budget for these enhancements are justifiable.

Overall Conclusion

It is acceptable to conclude that, new recommendation engine algorithm is worth rolling out especially for the younger viewers under the age of 36.
We couldn’t come up with a proper test results for the older viewers as the female test was not valid due to insufficient sample size and the male viewers test was not statistically significant. Therefore it will be worthwhile to go through some further investigation regarding this criteria as there’s a negative relationship between the age and watched hours.
Further investigations should be made into the relationships between social metric scores, age and hours watched so that we might be able to better predict viewing habits and better target or increase potential advertising revenue from sponsors
It is very well advised to run another campaign with a larger sample size and over a longer period of time to actually test the viewing habits of the subgroups and to achieve more accurate results and benefits.
Higher longer time hours watched and better predicting viewing habits relate to a growing customer base and more advertising revenue from highly recognized sponsors but need to outweigh the higher cost of larger sample sizes and longer running campaigns.

References

Sheenal Srivastava (2020),Towards Data Science: Towards Data Science, accessed 10 October 2022.https://medium.com/@sheenalsrivastava
Phillip Cleave (2019),Smart Survey: Ways to avoid sampling bias in surveys, accessed 11 October 2022. https://www.smartsurvey.co.uk/blog/ways-to-avoid-sampling-bias-in-surveys