Assessment 3

library(ggplot2)      
library(dplyr)
library(readr)
library(here) 
library(magrittr) 
library(psych)

Load Streaming data

stream <- "streaming_data.csv"
stream_df <- read.csv(here(stream), header = TRUE)
stream_df$gender <- gsub("F", "0", stream_df$gender) #change gender F to 0
stream_df$gender <- gsub("M", "1", stream_df$gender) #change gender M to 1
stream_df$gender <- as.numeric(stream_df$gender) #convert gender from character to numeric
streamF <- filter(stream_df, stream_df$gender == 0)  # select only females
streamM <- filter(stream_df, stream_df$gender == 1) # select only males

summary(stream_df)      # Inspect the dataset for possible errors
sum(is.na(stream_df))   # Check the dataset to make sure that there are no missing values
mean(is.na(stream_df))  # Check the dataset to make sure that there are no missing values

From the above checks, it is confirmed that the dataset is clean.

A1 <- filter(stream_df, stream_df$group == "A") #Select group A from start till end
A <- subset(A1, select = -c(date, group)) #Exclude date and group

Scatter matrix and correlation

pairs.panels(A[,-50], 
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
             )

From the scatter matrix graph and the correlation numbers, the correlation between age and demographic is 0.76. Only one of them will be included in the multiple linear regression (MLR) model to reduce the multicollinearity problem. Since the hours_watched (hrw) will be used as the dependent variable, and the correlation between hrw and age is -0.59, higher than that between hrw and demographic of -0.45. age is chosen as the preferred independent variable, together with social_metric. The other factors are not included due to their very low correlation (less than 0.03) with hrw.

It is also observed that only the hours_watched displays a normal distribution.

Multiple Linear Regression of all A (MLRA)

MLRA <- lm(hours_watched ~ age + social_metric, data = A) 

summary(MLRA)

The ANOVA for the MLRA shows that the p values for age and social_metric are significantly less than 1%, and the two variables will be retained in the multiple regression model.

The beta coefficient of the intercept suggests that the hour_watched is at least 6.54 hours when the independent variables are zero. This could come from factors not captured by the MLR. e.g. During covid lockdown, people generally spend more time streaming as they could not go to work and participate in other activities.

The beta coefficient for age is -0.07, this is consistent with the scatter plot above where a weak negative relationship between hours_watched and age is also observed. This means that older people do not stream as often as the younger people.

The beta coefficient for social_metric is 0.085. It is observed that the relationship between hours_watched and social-metric is not very clear as the data spread widely in the scatter plot, this can be explained by the nature of the data as they are expressed as categorical values between 0 and 10. However, the weak and positive relationship is significant and confirmed in the MLR.

Checking assumptions of Linear Models (fit1=hrw vs age, fit2=hrw vs social_metric)

Linearity-The scatter matrix above shows that there is a linear relationship between hrw and age. Hence satisfying the linearity assumption.
Constant variance-To check this assumption, a linear model hrw vs age is created (fit1).

y <- A[, 6] # Select hours_watched as y
x <- A[, 2] # Select age as x
fit1 = lm(y ~ x)

Plot fit1 over the data

# setup x variable with the min/max range
a0 <- coef(fit1)[1]
a1 <- coef(fit1)[2]
xfit1 <- seq(min(x), max(x), 1)

yfit1 <- a0 + a1 * xfit1 # calculate the fitting function yfit1

Plot hours_watched vs age

gg <- ggplot()
gg <- gg + geom_point(aes(x = x, y = y, colour ='Observed'))
gg <- gg + geom_line(aes(x = xfit1, y = yfit1), colour = 'black')
gg <- gg + labs(x = 'age', y = 'hours_watched')
gg

The hours_watched vs age graph of the observations and the fitted line (fit1) shows that the errors are evenly distributed around the fitted line, thus the assumption of constant variance is satisfied.

Check normal distribution of the error term

Af1 <- a0 + a1 * A$age  #Af1 is the forecast hours_watched based on the original age observations

A$e1 <- A$hours_watched - Af1 # calculate the residual
gg <- ggplot()
gg <- gg + geom_point(aes(x, y = A$e1))
gg <- gg + labs(x = 'x', y = 'residual')
gg

qqnorm(A$e1, 
        main = 'QQ Plot for residual1 of hours_watched', 
        xlab = 'Theoretical Quantiles', ylab = 'Residual1', col="violet")
qqline(A$e1)

The QQPlot of the residual of hours_watched shows that the residual satisifies the normality assumption.

Check histogram of residual

gg <- ggplot()
gg <- gg + geom_histogram(aes(x = A$e1), bins = 10, colour = "black") 
gg <- gg + labs(title = "Residual1 of hours_watched", x = "Residual1") 
gg

The histogram confirms that the residual of hours_watched is normally distributed.

The independence assumption is satisfied when the high correlation (0.76) between age and demographic was recognised in the scatter matrix and age was chosen in the MLR and the mulitcollinearity issue is reduced.

Check the assumptions for hours_watched vs social_metric

Linearity-The scatter plot shows that there is a linear relationship between hrw and social_metric. Hence satisfying the linearity assumption.
Constant variance

Create a linear model hrw vs social_metric

y2 <- A[, 6] # Select hours_watched as y
x2 <- A[, 3] # Select social_watched as x2
fit2 = lm(y2 ~ x2)

summary(fit2)

Plot fit2 over the data

# setup x2 variable with the min/max range

xfit2 <- seq(min(x2), max(x2), 1)
b0 <- coef(fit2)[1]
b1 <- coef(fit2)[2]
yfit2 <- b0 + b1 * xfit2 # calculate the fitting function yfit2

gg <- ggplot()
gg <- gg + geom_point(aes(x = x2, y = y2, colour='observed'))
gg <- gg + geom_line(aes(x = xfit2, y = yfit2), colour = 'black')
gg <- gg + labs(x = 'social_Metric', y = 'hours_watched')
gg

The hours_watched vs social_metric graph of the observations and the fitted line (fit2) shows that the errors are evenly distributed around the fitted line, thus the assumption of constant variance is satisfied.

Check normal distribution of the error term

Af2 <- b0 + b1 * A$social_metric  #Af2 is the forecast hours_watched based on the original age observations

A$e2 <- A$hours_watched - Af2  # calculate the residual
gg <- ggplot()
gg <- gg + geom_point(aes(x, y = A$e2))
gg <- gg + labs(x = 'x', y = 'residual')
gg

qqnorm(A$e2, 
        main = 'QQ Plot for residual2 of hours_watched', 
        xlab = 'Theoretical Quantiles', ylab = 'Residual', col="violet")
qqline(A$e2)

The QQPlot of the residual of hours_watched shows that the residual satisfies the normality assumption.

Check histogram of residual

gg <- ggplot()
gg <- gg + geom_histogram(aes(x = A$e2), bins = 10, colour = "black") 
gg <- gg + labs(title = "Residual2 of hours_watched", x = "Residual2") 
gg

The histogram confirms that the residual of hours_watched is normally distributed.

The independence assumption is satisfied when the correlation between age and social_metric was found to be low (0.21) in the scatter plot.

Examine the characteristics of the variables.

Streaming behaviour of old vs young

stream_df <- read.csv(here(stream), header = TRUE)

p1<- ggplot(stream_df, aes(x=age, y=hours_watched)) +
  geom_point(size=2, shape=23)
p1

The plot shows that older people spend less time streaming than younger people.

Check for the streaming behaviour between Female and Male

p2<- ggplot(stream_df, aes(x=gender, y=hours_watched)) +
  geom_point(size=2, shape=23)

p2

Difference of hrw between female and male

Fmeanhrw <- mean(streamF$hours_watched)  #Average hrw of females (0)
Mmeanhrw <- mean(streamM$hours_watched)  #Average hrw of males (1)

Average hrw watched by female is 4.35 hours. Average hrw watched by male is 4.42 hours. On average, male spend slighly more time (4.2 min) than female, and the plot shows that there are more males streaming between 7 to 8 hrw than females.

Check age distribution

stream_df2 <- read.csv(here(stream), header = TRUE)

p4 <-ggplot(data=stream_df2, aes(x=age, y=hours_watched)) +
  geom_bar(stat="identity")
p4

The hours_watched for ages 20, 23 and 28 exceeds 175 hours.

Age would be a logical divider in terms of demographic groups of interest in later analysis.

Different marketing strategies would be needed based on age and gender.

gg <- ggplot(stream_df2, aes(x = age, fill = gender))
gg <- gg + geom_histogram(alpha = 0.5, position = "identity", binwidth = 1)
gg

Clusters

int_breaks_rounded <- function(x, n = 10) pretty(x, n)[round(pretty(x, n),1) %% 1 == 0]

gg <- ggplot()
gg <- gg + geom_point(aes(x = stream_df2$age,
                          y = stream_df2$hours_watched))

gg <- gg + scale_y_continuous(breaks = int_breaks_rounded )

gg <- gg + labs(x = 'Age (years)', y = 'hours_watched')
gg

A clear split in this data is when the hours_watched is 7. The plot shows that for ages below 35 many of the hours_watched exceed 7 hours.

Check for any bias in A/B based on WNW’s selection

Calculate proportions of demographic (demo) distribution in A

Separate demographic into 1 to 4

post177 <- stream_df[549:1000,] #Select rows from 18 July till end
M2A <- filter(post177, post177$group == "A") #Select group A from 18 July till end

AM2demo1 <- filter(M2A, M2A$demographic == 1) # contains demographic = 1
AM2demo2 <- filter(M2A, M2A$demographic == 2) # contains demographic = 2
AM2demo3 <- filter(M2A, M2A$demographic == 3) # contains demographic = 3
AM2demo4 <- filter(M2A, M2A$demographic == 4) # contains demographic = 4
AM2demo  <- filter(M2A, M2A$demographic <= 5) # contains demographic = 1 to 4

Ademo1Count <- AM2demo1%>% count(demographic) # Count of demographic = 1
Ademo2Count <- AM2demo2%>% count(demographic) # Count of demographic = 2
Ademo3Count <- AM2demo3%>% count(demographic) # Count of demographic = 3
Ademo4Count <- AM2demo4%>% count(demographic) # Count of demographic = 4
AdemoCount <- AM2demo%>% count(demographic) # Count of all demographic in A

Ademo1C <- sum(Ademo1Count[,2]) # Extract count of demo = 1
Ademo2C <- sum(Ademo2Count[,2]) # Extract count of demo = 2
Ademo3C <- sum(Ademo3Count[,2]) # Extract count of demo = 3
Ademo4C <- sum(Ademo4Count[,2]) # Extract count of demo = 4
AdemoC <- sum(AdemoCount[,2]) # Extract count of all demo
ATot_demo <- AdemoC

#Calculate proportions in A
Ademo1 <- round((Ademo1C/ATot_demo) * 100, digits= 2) 
Ademo2 <- round((Ademo2C/ATot_demo) * 100, digits= 2)
Ademo3 <- round((Ademo3C/ATot_demo) * 100, digits= 2)
Ademo4 <- round((Ademo4C/ATot_demo) * 100, digits= 2)

Calculate proportions of demographic (demo) distribution in B

Separate demographic into 1 to 4

M2B <- filter(post177, post177$group == "B")
BM2demo1 <- filter(M2B, M2B$demographic == 1) # contains demographic = 1
BM2demo2 <- filter(M2B, M2B$demographic == 2) # contains demographic = 2
BM2demo3 <- filter(M2B, M2B$demographic == 3) # contains demographic = 3
BM2demo4 <- filter(M2B, M2B$demographic == 4) # contains demographic = 4
BM2demo  <- filter(M2B, M2B$demographic <= 5) # contains demographic = 1 to 4

Bdemo1Count <- BM2demo1%>% count(demographic) # Count of demographic = 1
Bdemo2Count <- BM2demo2%>% count(demographic) # Count of demographic = 2
Bdemo3Count <- BM2demo3%>% count(demographic) # Count of demographic = 3
Bdemo4Count <- BM2demo4%>% count(demographic) # Count of demographic = 4
BdemoCount <- BM2demo%>% count(demographic) # Count of all demographic in A

Bdemo1C <- sum(Bdemo1Count[,2]) # Extract number of demo = 1
Bdemo2C <- sum(Bdemo2Count[,2]) # Extract number of demo = 2
Bdemo3C <- sum(Bdemo3Count[,2]) # Extract number of demo = 3
Bdemo4C <- sum(Bdemo4Count[,2]) # Extract number of demo = 4
BdemoC <- sum(BdemoCount[,2]) # Extract number of all demographic in B
BTot_demo <- BdemoC

#calculate proportions in B
Bdemo1 <- round((Bdemo1C/BTot_demo) * 100, digits= 2)
Bdemo2 <- round((Bdemo2C/BTot_demo) * 100, digits= 2)
Bdemo3 <- round((Bdemo3C/BTot_demo) * 100, digits= 2)
Bdemo4 <- round((Bdemo4C/BTot_demo) * 100, digits= 2)

Check group balances

WNW setup the A/B groups, check if A/B groups represent the total population.

Count the numbers in each demographic category based on the A/B group from 18/7

check_a_df <- M2A %>% 
  select(demographic) %>% 
  group_by(demographic) %>% 
  mutate(n_a = n()) %>% 
  distinct()
  
check_b_df <- M2B %>% 
select(demographic) %>% 
  group_by(demographic) %>% 
mutate(n_b = n()) %>% 
distinct()

n_total_a <- sum(M2A$group == 'A') # total numbers in A
n_total_b <- sum(M2B$group == 'B') # total numbers in A

check_a_df$p_a <- check_a_df$n_a / n_total_a 
check_b_df$p_b <- check_b_df$n_b / n_total_b
check_df <- inner_join(check_a_df, check_b_df) # join on demo categories

# calculate the difference in proportions
check_df$diff <- check_df$p_a - check_df$p_b  

# if there is no bias apart from sampling noise then the difference should be small and normally distributed

qqnorm(check_df$diff, 
        main = 'Residual of difference in demographic  proportions between A and B )', 
        xlab = 'Theoretical Quantiles', ylab = 'Residual1', col="violet")
        
qqline(check_df$diff)

The plot shows that the error is not normally distributed and showing a left skewed distribution. This is confirmed in the histogram below.

Histogram of error in the demographic distribution

# total numbers in each group
n_total_a <- sum(M2A$group == 'A')
n_total_b <- sum(M2B$group == 'B')

check_a_df$p_a <- check_a_df$n_a / n_total_a # proportions in demographic A
check_b_df$p_b <- check_b_df$n_b / n_total_b # proportions in demographic B

check_df <- inner_join(check_a_df, check_b_df) # join on demo categories

check_df$diff <- check_df$p_a - check_df$p_b  # difference in proportions

gg <- ggplot()
gg <- gg + geom_histogram(aes(x = check_df$diff), bins = 10, colour = "black") 
gg <- gg + labs(title = "Residual of difference in demographic proportions between A and B", x = "Residual") 
gg

Difference in demographic proportions between A and B


demo1ABdiff <- Ademo1 - Bdemo1 #% Difference bet demographic 1 in A and B
demo2ABdiff <- Ademo2 - Bdemo2 #% Difference bet demographic 2 in A and B
demo3ABdiff <- Ademo3 - Bdemo3 #% Difference bet demographic 3 in A and B
demo4ABdiff <- Ademo4 - Bdemo4 #% Difference bet demographic 4 in A and B

Summary of differences in proportion in A and B.

Demographic 1 = 11.46%

Demographic 2 = 2.25%

Demographic 3 = 13.47%

Demographic 4 = -27.18%

The differences in demographic proportions in A and B shown in the analyses above confirm that the data are biased.

As the data is biased, A and B cannot represent the whole population. However, the sub-sample inside the demographic of A and B can still represent the population in that demographic as they have the same demographic characteristics.

To check the differences in average hours_watched for each demographic group between A and B.

# Compare sd

demo1diff <- sd(AM2demo1$hours_watched) - sd(BM2demo1$hours_watched)
demo2diff <- sd(AM2demo2$hours_watched) - sd(BM2demo2$hours_watched)
demo3diff <- sd(AM2demo3$hours_watched) - sd(BM2demo3$hours_watched)
demo4diff <- sd(AM2demo4$hours_watched) - sd(BM2demo4$hours_watched)

The sd between the demographic groups in A/B are found to be unequal, therefore Welsh tests are used in the t test.

# Compare means

demo1Mdiff <- mean(AM2demo1$hours_watched) - sd(BM2demo1$hours_watched)
demo2Mdiff <- mean(AM2demo2$hours_watched) - sd(BM2demo2$hours_watched)
demo3Mdiff <- mean(AM2demo3$hours_watched) - sd(BM2demo3$hours_watched)
demo4Mdiff <- mean(AM2demo4$hours_watched) - sd(BM2demo4$hours_watched)

The differences in mean hrw between the demographic groups in A/B are: demo1=4.52 , demo2=4.00. demo3=2.20, dem4=2.51

Hypothesis Tests

Null Hypothesis: The means of A and B are the same Alternative Hypothesis: The means of A and B are not the same

t.test(AM2demo1$hours_watched, BM2demo1$hours_watched, alternative = "two.sided", var.equal = FALSE) # demographic = 1

t.test(AM2demo2$hours_watched, BM2demo2$hours_watched, alternative = "two.sided", var.equal = FALSE) # demographic = 2

t.test(AM2demo3$hours_watched, BM2demo3$hours_watched, alternative = "two.sided", var.equal = FALSE) # demographic = 3

t.test(AM2demo4$hours_watched, BM2demo4$hours_watched, alternative = "two.sided", var.equal = FALSE) # demographic = 4

The t tests show that only demographic 3 is insignificant at 5% level of significance. It is concluded that the alternative hypothesis cannot be rejected and that the mean of the hours_watched in demographic 1, 2 and 4 are significantly different between A and B.

Correction of bias in A/B

Use the effect calculated from 18/7 to 31/7, and the proportions from the whole population from 1/7 to 31/7 to estimate the weighted average of the final effect.

Effect=difference of mean hours_watched from 18/7

Ademo1<- mean(AM2demo1$hours_watched)
Ademo2<- mean(AM2demo2$hours_watched)
Ademo3<- mean(AM2demo3$hours_watched)
Ademo4<- mean(AM2demo4$hours_watched)

Bdemo1<- mean(BM2demo1$hours_watched)
Bdemo2<- mean(BM2demo2$hours_watched)
Bdemo3<- mean(BM2demo3$hours_watched)
Bdemo4<- mean(BM2demo4$hours_watched)

# Effect is defined as mean of hrw B - mean of hrw A

demo1E1 <- Bdemo1 - Ademo1 #Effect of demo 1
demo2E2 <- Bdemo2 - Ademo2 #Effect of demo 2
demo3E3 <- Bdemo3 - Ademo3 #Effect of demo 3
demo4E4 <- Bdemo4 - Ademo4 #Effect of demo 4

Count the proportion of demographic group based on the complete dataset from 1/7 to 31/7.

check_w_df <- stream_df %>% 
select(demographic) %>% 
  group_by(demographic) %>% 
mutate(n_w = n()) %>% 
distinct()

Wn_total1 <- count(stream_df, "demographic")
Wn_total <- Wn_total1[,2]

Weighted Average of the Final Effect

check_w_df$p_w <- check_w_df$n_w / Wn_total
df2 <- arrange(check_w_df, demographic) #sort demographic in order
Effect <- c(demo1E1, demo2E2, demo3E3, demo4E4)
Prop <- df2$p_w

WAFinalE <- round(sum(Effect * Prop), digits=4) # Weighted Average final effect

The Weighted Average of the final effect of hours_watched due to the new recommendation engine is estimated to be 0.5975 hours (about 36 min)

Assessment 3

Wing Siang Chang, s3918329

24/2/2022