Introduction

The streaming company called “Why Not Watch” wants to improve their recommendation engine to provide better recommendation to customers. the better recommendations increase user engagement and increase the average hours watched per user per day which is the key metric used to price ads for the third party marketing companies.

Problem Statement

The overall question of this investigation is whether or not the new recommendation engine algorithm is worth launching. There are a few more important questions need to be answered to ensure the next investigation will be more effective as follow;

Is there any bias in the data collection?
How could any bias be corrected?
What improvements could be made to future A/B tests?

Data

The important variables.
- date : The date in which the data are corrected. I converted date variable from character type in to date type make it more convenient to subset the data into separate groups.
- gender : The gender of the customer. I made this variable to be factor with 2 levels, Male and Female.
- age : The age of the customer.
- social_metric : The socail metric is a combined metric based on previous viewing habits
- time_since_signup : number of months since the customer signed up.
- demographic : The demographic number of 4 types of customers
- group : The groups of customer where A is the control and B is the treated group
- hours_watched : number of hours watched in that day
If you have factors, explain each level.
Explain the scale of numeric variables.
Justify everything that you do to preprocess the data.

#Load data
data <- read_csv("/Users/marshall/Desktop/Applied analytics/Assessment 3/streaming_data.csv")
data$date %<>%  as.Date(data$date, format = "%d/%m") #convert to date
data$gender %<>% as.factor() #convert to factor
data$group %<>% as.factor() #convert to factor
summary(data) #check the summary of each variable

##       date            gender       age        social_metric   
##  Min.   :2022-07-01   F:429   Min.   :18.00   Min.   : 0.000  
##  1st Qu.:2022-07-08   M:571   1st Qu.:28.00   1st Qu.: 2.000  
##  Median :2022-07-16           Median :36.00   Median : 5.000  
##  Mean   :2022-07-16           Mean   :36.49   Mean   : 4.911  
##  3rd Qu.:2022-07-24           3rd Qu.:46.00   3rd Qu.: 8.000  
##  Max.   :2022-07-31           Max.   :55.00   Max.   :10.000  
##  time_since_signup  demographic    group   hours_watched  
##  Min.   : 0.00     Min.   :1.000   A:880   Min.   :0.500  
##  1st Qu.: 5.70     1st Qu.:2.000   B:120   1st Qu.:3.530  
##  Median :11.80     Median :3.000           Median :4.415  
##  Mean   :11.97     Mean   :2.603           Mean   :4.393  
##  3rd Qu.:18.70     3rd Qu.:4.000           3rd Qu.:5.322  
##  Max.   :24.00     Max.   :4.000           Max.   :8.300

#seperate control group and treatment group, separate before and after treatment date
a_after <- data %>% filter(group == "A" & date >= as.Date("2022-07-18")) 
b_after <- data %>% filter(group == "B" & date >= as.Date("2022-07-18"))
whole_before <- data %>% filter(date < as.Date("2022-07-18"))
data_a <- data %>% filter(group == "A")

#summary_a<-as.data.frame(apply(a_after %>% select(-c(date, gender, group)),2,summary))
#summary_b<-as.data.frame(apply(b_after %>% select(-c(date, gender, group)),2,summary))
#summary_a
#summary_b

check for outlier

From the result, there are some outlier in hours_watched column but these outliers don’t look like an impossible value for me that is why I am not going to remove them from the datatset.

par(mfrow=c(1,3))
#age
box_age <- boxplot(data$age,
  ylab = "age",
  main = "Boxplot of age"
)
mtext(side=1,paste("Outliers: ", paste(box_age$out, collapse = ", ")))

#social metric
box_social <- boxplot(data$social_metric,
  ylab = "social metric",
  main = "Boxplot of social metric"
)
mtext(side=1,paste("Outliers: ", paste(box_social$out, collapse = ", ")))

#time_since_signup
box_time <- boxplot(data$time_since_signup,
  ylab = "time since signup",
  main = "Boxplot of time since signup"
)
mtext(side=1,paste("Outliers: ", paste(box_time$out, collapse = ", ")))

par(mfrow=c(1,2))
par(cex.main=1)
#demographic
box_demo <- boxplot(data$demographic,
  ylab = "demographic number",
  main = "Boxplot of demographic number"
)
mtext(side=1,paste("Outliers: ", paste(box_demo$out, collapse = ", ")))

#hours_watched
box_demo <- boxplot(data$hours_watched,
  ylab = "hours watched per day",
  main = "Boxplot of hours watched per day"
)
mtext(side=1,paste("Outliers: ", paste(box_demo$out, collapse = ", ")))

Correlation checking

I used visualisations to check the correlation between hours_watched and other variables. I also use the ANOVA

For variable “date”, I assign 7 different colours in the plot to see if day of the week relevant to the hours watched. If we assume that people are watching more on Saturday and Sunday, In September 2022, There are 5 days of weekend before the 18th of the month but there are only 3 days of weekend after the 18th. This scenario can cause some bias. But because the data did not give me the year of this investigation so I can’t plot the weekend and weekday comparison. From the plot, there is no sign of relationship between the day and the hours watched.
For variable “gender”, I use the lm() and anova() the check the null hypothesis that gender and hours_watched are independent from each other and the p-value is 0.402 which is above 0.05. The hypothesis is not rejected and it means the hours_watched and the gender variables are independent from each other.
For variable “age”, “social_metric” and “time_since_signup”, I use the pearson correlation coefficient method with the Student’s test and the p-values are as follow;
- For “age”, the p-value is 2.2e-16(less than 0.05) which mean the two variable is related
- For “social_metric”, the p-value is 1.386e-10(less than 0.05) which mean the two variable is related.
- For “time_since_signup”, the p-value is 0.8446(less than 0.05) which mean the null hypothesis that the correlation is equal to 0 is not rejected so the “time_since_signup” variable doesn’t have correlation to the “hours_watched” variable

par(mfrow=c(2,2))
# date
p1 <- ggplot(data = data_a, aes(x = hours_watched, y = as.character(date))) 
p1 + geom_boxplot(fill = c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00","#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00","#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00","#000000", "#E69F00", "#56B4E9"))

#gender
p2 <- ggplot(data = data_a, aes(x = gender, y = hours_watched)) 
p2 + geom_boxplot()

corr_gender <- lm(hours_watched~gender, data=data)
anova(corr_gender)

## Analysis of Variance Table
## 
## Response: hours_watched
##            Df  Sum Sq Mean Sq F value Pr(>F)
## gender      1    1.25  1.2513  0.7039 0.4017
## Residuals 998 1774.17  1.7777

#age
p3 <- plot(hours_watched~age, data=data_a)
corr_age <- cor.test(data_a$age, data_a$hours_watched, method = 'pearson')
corr_age

## 
##  Pearson's product-moment correlation
## 
## data:  data_a$age and data_a$hours_watched
## t = -21.794, df = 878, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6337707 -0.5478657
## sample estimates:
##        cor 
## -0.5925001

#social_metric
p4 <- plot(hours_watched~social_metric, data=data)
corr_social <- cor.test(data_a$social_metric, data_a$hours_watched, method = 'pearson')
corr_social

## 
##  Pearson's product-moment correlation
## 
## data:  data_a$social_metric and data_a$hours_watched
## t = 6.4953, df = 878, p-value = 1.386e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1501594 0.2762984
## sample estimates:
##       cor 
## 0.2141214

#time_since_signup
p5 <- plot(hours_watched~time_since_signup, data=data_a)
corr_time <- cor.test(data_a$time_since_signup, data_a$hours_watched, method = 'pearson')
corr_time

## 
##  Pearson's product-moment correlation
## 
## data:  data_a$time_since_signup and data_a$hours_watched
## t = -0.19602, df = 878, p-value = 0.8446
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.07267021  0.05949767
## sample estimates:
##         cor 
## -0.00661516

### Linear regression model

par(mfrow=c(2,2))
#age
fit_age <- lm(hours_watched ~ age, data = data_a) 
summary(fit_age)

## 
## Call:
## lm(formula = hours_watched ~ age, data = data_a)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5142 -0.7242 -0.0030  0.7474  3.0046 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.980111   0.126542   55.16   <2e-16 ***
## age         -0.073137   0.003356  -21.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.067 on 878 degrees of freedom
## Multiple R-squared:  0.3511, Adjusted R-squared:  0.3503 
## F-statistic:   475 on 1 and 878 DF,  p-value: < 2.2e-16

a0_age <- coef(fit_age)[1]
a1_age <- coef(fit_age)[2]

# setup x variable with range of interest
xfit_age <- seq(min(data_a$age), max(data_a$age), 1)

# calculate the fitting function yfit based on the coefficients from the model
yfit_age <- a0_age + a1_age * xfit_age

gg_age <- ggplot()
gg_age <- gg_age + geom_point(aes(x = data_a$age, y = data_a$hours_watched))
gg_age <- gg_age+ geom_line(aes(x = xfit_age, y = yfit_age), colour = "#E69F00")
gg_age <- gg_age+ labs(x = 'age', y = 'hours watched')
gg_age

#social metric
fit_social <- lm(hours_watched ~ social_metric, data = data_a) 
summary(fit_age)

## 
## Call:
## lm(formula = hours_watched ~ age, data = data_a)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5142 -0.7242 -0.0030  0.7474  3.0046 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.980111   0.126542   55.16   <2e-16 ***
## age         -0.073137   0.003356  -21.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.067 on 878 degrees of freedom
## Multiple R-squared:  0.3511, Adjusted R-squared:  0.3503 
## F-statistic:   475 on 1 and 878 DF,  p-value: < 2.2e-16

a0_social <- coef(fit_social)[1]
a1_social <- coef(fit_social)[2]

# setup x variable with range of interest
xfit_social <- seq(min(data_a$social_metric), max(data_a$social_metric), 1)

# calculate the fitting function yfit based on the coefficients from the model
yfit_social <- a0_social + a1_social * xfit_social

gg_social <- ggplot() 
gg_social<- gg_social + geom_point(aes(x = data_a$social_metric, y = data_a$hours_watched))
gg_social<- gg_social + geom_line(aes(x = xfit_social, y = yfit_social), colour = "#E69F00")
gg_social<- gg_social + labs(x = 'social metric', y = 'hours watched')
gg_social

#demographic
fit_demo <- lm(hours_watched ~ demographic, data = data_a) 
summary(fit_demo)

## 
## Call:
## lm(formula = hours_watched ~ demographic, data = data_a)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5961 -0.7685 -0.0064  0.7865  3.3832 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.68799    0.09877   57.59   <2e-16 ***
## demographic -0.53062    0.03547  -14.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.183 on 878 degrees of freedom
## Multiple R-squared:  0.2031, Adjusted R-squared:  0.2022 
## F-statistic: 223.8 on 1 and 878 DF,  p-value: < 2.2e-16

a0_demo <- coef(fit_demo)[1]
a1_demo <- coef(fit_demo)[2]

# setup x variable with range of interest
xfit_demo <- seq(min(data_a$demographic), max(data_a$demographic), 1)

# calculate the fitting function yfit based on the coefficients from the model
yfit_demo <- a0_demo + a1_demo * xfit_demo

gg_demo <- ggplot() 
gg_demo<- gg_demo + geom_point(aes(x = data_a$demographic, y = data_a$hours_watched))
gg_demo<- gg_demo + geom_line(aes(x = xfit_demo, y = yfit_demo), colour = "#E69F00")
gg_demo<- gg_demo + labs(x = 'demographic', y = 'hours watched')
gg_demo

#check assumption
par(mfrow=c(2,2))
plot(fit_age)

plot(fit_demo)

plot(fit_social)

Multiple regression

I use all group A data to fit the multiple regression model as they are from the same population, the same factor and the sample size is the biggest which mean it will give the more accurate result than a small sample size data.

hours_watched = 6.535941 0.072279 x age - 0.084869 x social_metric

reg_control <- lm(hours_watched ~ age+social_metric, data = data_a)
summary(reg_control)

## 
## Call:
## lm(formula = hours_watched ~ age + social_metric, data = data_a)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6244 -0.6361 -0.0271  0.6988  2.8773 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.535941   0.137147  47.657  < 2e-16 ***
## age           -0.072279   0.003262 -22.157  < 2e-16 ***
## social_metric  0.084869   0.011619   7.305 6.25e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.037 on 877 degrees of freedom
## Multiple R-squared:  0.3883, Adjusted R-squared:  0.3869 
## F-statistic: 278.3 on 2 and 877 DF,  p-value: < 2.2e-16

Check the demographic of each group

# Demograpic
unique(data$demographic)

## [1] 1 3 4 2

demo_1 <- data %>% filter(demographic==1) %>% select(-date)
summary(demo_1)

##  gender       age        social_metric    time_since_signup  demographic
##  F:216   Min.   :18.00   Min.   : 0.000   Min.   : 0.100    Min.   :1   
##  M:  0   1st Qu.:22.00   1st Qu.: 2.000   1st Qu.: 5.175    1st Qu.:1   
##          Median :27.50   Median : 4.000   Median :11.150    Median :1   
##          Mean   :27.11   Mean   : 4.787   Mean   :11.567    Mean   :1   
##          3rd Qu.:32.00   3rd Qu.: 7.000   3rd Qu.:18.425    3rd Qu.:1   
##          Max.   :35.00   Max.   :10.000   Max.   :23.900    Max.   :1   
##  group   hours_watched  
##  A:203   Min.   :1.930  
##  B: 13   1st Qu.:4.435  
##          Median :5.150  
##          Mean   :5.068  
##          3rd Qu.:5.772  
##          Max.   :8.300

demo_2 <- data %>% filter(demographic==2)%>% select(-date)
summary(demo_2)

##  gender       age        social_metric    time_since_signup  demographic
##  F:  0   Min.   :18.00   Min.   : 0.000   Min.   : 0.100    Min.   :2   
##  M:268   1st Qu.:23.00   1st Qu.: 2.000   1st Qu.: 4.975    1st Qu.:2   
##          Median :27.00   Median : 5.000   Median :11.400    Median :2   
##          Mean   :26.93   Mean   : 4.974   Mean   :11.360    Mean   :2   
##          3rd Qu.:31.00   3rd Qu.: 8.000   3rd Qu.:17.800    3rd Qu.:2   
##          Max.   :35.00   Max.   :10.000   Max.   :23.900    Max.   :2   
##  group   hours_watched  
##  A:236   Min.   :2.060  
##  B: 32   1st Qu.:4.298  
##          Median :5.100  
##          Mean   :5.097  
##          3rd Qu.:5.923  
##          Max.   :8.010

demo_3 <- data %>% filter(demographic==3)%>% select(-date)
summary(demo_3)

##  gender       age        social_metric   time_since_signup  demographic group  
##  F:213   Min.   :36.00   Min.   : 0.00   Min.   : 0.10     Min.   :3    A:197  
##  M:  0   1st Qu.:41.00   1st Qu.: 2.00   1st Qu.: 6.60     1st Qu.:3    B: 16  
##          Median :46.00   Median : 5.00   Median :12.00     Median :3           
##          Mean   :45.82   Mean   : 4.77   Mean   :12.36     Mean   :3           
##          3rd Qu.:51.00   3rd Qu.: 8.00   3rd Qu.:19.00     3rd Qu.:3           
##          Max.   :55.00   Max.   :10.00   Max.   :24.00     Max.   :3           
##  hours_watched  
##  Min.   :0.500  
##  1st Qu.:2.790  
##  Median :3.690  
##  Mean   :3.627  
##  3rd Qu.:4.490  
##  Max.   :6.950

demo_4 <- data %>% filter(demographic==4)%>% select(-date)
summary(demo_4)

##  gender       age        social_metric    time_since_signup  demographic
##  F:  0   Min.   :36.00   Min.   : 0.000   Min.   : 0.00     Min.   :4   
##  M:303   1st Qu.:40.00   1st Qu.: 2.000   1st Qu.: 6.90     1st Qu.:4   
##          Median :45.00   Median : 5.000   Median :12.60     Median :4   
##          Mean   :45.07   Mean   : 5.043   Mean   :12.51     Mean   :4   
##          3rd Qu.:50.00   3rd Qu.: 8.000   3rd Qu.:19.20     3rd Qu.:4   
##          Max.   :55.00   Max.   :10.000   Max.   :24.00     Max.   :4   
##  group   hours_watched  
##  A:244   Min.   :0.790  
##  B: 59   1st Qu.:3.010  
##          Median :3.850  
##          Mean   :3.828  
##          3rd Qu.:4.585  
##          Max.   :6.435

# count the numbers in each demographic category based on the A/B group
check_a_demo <- a_after %>%
  select(demographic) %>% 
  group_by(demographic) %>% 
  mutate(n_a = n()) %>% 
  distinct()

check_b_demo <- b_after %>% 
  select(demographic) %>% 
  group_by(demographic) %>% 
  mutate(n_b = n()) %>% 
  distinct()

# total numbers in each group
n_total_a <- sum(check_a_demo$n_a)
n_total_b <- sum(check_b_demo$n_b)

# proportions in each demographic
check_a_demo$p_a <- check_a_demo$n_a / n_total_a
check_b_demo$p_b <- check_b_demo$n_b / n_total_b

# join on demo categories
check_demo <- inner_join(check_a_demo, check_b_demo)

# calculate the difference in proportions
check_demo$diff <- check_demo$p_a - check_demo$p_b


# check the hypothesis that the two group are balance with X-squared
obs=check_demo$n_b
expected_vec <- check_demo$p_a
chisq.test(obs, p = expected_vec)

## 
##  Chi-squared test for given probabilities
## 
## data:  obs
## X-squared = 55.716, df = 3, p-value = 4.83e-12

# Check with visualisations
check_demo$ID <- seq.int(nrow(check_demo))


plot_demo <- check_demo %>% pivot_longer(c(p_a, p_b), "group", "probability")
plot_demo

## # A tibble: 8 × 7
## # Groups:   demographic [4]
##   demographic   n_a   n_b    diff    ID group value
##         <dbl> <int> <int>   <dbl> <int> <chr> <dbl>
## 1           4    73    59 -0.272      1 p_a   0.220
## 2           4    73    59 -0.272      1 p_b   0.492
## 3           3    89    16  0.135      2 p_a   0.268
## 4           3    89    16  0.135      2 p_b   0.133
## 5           1    74    13  0.115      3 p_a   0.223
## 6           1    74    13  0.115      3 p_b   0.108
## 7           2    96    32  0.0225     4 p_a   0.289
## 8           2    96    32  0.0225     4 p_b   0.267

demo_plot <- ggplot() + geom_col(aes(x=plot_demo$ID, y=plot_demo$value, fill=plot_demo$group), position = "dodge") +  scale_fill_manual(values=c("#E69F00", "#0C6872"))
demo_plot <- demo_plot + ggtitle("Demographic comparison between group A and group B") +
  xlab("Demographic number") + ylab("Probability")
demo_plot <- demo_plot + labs(fill = "Group")
demo_plot

From the results,

Demographic 1 is 18-35 years old female.
Demographic 2 is 18-35 years old male.
Demographic 3 is 36-55 years old female.
Demographic 4 is 36-55 years old male.

Bias

From the Chi-square test on the null hypothesis as the two group have got the same proportion of demographic, The p-value is less than 0.05 which mean the null hypothesis is rejected.

This information shown that there is a bias in the group A/B setup.

Two-sample Hypothesis testing

# Functions 
two_samp_t_stat <- function(xbar1, xbar2, s1, s2, n1, n2) {
  out <- (xbar1 - xbar2) / (sqrt(((s1^2)/n1) + ((s2^2)/n2)))
  out
} 

# Calculate mean of each demographic
mean_a <- a_after %>%
  group_by(demographic) %>%
  summarise_at(vars(hours_watched), list(mean = mean))

mean_b <- b_after %>%
  group_by(demographic) %>%
  summarise_at(vars(hours_watched), list(mean = mean))

# Calculate sd of each demographic
sd_a <- a_after %>%
  group_by(demographic) %>%
  summarise_at(vars(hours_watched), list(sd = sd))

sd_b <- b_after %>%
  group_by(demographic) %>%
  summarise_at(vars(hours_watched), list(sd = sd))

# Demographic 1

x_bar1 <- mean_a[[1,2]]
x_bar2 <- mean_b[[1,2]]
s_1 <- sd_a[[1,2]]
s_2 <- sd_b[[1,2]]
n_1 <- check_a_demo %>% filter(demographic == 1) %>% pull(n_a)
n_2 <- check_b_demo %>% filter(demographic == 1) %>% pull(n_b)
df <- n_1 + n_2 - 2 
#t-value
t_stat1 <- two_samp_t_stat(x_bar1, x_bar2, s_1, s_2, n_1, n_2) 
#t_stat %>% round(digits = 2)
#p-value
p_val1 <- pt(t_stat1, df = df, lower.tail = TRUE) 
p_val1 <- p_val1 * 2 

# Demographic 2

x_bar1 <- mean_a[[2,2]]
x_bar2 <- mean_b[[2,2]]
s_1 <- sd_a[[2,2]]
s_2 <- sd_b[[2,2]]
n_1 <- check_a_demo %>% filter(demographic == 2) %>% pull(n_a)
n_2 <- check_b_demo %>% filter(demographic == 2) %>% pull(n_b)
df <- n_1 + n_2 - 2  
#t-value
t_stat2 <- two_samp_t_stat(x_bar1, x_bar2, s_1, s_2, n_1, n_2)
#t_stat %>% round(digits = 2)
#p-value
p_val2 <- pt(t_stat2, df = df, lower.tail = TRUE) 
p_val2 <- p_val2 * 2 


# Demographic 3

x_bar1 <- mean_a[[3,2]]
x_bar2 <- mean_b[[3,2]]
s_1 <- sd_a[[3,2]]
s_2 <- sd_b[[3,2]]
n_1 <- check_a_demo %>% filter(demographic == 3) %>% pull(n_a)
n_2 <- check_b_demo %>% filter(demographic == 3) %>% pull(n_b)
df <- n_1 + n_2 - 2 
#t-value
t_stat3 <- two_samp_t_stat(x_bar1, x_bar2, s_1, s_2, n_1, n_2) 
#t_stat %>% round(digits = 2)
#p-value
p_val3 <- pt(t_stat3, df = df, lower.tail = TRUE) 
#p_val3 <- p_val3 * 2 

# Demographic 4

x_bar1 <- mean_a[[4,2]]
x_bar2 <- mean_b[[4,2]]
s_1 <- sd_a[[4,2]]
s_2 <- sd_b[[4,2]]
n_1 <- check_a_demo %>% filter(demographic == 4) %>% pull(n_a)
n_2 <- check_b_demo %>% filter(demographic == 4) %>% pull(n_b)
df <- n_1 + n_2 - 2 
#t-value
t_stat4 <- two_samp_t_stat(x_bar1, x_bar2, s_1, s_2, n_1, n_2)
#t_stat %>% round(digits = 2)
#p-value
p_val4 <- pt(t_stat4, df = df, lower.tail = TRUE) 
p_val4 <- p_val4 * 2 

hypo_test <- data.frame("T_value" = c(t_stat1, t_stat2, t_stat3, t_stat4), "P_value"=c(p_val1,p_val2,p_val3,p_val4), row.names = c("Demographic 1","Demographic 2","Demographic 3","Demographic 4"))
hypo_test %<>%  mutate(reject = P_value<0.05)
hypo_test

##                 T_value     P_value reject
## Demographic 1 -2.439771 0.016775078   TRUE
## Demographic 2 -2.570254 0.011325866   TRUE
## Demographic 3 -1.814818 0.036231115   TRUE
## Demographic 4 -3.038549 0.002874262   TRUE

Statistical significance tests

# Group A
g_a <- a_after %>% 
  select(demographic, hours_watched) %>% 
  group_by(demographic) %>%
  mutate(n_a = n(), mean_a = mean(hours_watched)) %>% 
  select(demographic, n_a, mean_a) %>% 
  distinct()

# Group B
g_b <- b_after %>% 
  select(demographic, hours_watched) %>% 
  group_by(demographic) %>%
  mutate(n_b = n(), mean_b = mean(hours_watched)) %>% 
  select(demographic, n_b, mean_b) %>% 
  distinct()

# effect comparison: join on all common column names
effect_comp_df <- inner_join(g_a, g_b)
effect_comp_df %<>% mutate(effect=mean_b-mean_a)

# required sample size
est_sd <- sd(g_a$mean_a)
z_alpha <- 1.96
effect_comp_df %<>% mutate(n_ss = (z_alpha * est_sd / effect)^2)
effect_comp_df %<>% mutate(significant = n_b > n_ss)
effect_comp_df

## # A tibble: 4 × 8
## # Groups:   demographic [4]
##   demographic   n_a mean_a   n_b mean_b effect  n_ss significant
##         <dbl> <int>  <dbl> <int>  <dbl>  <dbl> <dbl> <lgl>      
## 1           4    73   3.69    59   4.28  0.592  8.62 TRUE       
## 2           3    89   3.56    16   4.22  0.661  6.91 TRUE       
## 3           1    74   5.20    13   5.75  0.546 10.1  TRUE       
## 4           2    96   5.11    32   5.71  0.595  8.53 TRUE

Summary

Summarise how you will use statistics to solve the problem or answer your question.

Analysis

Summarise the important variables in your investigation.
Use visualisation to highlight interesting features of the data and tell the overall story.
Apply any appropriate hypothesis tests for your investigation.
Ensure you state and check any assumptions.
Report the appropriate values and interpret the results.
Here are some examples of mathematical equations using RMarkdown:

\[H_0: \mu_1 = \mu_2 \]

\[H_A: \mu_1 \ne \mu_2\]

\[S = \sum^n_{i = 1}d^2_i\]

You can also place equations inline: \(z = \frac{x - \mu}{\sigma}\)

Discussion

Discuss the major findings of your investigation
Discuss any strengths and limitations.
Propose directions for future investigations.
This is a good place to re-state your findings as a final conclusion. What is the one take home message the reader should leave with?
Your final conclusion needs to be very clear.

References

Antoine Soetewey 2020, Outliers detection in R, statsandr, viewed 20 February 2022, https://statsandr.com/blog/outliers-detection-in-r/.

Alboukadel Kassambara, n.d., Correlation Test Between Two Variables in R, STHDA,viewed 20 February 2022, http://www.sthda.com/english/wiki/correlation-test-between-two-variables-in-r

DATAMENTOR, n.d., R Multiple Plots, Datamentor, viewed 26 February 2022, https://www.datamentor.io/r-programming/subplot/

Kan Nishida 2016, Filter with Date data, Medium, viewed 20 February 2022, https://blog.exploratory.io/filter-with-date-function-ce8e84be680

knitr and Jekyll, n.d., Colors (ggplot2), cookbook-r,viewed 20 February 2022, http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/

Why Not Watch Project

Naam Marshall

Last updated: 05 August, 2022