The streaming company “Why Not Watch” wants to improve its recommendation engine to provide better recommendations to customers. Better recommendations increase user engagement and raise the average hours watched per user per day, which is the key metric used to price ads for third-party marketing companies.
The overall question of this investigation is whether or not the new recommendation engine algorithm is worth launching. A few further questions also need to be answered so that the next investigation can be more effective:
Is there any bias in the data collection?
How could any bias be corrected?
What improvements could be made to future A/B tests?
# Load packages and data
library(tidyverse) # read_csv(), dplyr verbs, ggplot2
library(magrittr)  # %<>% compound assignment pipe
data <- read_csv("/Users/marshall/Desktop/Applied analytics/Assessment 3/streaming_data.csv")
data$date %<>% as.Date(format = "%d/%m") #convert to Date (day/month; the year is not supplied and defaults to the current year)
data$gender %<>% as.factor() #convert to factor
data$group %<>% as.factor() #convert to factor
summary(data) #check the summary of each variable
## date gender age social_metric
## Min. :2022-07-01 F:429 Min. :18.00 Min. : 0.000
## 1st Qu.:2022-07-08 M:571 1st Qu.:28.00 1st Qu.: 2.000
## Median :2022-07-16 Median :36.00 Median : 5.000
## Mean :2022-07-16 Mean :36.49 Mean : 4.911
## 3rd Qu.:2022-07-24 3rd Qu.:46.00 3rd Qu.: 8.000
## Max. :2022-07-31 Max. :55.00 Max. :10.000
## time_since_signup demographic group hours_watched
## Min. : 0.00 Min. :1.000 A:880 Min. :0.500
## 1st Qu.: 5.70 1st Qu.:2.000 B:120 1st Qu.:3.530
## Median :11.80 Median :3.000 Median :4.415
## Mean :11.97 Mean :2.603 Mean :4.393
## 3rd Qu.:18.70 3rd Qu.:4.000 3rd Qu.:5.322
## Max. :24.00 Max. :4.000 Max. :8.300
# separate the control group and treatment group, and the pre-treatment period
a_after <- data %>% filter(group == "A" & date >= as.Date("2022-07-18"))
b_after <- data %>% filter(group == "B" & date >= as.Date("2022-07-18"))
whole_before <- data %>% filter(date < as.Date("2022-07-18"))
data_a <- data %>% filter(group == "A")
#summary_a<-as.data.frame(apply(a_after %>% select(-c(date, gender, group)),2,summary))
#summary_b<-as.data.frame(apply(b_after %>% select(-c(date, gender, group)),2,summary))
#summary_a
#summary_b
From the summaries and the boxplots below, there are some outliers in the hours_watched column, but none of them look like impossible values, so I am not going to remove them from the dataset.
par(mfrow=c(1,3))
#age
box_age <- boxplot(data$age,
ylab = "age",
main = "Boxplot of age"
)
mtext(side=1,paste("Outliers: ", paste(box_age$out, collapse = ", ")))
#social metric
box_social <- boxplot(data$social_metric,
ylab = "social metric",
main = "Boxplot of social metric"
)
mtext(side=1,paste("Outliers: ", paste(box_social$out, collapse = ", ")))
#time_since_signup
box_time <- boxplot(data$time_since_signup,
ylab = "time since signup",
main = "Boxplot of time since signup"
)
mtext(side=1,paste("Outliers: ", paste(box_time$out, collapse = ", ")))
par(mfrow=c(1,2))
par(cex.main=1)
#demographic
box_demo <- boxplot(data$demographic,
ylab = "demographic number",
main = "Boxplot of demographic number"
)
mtext(side=1,paste("Outliers: ", paste(box_demo$out, collapse = ", ")))
#hours_watched (use a separate variable so box_demo is not overwritten)
box_hours <- boxplot(data$hours_watched,
ylab = "hours watched per day",
main = "Boxplot of hours watched per day"
)
mtext(side=1,paste("Outliers: ", paste(box_hours$out, collapse = ", ")))
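As a quick numeric cross-check on the boxplots, the flagged hours_watched values can also be counted directly. This is a minimal sketch using the conventional 1.5 × IQR rule, which is approximately the rule boxplot() applies:
# count hours_watched values beyond 1.5 * IQR from the quartiles
q <- quantile(data$hours_watched, probs = c(0.25, 0.75))
iqr <- q[2] - q[1]
sum(data$hours_watched < q[1] - 1.5 * iqr | data$hours_watched > q[2] + 1.5 * iqr)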
I used visualisations to check the relationship between hours_watched and the other variables, together with ANOVA for the categorical variable and Pearson correlation tests for the numeric variables.
For the variable “date”, I assigned 7 different colours in the plot to see whether the day of the week is related to the hours watched. If we assume people watch more on Saturdays and Sundays, an uneven split of weekend days between the two periods can bias the comparison: in September 2022, for example, there are 5 weekend days before the 18th of the month but only 3 from the 18th onwards. However, because the data does not give the year of the investigation, I cannot reliably map each date to a day of the week to compare weekends against weekdays (a rough sketch of such a comparison appears after the correlation results below). From the plot, there is no sign of a relationship between the day and the hours watched.
For the variable “gender”, I used lm() and anova() to test the null hypothesis that hours_watched does not depend on gender. The p-value is 0.402, which is above 0.05, so the null hypothesis is not rejected: hours_watched and gender appear to be independent.
For the variables “age”, “social_metric” and “time_since_signup”, I used the Pearson correlation coefficient with Student's t-test for significance; the p-values are as follows.
par(mfrow=c(2,2))
# date
p1 <- ggplot(data = data_a, aes(x = hours_watched, y = as.character(date)))
p1 + geom_boxplot(fill = rep(c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00"), length.out = 31)) # one colour per day of the week, cycled across the 31 days
#gender
p2 <- ggplot(data = data_a, aes(x = gender, y = hours_watched))
p2 + geom_boxplot()
corr_gender <- lm(hours_watched~gender, data=data)
anova(corr_gender)
## Analysis of Variance Table
##
## Response: hours_watched
## Df Sum Sq Mean Sq F value Pr(>F)
## gender 1 1.25 1.2513 0.7039 0.4017
## Residuals 998 1774.17 1.7777
#age
p3 <- plot(hours_watched~age, data=data_a)
corr_age <- cor.test(data_a$age, data_a$hours_watched, method = 'pearson')
corr_age
##
## Pearson's product-moment correlation
##
## data: data_a$age and data_a$hours_watched
## t = -21.794, df = 878, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6337707 -0.5478657
## sample estimates:
## cor
## -0.5925001
#social_metric
p4 <- plot(hours_watched~social_metric, data=data_a)
corr_social <- cor.test(data_a$social_metric, data_a$hours_watched, method = 'pearson')
corr_social
##
## Pearson's product-moment correlation
##
## data: data_a$social_metric and data_a$hours_watched
## t = 6.4953, df = 878, p-value = 1.386e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1501594 0.2762984
## sample estimates:
## cor
## 0.2141214
#time_since_signup
p5 <- plot(hours_watched~time_since_signup, data=data_a)
corr_time <- cor.test(data_a$time_since_signup, data_a$hours_watched, method = 'pearson')
corr_time
##
## Pearson's product-moment correlation
##
## data: data_a$time_since_signup and data_a$hours_watched
## t = -0.19602, df = 878, p-value = 0.8446
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07267021 0.05949767
## sample estimates:
## cor
## -0.00661516
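Returning to the date variable discussed earlier: if the year were known, the weekend/weekday comparison could be made directly. The following is a rough sketch, valid only under the assumption that the parsed year (2022, the default as.Date() supplies when the year is missing) is actually correct, which the data does not guarantee:
# hypothetical weekend vs weekday comparison for group A (depends on the assumed year being right)
wk <- data_a %>%
  mutate(weekend = as.POSIXlt(date)$wday %in% c(0, 6)) %>% # 0 = Sunday, 6 = Saturday
  group_by(weekend) %>%
  summarise(mean_hours = mean(hours_watched), n = n())
wk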
### Linear regression model
par(mfrow=c(2,2))
#age
fit_age <- lm(hours_watched ~ age, data = data_a)
summary(fit_age)
##
## Call:
## lm(formula = hours_watched ~ age, data = data_a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5142 -0.7242 -0.0030 0.7474 3.0046
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.980111 0.126542 55.16 <2e-16 ***
## age -0.073137 0.003356 -21.79 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.067 on 878 degrees of freedom
## Multiple R-squared: 0.3511, Adjusted R-squared: 0.3503
## F-statistic: 475 on 1 and 878 DF, p-value: < 2.2e-16
a0_age <- coef(fit_age)[1]
a1_age <- coef(fit_age)[2]
# setup x variable with range of interest
xfit_age <- seq(min(data_a$age), max(data_a$age), 1)
# calculate the fitting function yfit based on the coefficients from the model
yfit_age <- a0_age + a1_age * xfit_age
gg_age <- ggplot()
gg_age <- gg_age + geom_point(aes(x = data_a$age, y = data_a$hours_watched))
gg_age <- gg_age+ geom_line(aes(x = xfit_age, y = yfit_age), colour = "#E69F00")
gg_age <- gg_age+ labs(x = 'age', y = 'hours watched')
gg_age
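An equivalent way to draw this fitted line, without computing xfit_age and yfit_age by hand, is to let ggplot2 fit it directly; a brief sketch using geom_smooth():
# equivalent plot: geom_smooth() fits and draws the least-squares line itself
ggplot(data_a, aes(x = age, y = hours_watched)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, colour = "#E69F00") +
  labs(x = "age", y = "hours watched")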
#social metric
fit_social <- lm(hours_watched ~ social_metric, data = data_a)
summary(fit_social)
a0_social <- coef(fit_social)[1]
a1_social <- coef(fit_social)[2]
# setup x variable with range of interest
xfit_social <- seq(min(data_a$social_metric), max(data_a$social_metric), 1)
# calculate the fitting function yfit based on the coefficients from the model
yfit_social <- a0_social + a1_social * xfit_social
gg_social <- ggplot()
gg_social<- gg_social + geom_point(aes(x = data_a$social_metric, y = data_a$hours_watched))
gg_social<- gg_social + geom_line(aes(x = xfit_social, y = yfit_social), colour = "#E69F00")
gg_social<- gg_social + labs(x = 'social metric', y = 'hours watched')
gg_social
#demographic
fit_demo <- lm(hours_watched ~ demographic, data = data_a)
summary(fit_demo)
##
## Call:
## lm(formula = hours_watched ~ demographic, data = data_a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5961 -0.7685 -0.0064 0.7865 3.3832
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.68799 0.09877 57.59 <2e-16 ***
## demographic -0.53062 0.03547 -14.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.183 on 878 degrees of freedom
## Multiple R-squared: 0.2031, Adjusted R-squared: 0.2022
## F-statistic: 223.8 on 1 and 878 DF, p-value: < 2.2e-16
a0_demo <- coef(fit_demo)[1]
a1_demo <- coef(fit_demo)[2]
# setup x variable with range of interest
xfit_demo <- seq(min(data_a$demographic), max(data_a$demographic), 1)
# calculate the fitting function yfit based on the coefficients from the model
yfit_demo <- a0_demo + a1_demo * xfit_demo
gg_demo <- ggplot()
gg_demo<- gg_demo + geom_point(aes(x = data_a$demographic, y = data_a$hours_watched))
gg_demo<- gg_demo + geom_line(aes(x = xfit_demo, y = yfit_demo), colour = "#E69F00")
gg_demo<- gg_demo + labs(x = 'demographic', y = 'hours watched')
gg_demo
# check regression assumptions with residual diagnostic plots
par(mfrow=c(2,2))
plot(fit_age)
plot(fit_demo)
plot(fit_social)
I use all of the group A (control) data to fit the multiple regression model, because these observations come from the same population under the same recommendation engine, and group A has by far the largest sample size, which gives more precise estimates than a smaller sample would. The fitted model is
\[\text{hours\_watched} = 6.535941 - 0.072279 \times \text{age} + 0.084869 \times \text{social\_metric}\]
reg_control <- lm(hours_watched ~ age+social_metric, data = data_a)
summary(reg_control)
##
## Call:
## lm(formula = hours_watched ~ age + social_metric, data = data_a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6244 -0.6361 -0.0271 0.6988 2.8773
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.535941 0.137147 47.657 < 2e-16 ***
## age -0.072279 0.003262 -22.157 < 2e-16 ***
## social_metric 0.084869 0.011619 7.305 6.25e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.037 on 877 degrees of freedom
## Multiple R-squared: 0.3883, Adjusted R-squared: 0.3869
## F-statistic: 278.3 on 2 and 877 DF, p-value: < 2.2e-16
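To illustrate how this control-group model can be used, here is a small sketch predicting baseline hours watched for a hypothetical viewer (the age of 30 and social_metric of 5 are made-up example values):
# predicted baseline hours watched, with a 95% confidence interval, for a hypothetical viewer
new_user <- data.frame(age = 30, social_metric = 5)
predict(reg_control, newdata = new_user, interval = "confidence")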
# Demographic
unique(data$demographic)
## [1] 1 3 4 2
demo_1 <- data %>% filter(demographic==1) %>% select(-date)
summary(demo_1)
## gender age social_metric time_since_signup demographic
## F:216 Min. :18.00 Min. : 0.000 Min. : 0.100 Min. :1
## M: 0 1st Qu.:22.00 1st Qu.: 2.000 1st Qu.: 5.175 1st Qu.:1
## Median :27.50 Median : 4.000 Median :11.150 Median :1
## Mean :27.11 Mean : 4.787 Mean :11.567 Mean :1
## 3rd Qu.:32.00 3rd Qu.: 7.000 3rd Qu.:18.425 3rd Qu.:1
## Max. :35.00 Max. :10.000 Max. :23.900 Max. :1
## group hours_watched
## A:203 Min. :1.930
## B: 13 1st Qu.:4.435
## Median :5.150
## Mean :5.068
## 3rd Qu.:5.772
## Max. :8.300
demo_2 <- data %>% filter(demographic==2)%>% select(-date)
summary(demo_2)
## gender age social_metric time_since_signup demographic
## F: 0 Min. :18.00 Min. : 0.000 Min. : 0.100 Min. :2
## M:268 1st Qu.:23.00 1st Qu.: 2.000 1st Qu.: 4.975 1st Qu.:2
## Median :27.00 Median : 5.000 Median :11.400 Median :2
## Mean :26.93 Mean : 4.974 Mean :11.360 Mean :2
## 3rd Qu.:31.00 3rd Qu.: 8.000 3rd Qu.:17.800 3rd Qu.:2
## Max. :35.00 Max. :10.000 Max. :23.900 Max. :2
## group hours_watched
## A:236 Min. :2.060
## B: 32 1st Qu.:4.298
## Median :5.100
## Mean :5.097
## 3rd Qu.:5.923
## Max. :8.010
demo_3 <- data %>% filter(demographic==3)%>% select(-date)
summary(demo_3)
## gender age social_metric time_since_signup demographic group
## F:213 Min. :36.00 Min. : 0.00 Min. : 0.10 Min. :3 A:197
## M: 0 1st Qu.:41.00 1st Qu.: 2.00 1st Qu.: 6.60 1st Qu.:3 B: 16
## Median :46.00 Median : 5.00 Median :12.00 Median :3
## Mean :45.82 Mean : 4.77 Mean :12.36 Mean :3
## 3rd Qu.:51.00 3rd Qu.: 8.00 3rd Qu.:19.00 3rd Qu.:3
## Max. :55.00 Max. :10.00 Max. :24.00 Max. :3
## hours_watched
## Min. :0.500
## 1st Qu.:2.790
## Median :3.690
## Mean :3.627
## 3rd Qu.:4.490
## Max. :6.950
demo_4 <- data %>% filter(demographic==4)%>% select(-date)
summary(demo_4)
## gender age social_metric time_since_signup demographic
## F: 0 Min. :36.00 Min. : 0.000 Min. : 0.00 Min. :4
## M:303 1st Qu.:40.00 1st Qu.: 2.000 1st Qu.: 6.90 1st Qu.:4
## Median :45.00 Median : 5.000 Median :12.60 Median :4
## Mean :45.07 Mean : 5.043 Mean :12.51 Mean :4
## 3rd Qu.:50.00 3rd Qu.: 8.000 3rd Qu.:19.20 3rd Qu.:4
## Max. :55.00 Max. :10.000 Max. :24.00 Max. :4
## group hours_watched
## A:244 Min. :0.790
## B: 59 1st Qu.:3.010
## Median :3.850
## Mean :3.828
## 3rd Qu.:4.585
## Max. :6.435
# count the numbers in each demographic category based on the A/B group
check_a_demo <- a_after %>%
select(demographic) %>%
group_by(demographic) %>%
mutate(n_a = n()) %>%
distinct()
check_b_demo <- b_after %>%
select(demographic) %>%
group_by(demographic) %>%
mutate(n_b = n()) %>%
distinct()
# total numbers in each group
n_total_a <- sum(check_a_demo$n_a)
n_total_b <- sum(check_b_demo$n_b)
# proportions in each demographic
check_a_demo$p_a <- check_a_demo$n_a / n_total_a
check_b_demo$p_b <- check_b_demo$n_b / n_total_b
# join on demo categories
check_demo <- inner_join(check_a_demo, check_b_demo)
# calculate the difference in proportions
check_demo$diff <- check_demo$p_a - check_demo$p_b
# chi-squared goodness-of-fit test: do group B's demographic counts match group A's proportions?
obs <- check_demo$n_b
expected_vec <- check_demo$p_a
chisq.test(obs, p = expected_vec)
##
## Chi-squared test for given probabilities
##
## data: obs
## X-squared = 55.716, df = 3, p-value = 4.83e-12
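A related cross-check, assuming the same 18 July cut-off used above, is a chi-squared test of independence between group and demographic on the post-launch data; it asks essentially the same balance question from the contingency-table side:
# independence test between A/B group and demographic for the post-launch period
post_launch <- data %>% filter(date >= as.Date("2022-07-18"))
chisq.test(table(post_launch$group, post_launch$demographic))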
# Check with visualisations
check_demo$ID <- seq.int(nrow(check_demo))
plot_demo <- check_demo %>% pivot_longer(cols = c(p_a, p_b), names_to = "group") # proportions go into the default "value" column
plot_demo
## # A tibble: 8 × 7
## # Groups: demographic [4]
## demographic n_a n_b diff ID group value
## <dbl> <int> <int> <dbl> <int> <chr> <dbl>
## 1 4 73 59 -0.272 1 p_a 0.220
## 2 4 73 59 -0.272 1 p_b 0.492
## 3 3 89 16 0.135 2 p_a 0.268
## 4 3 89 16 0.135 2 p_b 0.133
## 5 1 74 13 0.115 3 p_a 0.223
## 6 1 74 13 0.115 3 p_b 0.108
## 7 2 96 32 0.0225 4 p_a 0.289
## 8 2 96 32 0.0225 4 p_b 0.267
demo_plot <- ggplot(plot_demo) + geom_col(aes(x = ID, y = value, fill = group), position = "dodge") + scale_fill_manual(values = c("#E69F00", "#0C6872"))
demo_plot <- demo_plot + ggtitle("Demographic comparison between group A and group B") +
xlab("Demographic number") + ylab("Probability")
demo_plot <- demo_plot + labs(fill = "Group")
demo_plot
This comparison shows that there is a bias in the A/B group setup: group B's demographic mix differs significantly from group A's (for example, demographic 4 makes up about 49% of group B but only about 22% of group A).
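One possible way to correct for this imbalance (a sketch, not part of the original analysis) is post-stratification: re-weight group B's per-demographic means by group A's demographic proportions before comparing overall averages.
# demographic-adjusted (post-stratified) mean for group B, weighted by group A's demographic mix
b_means <- b_after %>% group_by(demographic) %>% summarise(mean_b = mean(hours_watched))
w_a <- check_a_demo %>% select(demographic, p_a)
adj <- inner_join(b_means, w_a, by = "demographic")
sum(adj$mean_b * adj$p_a)   # adjusted group B mean
mean(a_after$hours_watched) # unadjusted group A mean, for comparison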
# Functions
# two-sample t statistic: difference in sample means divided by its (unpooled) standard error
two_samp_t_stat <- function(xbar1, xbar2, s1, s2, n1, n2) {
  (xbar1 - xbar2) / sqrt((s1^2 / n1) + (s2^2 / n2))
}
# Calculate mean of each demographic
mean_a <- a_after %>%
group_by(demographic) %>%
summarise_at(vars(hours_watched), list(mean = mean))
mean_b <- b_after %>%
group_by(demographic) %>%
summarise_at(vars(hours_watched), list(mean = mean))
# Calculate sd of each demographic
sd_a <- a_after %>%
group_by(demographic) %>%
summarise_at(vars(hours_watched), list(sd = sd))
sd_b <- b_after %>%
group_by(demographic) %>%
summarise_at(vars(hours_watched), list(sd = sd))
# Demographic 1
x_bar1 <- mean_a[[1,2]]
x_bar2 <- mean_b[[1,2]]
s_1 <- sd_a[[1,2]]
s_2 <- sd_b[[1,2]]
n_1 <- check_a_demo %>% filter(demographic == 1) %>% pull(n_a)
n_2 <- check_b_demo %>% filter(demographic == 1) %>% pull(n_b)
df <- n_1 + n_2 - 2
#t-value
t_stat1 <- two_samp_t_stat(x_bar1, x_bar2, s_1, s_2, n_1, n_2)
#t_stat %>% round(digits = 2)
#p-value (two-sided: the t statistic is negative, so double the lower-tail probability)
p_val1 <- pt(t_stat1, df = df, lower.tail = TRUE)
p_val1 <- p_val1 * 2
# Demographic 2
x_bar1 <- mean_a[[2,2]]
x_bar2 <- mean_b[[2,2]]
s_1 <- sd_a[[2,2]]
s_2 <- sd_b[[2,2]]
n_1 <- check_a_demo %>% filter(demographic == 2) %>% pull(n_a)
n_2 <- check_b_demo %>% filter(demographic == 2) %>% pull(n_b)
df <- n_1 + n_2 - 2
#t-value
t_stat2 <- two_samp_t_stat(x_bar1, x_bar2, s_1, s_2, n_1, n_2)
#t_stat %>% round(digits = 2)
#p-value
p_val2 <- pt(t_stat2, df = df, lower.tail = TRUE)
p_val2 <- p_val2 * 2
# Demographic 3
x_bar1 <- mean_a[[3,2]]
x_bar2 <- mean_b[[3,2]]
s_1 <- sd_a[[3,2]]
s_2 <- sd_b[[3,2]]
n_1 <- check_a_demo %>% filter(demographic == 3) %>% pull(n_a)
n_2 <- check_b_demo %>% filter(demographic == 3) %>% pull(n_b)
df <- n_1 + n_2 - 2
#t-value
t_stat3 <- two_samp_t_stat(x_bar1, x_bar2, s_1, s_2, n_1, n_2)
#t_stat %>% round(digits = 2)
#p-value
p_val3 <- pt(t_stat3, df = df, lower.tail = TRUE)
p_val3 <- p_val3 * 2
# Demographic 4
x_bar1 <- mean_a[[4,2]]
x_bar2 <- mean_b[[4,2]]
s_1 <- sd_a[[4,2]]
s_2 <- sd_b[[4,2]]
n_1 <- check_a_demo %>% filter(demographic == 4) %>% pull(n_a)
n_2 <- check_b_demo %>% filter(demographic == 4) %>% pull(n_b)
df <- n_1 + n_2 - 2
#t-value
t_stat4 <- two_samp_t_stat(x_bar1, x_bar2, s_1, s_2, n_1, n_2)
#t_stat %>% round(digits = 2)
#p-value
p_val4 <- pt(t_stat4, df = df, lower.tail = TRUE)
p_val4 <- p_val4 * 2
hypo_test <- data.frame("T_value" = c(t_stat1, t_stat2, t_stat3, t_stat4), "P_value"=c(p_val1,p_val2,p_val3,p_val4), row.names = c("Demographic 1","Demographic 2","Demographic 3","Demographic 4"))
hypo_test %<>% mutate(reject = P_value<0.05)
hypo_test
## T_value P_value reject
## Demographic 1 -2.439771 0.016775078 TRUE
## Demographic 2 -2.570254 0.011325866 TRUE
## Demographic 3 -1.814818 0.072462230 FALSE
## Demographic 4 -3.038549 0.002874262 TRUE
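As a cross-check on the manual calculation, R's built-in t.test() runs the same kind of two-sample comparison. This is a sketch; note that t.test() uses Welch's degrees of freedom rather than the pooled n1 + n2 - 2 used above, so the p-values will differ slightly:
# Welch two-sample t-test for each demographic: post-launch hours watched, group A vs group B
for (d in 1:4) {
  res <- t.test(a_after$hours_watched[a_after$demographic == d],
                b_after$hours_watched[b_after$demographic == d])
  cat("Demographic", d, ": t =", round(unname(res$statistic), 2),
      " p =", signif(res$p.value, 3), "\n")
}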
# Group A
g_a <- a_after %>%
select(demographic, hours_watched) %>%
group_by(demographic) %>%
mutate(n_a = n(), mean_a = mean(hours_watched)) %>%
select(demographic, n_a, mean_a) %>%
distinct()
# Group B
g_b <- b_after %>%
select(demographic, hours_watched) %>%
group_by(demographic) %>%
mutate(n_b = n(), mean_b = mean(hours_watched)) %>%
select(demographic, n_b, mean_b) %>%
distinct()
# effect comparison: join on all common column names
effect_comp_df <- inner_join(g_a, g_b)
effect_comp_df %<>% mutate(effect=mean_b-mean_a)
# required sample size
est_sd <- sd(g_a$mean_a)
z_alpha <- 1.96
effect_comp_df %<>% mutate(n_ss = (z_alpha * est_sd / effect)^2)
effect_comp_df %<>% mutate(significant = n_b > n_ss)
effect_comp_df
## # A tibble: 4 × 8
## # Groups: demographic [4]
## demographic n_a mean_a n_b mean_b effect n_ss significant
## <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <lgl>
## 1 4 73 3.69 59 4.28 0.592 8.62 TRUE
## 2 3 89 3.56 16 4.22 0.661 6.91 TRUE
## 3 1 74 5.20 13 5.75 0.546 10.1 TRUE
## 4 2 96 5.11 32 5.71 0.595 8.53 TRUE
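A more conventional way to size a follow-up test is a power calculation. As a sketch, power.t.test() gives the per-group sample size needed to detect the smallest effect observed above (about 0.55 hours) with 80% power, assuming a within-group standard deviation of roughly 1 hour (close to the residual standard error of the control-group regression); under those assumptions it comes out at roughly fifty users per demographic per group.
# per-group sample size to detect a 0.55-hour difference at 80% power (sd assumed ~1 hour)
power.t.test(delta = 0.55, sd = 1, sig.level = 0.05, power = 0.80)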
For each demographic group, the two-sample t-tests above evaluate the hypotheses
\[H_0: \mu_1 = \mu_2 \]
\[H_A: \mu_1 \ne \mu_2\]
where \(\mu_1\) and \(\mu_2\) are the mean hours watched per day in group A and group B respectively.
\[S = \sum^n_{i = 1}d^2_i\]