library(ggplot2)
library(dplyr)
library(readr)
library(here)
library(magrittr)
library(psych)
The questions raised by the executive at WNW are:
Question 1 Is the new recommendation engine algorithm worth rolling out to all their subscribers?
Question 2 Is there any bias in the data collection?
Question 3 How could any bias be corrected?
Question 4 What improvements could be made to future A/B tests?
STEPS UNDERTAKEN
Step 1 Load streaming data
Step 2 Preprocess the data: inspect the raw data, then understand, tidy and manipulate it into a form suitable for analysis.
Step 3 Plot a scatter matrix (with correlations) to inspect the relationships between variables in group A from 1/7 to 31/7, which is the whole population excluding the treatment group B. This allows the selection of variables that are not highly correlated, reducing multicollinearity issues.
Step 4 From the analysis in Step 3, a multiple linear regression model based on hours_watched (hrw), age and social_metric (socm) is constructed.
Step 5 The four assumptions (linearity, constant variance, normality and independence) of the linear model are checked for two linear models, hrw vs age and hrw vs socm, to ensure they are satisfied before the analyses can be relied on.
Step 6 The streaming habits by age and gender are studied to gain better insight and to identify clustering in the population.
Step 7 The proportions of demographics 1, 2, 3 and 4 are checked to find out whether there is any bias between groups A and B from 18/7 to 31/7. Divisions based on other variables such as age and gender were also carried out but are not included in this report.
Step 8 From Step 7, it is found that the data for A/B are biased; therefore A and B cannot represent the whole population. However, the sub-samples within each demographic of A and B can still represent the population in that demographic, as they share the same demographic characteristics. The differences in mean hrw for each demographic group between A and B are analysed and subjected to hypothesis testing. The effect (E) from 18/7 to 31/7 is calculated for Step 9.
Step 9 To overcome the bias between A/B, the effect of the new recommendation on the whole dataset from 1/7 to 31/7 is calculated as the weighted average of the final effect. This is the sum of the proportions (p) of the demographic subgroups in the whole population multiplied by the effects found in Step 8, as illustrated in the sketch below.
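A minimal illustration of the Step 9 calculation with hypothetical numbers (the actual proportions and effects are computed later in the report):
p <- c(0.25, 0.30, 0.25, 0.20) # hypothetical proportions of demographics 1 to 4 in the whole population
E <- c(0.55, 0.60, 0.66, 0.59) # hypothetical per-demographic effects (hours, B - A)
sum(p * E) # weighted average of the final effect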
The new recommendation engine algorithm should not be rolled out to all their subscribers, as the A/B grouping is biased and cannot be used to represent all the subscribers of WNW. However, on the basis of demographic subgroups, the A/B testing found that demographics 1, 2 and 4 in group B stream more hours than those in group A with 95% confidence. The effect of the new recommendation on the whole dataset from 1/7 to 31/7 is approximated by the weighted average of the final effect, and found to be an increase of 36 minutes.
There is a bias in the dataset: the analysis of the demographic subgroups revealed discrepancies ranging from -27% to 14% between the proportions in A and B, and the residual is not normally distributed.
The bias can be corrected by a number of techniques, for example collecting several samples each with more than 30 observations, and using stratified and cluster sampling as well as simple random sampling (see the sketch below).
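A minimal sketch of stratified random sampling with dplyr, assuming the streaming data has been loaded into stream_df as in the preprocessing section below and dplyr >= 1.0.0 is available (sample_frac() is the older equivalent of slice_sample()). The same fraction is drawn from each demographic so the sample mirrors the population mix:
set.seed(42) # for reproducibility
strat_sample <- stream_df %>%
  group_by(demographic) %>% # stratify by demographic
  slice_sample(prop = 0.5) %>% # sample 50% within each stratum
  ungroup()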
Apart from those mentioned in point 3, the following improvements should be considered for future A/B tests:
Subdivide WNW’s subscribers into more subgroups, e.g. identify cluster groups by gender, age, social metric and other variables to find out which combinations of subgroups are most responsive to the new recommendation engine.
The treatment group could be subject to a marketing campaign, with the goal of increasing the conversion rate and revenue, which means there may be a case for making the treatment group as large as possible.
Use common design techniques such as block designs and factorial experiments (see the sketch below).
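A minimal sketch of a randomised block design, assuming the streaming data has been loaded into stream_df as in the preprocessing section below: A/B is randomised within each gender-by-demographic block, so both factors are balanced between the groups by construction.
set.seed(42)
balanced <- stream_df %>%
  group_by(gender, demographic) %>% # each gender x demographic cell acts as a block
  mutate(assigned_group = sample(rep(c("A", "B"), length.out = n()))) %>% # random A/B within the block
  ungroup()
table(balanced$assigned_group, balanced$demographic) # check the resulting balance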
Load and preprocess “streaming_data”
stream <- "streaming_data.csv"
stream_df <- read.csv(here(stream), header = TRUE)
stream_df$gender <- gsub("F", "0", stream_df$gender) #change gender F to 0
stream_df$gender <- gsub("M", "1", stream_df$gender) #change gender M to 1
stream_df$gender <- as.numeric(stream_df$gender) #convert gender from character to numeric
streamF <- filter(stream_df, stream_df$gender == 0) # select only females
streamM <- filter(stream_df, stream_df$gender == 1) # select only males
summary(stream_df) # Inspect the dataset for possible errors
sum(is.na(stream_df)) # Check the dataset to make sure that there are no missing values
mean(is.na(stream_df)) # Proportion of missing values in the dataset (should be 0)
From the above preprocessing of the data, it is confirmed that the dataset is clean and ready to be used for analysis.
A1 <- filter(stream_df, stream_df$group == "A") #Select group A from start till end
A <- subset(A1, select = -c(date, group)) #Exclude date and group
Scatter matrix and correlation
#pairs.panels(A,
# method = "pearson", # correlation method
# hist.col = "#00AFBB",
# density = TRUE, # show density plots
# ellipses = TRUE # show correlation ellipses
# )
# Graph excluded from final submission
From the scatter matrix graph and the correlation numbers, the correlation between age and demographic is 0.76. Only one of them will be included in the multiple linear regression (MLR) model, to reduce the multicollinearity problem. Since hours_watched (hrw) will be used as the dependent variable, and the correlation between hrw and age (-0.59) is stronger in absolute value than that between hrw and demographic (-0.45), age is chosen as the preferred independent variable, together with social_metric. The other variables are not included due to their very low correlation (less than 0.03) with hrw.
It is also observed that only the hours_watched displays a normal distribution.
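Since the scatter-matrix graph is excluded, the pairwise Pearson correlations quoted above can also be reproduced as a plain table with cor(); this is only an alternative view of the same numbers, restricted to columns named in the report:
round(cor(A[, c("age", "social_metric", "demographic", "hours_watched")]), 2) # e.g. age vs demographic about 0.76, hours_watched vs age about -0.59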
Multiple Linear Regression of all A (MLRA)
MLRA <- lm(hours_watched ~ age + social_metric, data = A)
summary(MLRA)
The regression summary for MLRA shows that the p-values for age and social_metric are well below 1%, so both variables are retained in the multiple regression model.
The beta coefficient of the intercept suggests that hours_watched is estimated at 6.54 hours when the independent variables are zero. This could come from factors not captured by the MLR; for example, during the COVID lockdown people generally spent more time streaming as they could not go to work or participate in other activities.
The beta coefficient for age is -0.07, which is consistent with the scatter plot above, where a weak negative relationship between hours_watched and age is observed. This means that older people do not stream as much as younger people.
The beta coefficient for social_metric is 0.085. The relationship between hours_watched and social_metric is not very definite, as the data spread widely in the scatter plot; this can be explained by the nature of the variable, which is recorded as discrete values between 0 and 10. However, the weak positive relationship is significant and is confirmed in the MLR.
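The p-values quoted above come from summary(MLRA); the corresponding ANOVA table and 95% confidence intervals for the coefficients can be obtained with the standard lm helpers (a supplementary check, not part of the original output):
anova(MLRA) # sequential sums of squares and F tests for age and social_metric
confint(MLRA) # 95% confidence intervals for the beta coefficients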
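As a small illustration of what the fitted coefficients imply, a prediction for a hypothetical subscriber using the rounded coefficients quoted above:
predict(MLRA, newdata = data.frame(age = 25, social_metric = 6)) # predicted hours_watched for a 25-year-old with social_metric = 6
# roughly 6.54 - 0.07 * 25 + 0.085 * 6, i.e. about 5.3 hours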
Check Assumptions of linear model: fit1 = hrw vs age
Linearity-The scatter matrix above shows that there is a linear relationship between hrw and age, hence satisfying the linearity assumption.
Constant variance-To check this assumption, a linear model hrw vs age is created (fit1).
y <- A[, 6] # Select hours_watched as y
x <- A[, 2] # Select age as x
fit1 = lm(y ~ x)
Plot fit1 over the data
# setup x variable with the min/max range
a0 <- coef(fit1)[1]
a1 <- coef(fit1)[2]
xfit1 <- seq(min(x), max(x), 1)
yfit1 <- a0 + a1 * xfit1 # calculate the fitting function yfit1
Plot hours_watched vs age
gg <- ggplot()
gg <- gg + geom_point(aes(x = x, y = y, colour ='Observed'))
gg <- gg + geom_line(aes(x = xfit1, y = yfit1), colour = 'black')
gg <- gg + labs(x = 'age', y = 'hours_watched')
# gg Graph excluded for final submission
The hours_watched vs age graph of the observations and the fitted line (fit1) shows that the errors are evenly distributed around the fitted line, thus the assumption of constant variance is satisfied.
Check normal distribution of the error term
Af1 <- a0 + a1 * A$age #Af1 is the forecast hours_watched based on the original age observations
A$e1 <- A$hours_watched - Af1 # calculate the residual
gg <- ggplot()
gg <- gg + geom_point(aes(x, y = A$e1))
gg <- gg + labs(x = 'age', y = 'residual')
#qqnorm(A$e1,
# main = 'QQ Plot for residual1 of hours_watched',
#        xlab = 'Theoretical Quantiles', ylab = 'Residual1', col = "violet")
#qqline(A$e1)
# Graphs excluded from final submission
The QQ plot of the residual of hours_watched shows that the residual satisfies the normality assumption.
Check histogram of residual
gg <- ggplot()
gg <- gg + geom_histogram(aes(x = A$e1), bins = 10, colour = "black")
gg <- gg + labs(title = "Residual1 of hours_watched", x = "Residual1")
# gg graph excluded from final submission
The histogram confirms that the residual of hours_watched is normally distributed.
The independence assumption is satisfied as the residuals have no trend in the residual plot.
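The same conclusions can be cross-checked with base R diagnostics (a supplementary check, not part of the original submission):
plot(fit1, which = 1) # residuals vs fitted values: no funnel shape (constant variance) and no trend (independence)
acf(residuals(fit1), main = "ACF of fit1 residuals") # autocorrelation of the residuals should be negligible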
Check Assumptions of linear model: fit2 = hrw vs social_metric
Linearity-The scatter plot shows that there is a linear relationship between hrw and social_metric, hence satisfying the linearity assumption.
Constant variance-To check this assumption, a linear model hrw vs social_metric is created (fit2) and plotted over the data below.
Create a linear model hrw vs social_metric
y2 <- A[, 6] # Select hours_watched as y
x2 <- A[, 3] # Select social_metric as x2
fit2 = lm(y2 ~ x2)
summary(fit2)
Plot fit2 over the data
# setup x2 variable with the min/max range
xfit2 <- seq(min(x2), max(x2), 1)
b0 <- coef(fit2)[1]
b1 <- coef(fit2)[2]
yfit2 <- b0 + b1 * xfit2 # calculate the fitting function yfit2
gg <- ggplot()
gg <- gg + geom_point(aes(x = x2, y = y2, colour='observed'))
gg <- gg + geom_line(aes(x = xfit2, y = yfit2), colour = 'black')
gg <- gg + labs(x = 'social_metric', y = 'hours_watched')
#gg #Graph excluded from final version
The hours_watched vs social_metric graph of the observations and the fitted line (fit2) shows that the errors are evenly distributed around the fitted line, thus the assumption of constant variance is satisfied.
Check normal distribution of the error term
Af2 <- b0 + b1 * A$social_metric #Af2 is the forecast hours_watched based on the original social_metric observations
A$e2 <- A$hours_watched - Af2 # calculate the residual
gg <- ggplot()
gg <- gg + geom_point(aes(x = x2, y = A$e2))
gg <- gg + labs(x = 'social_metric', y = 'residual')
#gg Graph excluded from final submission
#qqnorm(A$e2,
# main = 'QQ Plot for residual2 of hours_watched',
#        xlab = 'Theoretical Quantiles', ylab = 'Residual', col = "violet")
#qqline(A$e2)
#Graphs excluded from final submission
The QQ plot of the residual of hours_watched shows that the residual satisfies the normality assumption.
The independence assumption is satisfied as the residuals are randomly distributed with no discernible trend.
Check histogram of residual
gg <- ggplot()
gg <- gg + geom_histogram(aes(x = A$e2), bins = 10, colour = "black")
gg <- gg + labs(title = "Residual2 of hours_watched", x = "Residual2")
#gg Graph excluded from final submission
The histogram confirms that the residual of hours_watched is normally distributed.
Examine the hours_watched characteristics of the variables
Streaming behaviour of old vs young
stream_df <- read.csv(here(stream), header = TRUE)
p1<- ggplot(stream_df, aes(x=age, y=hours_watched)) +
geom_point(size=2, shape=23)
#p1 Graph excluded from final submission
The plot shows that older people spend less time streaming than younger people.
Check for the streaming behaviour between Female and Male
p2<- ggplot(stream_df, aes(x=gender, y=hours_watched)) +
geom_point(size=2, shape=23)
#p2 Graph excluded from final submission
Difference of hrw between female and male
Fmeanhrw <- mean(streamF$hours_watched) #Average hrw of females (0)
Mmeanhrw <- mean(streamM$hours_watched) #Average hrw of males (1)
The average hrw watched by females is 4.35 hours; the average hrw watched by males is 4.42 hours. On average, males spend slightly more time (4.2 minutes) streaming than females, and the plot shows that more males than females stream between 7 and 8 hours.
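A Welch two-sample t-test can be used to check whether this 4.2-minute difference is statistically significant (a supplementary check, not part of the original analysis):
t.test(streamM$hours_watched, streamF$hours_watched, alternative = "two.sided", var.equal = FALSE) # males (1) vs females (0)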
Check age distribution
stream_df2 <- read.csv(here(stream), header = TRUE)
p4 <-ggplot(data=stream_df2, aes(x=age, y=hours_watched)) +
geom_bar(stat="identity")
#p4 Graph excluded from final submission
The total hours_watched for ages 20, 23 and 28 exceeds 175 hours.
Age would be a logical divider in terms of demographic groups of interest in later analysis.
Different marketing strategies would be needed based on age and gender.
gg <- ggplot(stream_df2, aes(x = age, fill = gender))
gg <- gg + geom_histogram(alpha = 0.5, position = "identity", binwidth = 1)
#gg Graph excluded from final submission
Clusters
int_breaks_rounded <- function(x, n = 10) pretty(x, n)[round(pretty(x, n),1) %% 1 == 0]
gg <- ggplot()
gg <- gg + geom_point(aes(x = stream_df2$age,
y = stream_df2$hours_watched))
gg <- gg + scale_y_continuous(breaks = int_breaks_rounded )
gg <- gg + labs(x = 'Age (years)', y = 'hours_watched')
#gg Graph excluded from final submission
A clear split in the data occurs at hours_watched = 7: the plot shows that for ages below 35, many observations exceed 7 hours watched.
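A minimal clustering sketch (a supplementary check, not part of the original submission): k-means on standardised age and hours_watched to see whether the visual split at about 7 hours corresponds to distinct clusters.
set.seed(42)
clust_data <- scale(stream_df2[, c("age", "hours_watched")]) # standardise so both variables carry equal weight
km <- kmeans(clust_data, centers = 2, nstart = 25) # two clusters, 25 random starts
table(km$cluster) # cluster sizes
aggregate(stream_df2[, c("age", "hours_watched")], by = list(cluster = km$cluster), FUN = mean) # cluster profiles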
Check for Bias
Check for any bias in A/B based on WNW’s selection
Calculate proportions of demographic (demo) distribution in A
Separate demographic into 1 to 4
post177 <- stream_df[549:1000,] #Select rows from 18 July till end
M2A <- filter(post177, post177$group == "A") #Select group A from 18 July till end
AM2demo1 <- filter(M2A, M2A$demographic == 1) # contains demographic = 1
AM2demo2 <- filter(M2A, M2A$demographic == 2) # contains demographic = 2
AM2demo3 <- filter(M2A, M2A$demographic == 3) # contains demographic = 3
AM2demo4 <- filter(M2A, M2A$demographic == 4) # contains demographic = 4
AM2demo <- filter(M2A, M2A$demographic <= 5) # contains demographic = 1 to 4
Ademo1Count <- AM2demo1%>% count(demographic) # Count of demographic = 1
Ademo2Count <- AM2demo2%>% count(demographic) # Count of demographic = 2
Ademo3Count <- AM2demo3%>% count(demographic) # Count of demographic = 3
Ademo4Count <- AM2demo4%>% count(demographic) # Count of demographic = 4
AdemoCount <- AM2demo%>% count(demographic) # Count of all demographic in A
Ademo1C <- sum(Ademo1Count[,2]) # Extract count of demo = 1
Ademo2C <- sum(Ademo2Count[,2]) # Extract count of demo = 2
Ademo3C <- sum(Ademo3Count[,2]) # Extract count of demo = 3
Ademo4C <- sum(Ademo4Count[,2]) # Extract count of demo = 4
AdemoC <- sum(AdemoCount[,2]) # Extract count of all demo
ATot_demo <- AdemoC
#Calculate proportions in A
Ademo1 <- round((Ademo1C/ATot_demo) * 100, digits= 2)
Ademo2 <- round((Ademo2C/ATot_demo) * 100, digits= 2)
Ademo3 <- round((Ademo3C/ATot_demo) * 100, digits= 2)
Ademo4 <- round((Ademo4C/ATot_demo) * 100, digits= 2)
Calculate proportions of demographic (demo) distribution in B
Separate demographic into 1 to 4
M2B <- filter(post177, post177$group == "B")
BM2demo1 <- filter(M2B, M2B$demographic == 1) # contains demographic = 1
BM2demo2 <- filter(M2B, M2B$demographic == 2) # contains demographic = 2
BM2demo3 <- filter(M2B, M2B$demographic == 3) # contains demographic = 3
BM2demo4 <- filter(M2B, M2B$demographic == 4) # contains demographic = 4
BM2demo <- filter(M2B, M2B$demographic <= 5) # contains demographic = 1 to 4
Bdemo1Count <- BM2demo1%>% count(demographic) # Count of demographic = 1
Bdemo2Count <- BM2demo2%>% count(demographic) # Count of demographic = 2
Bdemo3Count <- BM2demo3%>% count(demographic) # Count of demographic = 3
Bdemo4Count <- BM2demo4%>% count(demographic) # Count of demographic = 4
BdemoCount <- BM2demo%>% count(demographic) # Count of all demographic in B
Bdemo1C <- sum(Bdemo1Count[,2]) # Extract number of demo = 1
Bdemo2C <- sum(Bdemo2Count[,2]) # Extract number of demo = 2
Bdemo3C <- sum(Bdemo3Count[,2]) # Extract number of demo = 3
Bdemo4C <- sum(Bdemo4Count[,2]) # Extract number of demo = 4
BdemoC <- sum(BdemoCount[,2]) # Extract number of all demographic in B
BTot_demo <- BdemoC
#calculate proportions in B
Bdemo1 <- round((Bdemo1C/BTot_demo) * 100, digits= 2)
Bdemo2 <- round((Bdemo2C/BTot_demo) * 100, digits= 2)
Bdemo3 <- round((Bdemo3C/BTot_demo) * 100, digits= 2)
Bdemo4 <- round((Bdemo4C/BTot_demo) * 100, digits= 2)
Check group balances
WNW set up the A/B groups; check whether the A/B groups represent the total population.
Count the numbers in each demographic category based on the A/B group from 18/7
check_a_df <- M2A %>%
select(demographic) %>%
group_by(demographic) %>%
mutate(n_a = n()) %>%
distinct()
check_b_df <- M2B %>%
select(demographic) %>%
group_by(demographic) %>%
mutate(n_b = n()) %>%
distinct()
n_total_a <- sum(M2A$group == 'A') # total numbers in A
n_total_b <- sum(M2B$group == 'B') # total numbers in B
check_a_df$p_a <- check_a_df$n_a / n_total_a
check_b_df$p_b <- check_b_df$n_b / n_total_b
check_df <- inner_join(check_a_df, check_b_df) # join on demo categories
# calculate the difference in proportions
check_df$diff <- check_df$p_a - check_df$p_b
# if there is no bias apart from sampling noise then the difference should be small and normally distributed
#qqnorm(check_df$diff,
#        main = 'Residual of difference in demographic proportions between A and B',
#        xlab = 'Theoretical Quantiles', ylab = 'Residual', col = "violet")
#qqline(check_df$diff)
# Graphs excluded from final submission
The plot shows that the error is not normally distributed and has a left-skewed distribution. This is confirmed in the histogram below.
Histogram of error in the demographic distribution
# total numbers in each group
n_total_a <- sum(M2A$group == 'A')
n_total_b <- sum(M2B$group == 'B')
check_a_df$p_a <- check_a_df$n_a / n_total_a # proportions in demographic A
check_b_df$p_b <- check_b_df$n_b / n_total_b # proportions in demographic B
check_df <- inner_join(check_a_df, check_b_df) # join on demo categories
check_df$diff <- check_df$p_a - check_df$p_b # difference in proportions
gg <- ggplot()
gg <- gg + geom_histogram(aes(x = check_df$diff), bins = 10, colour = "black")
gg <- gg + labs(title = "Residual of difference in demographic proportions between A and B", x = "Residual")
#gg Graph excluded from final submission
Difference in demographic proportions between A and B
demo1ABdiff <- Ademo1 - Bdemo1 #% Difference bet demographic 1 in A and B
demo2ABdiff <- Ademo2 - Bdemo2 #% Difference bet demographic 2 in A and B
demo3ABdiff <- Ademo3 - Bdemo3 #% Difference bet demographic 3 in A and B
demo4ABdiff <- Ademo4 - Bdemo4 #% Difference bet demographic 4 in A and B
#demo1ABdiff
#demo2ABdiff
#demo3ABdiff
#demo4ABdiff
Summary of differences in proportions between A and B:
Demographic 1 = 11.46%
Demographic 2 = 2.25%
Demographic 3 = 13.48%
Demographic 4 = -27.18%
The differences in demographic proportions in A and B shown in the analyses above confirm that the data are biased.
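The imbalance can also be tested formally with a chi-squared test of homogeneity on the group-by-demographic contingency table from 18/7 onwards (a supplementary check, not part of the original submission); a small p-value indicates that the demographic mix differs between A and B:
chisq.test(table(post177$group, post177$demographic))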
As the data is biased, A and B cannot represent the whole population. However, the sub-sample inside the demographic of A and B can still represent the population in that demographic as they have the same demographic characteristics.
Check the differences in average hours_watched for each demographic group between A and B.
# Compare sd
demo1diff <- sd(AM2demo1$hours_watched) - sd(BM2demo1$hours_watched)
demo2diff <- sd(AM2demo2$hours_watched) - sd(BM2demo2$hours_watched)
demo3diff <- sd(AM2demo3$hours_watched) - sd(BM2demo3$hours_watched)
demo4diff <- sd(AM2demo4$hours_watched) - sd(BM2demo4$hours_watched)
The standard deviations of the demographic groups in A and B are found to be unequal; therefore Welch's t-test (unequal variances) is used.
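The unequal standard deviations can also be checked formally with an F test for equality of variances (a supplementary check); the same call applies to demographics 2 to 4:
var.test(AM2demo1$hours_watched, BM2demo1$hours_watched) # demographic = 1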
# Compare means
demo1Mdiff <- mean(AM2demo1$hours_watched) - mean(BM2demo1$hours_watched)
demo2Mdiff <- mean(AM2demo2$hours_watched) - mean(BM2demo2$hours_watched)
demo3Mdiff <- mean(AM2demo3$hours_watched) - mean(BM2demo3$hours_watched)
demo4Mdiff <- mean(AM2demo4$hours_watched) - mean(BM2demo4$hours_watched)
#demo1Mdiff Not shown in final submission
#demo2Mdiff
#demo3Mdiff
#demo4Mdiff
The differences (A - B) in mean hrw between the demographic groups in A/B are: demo1 = -0.5459, demo2 = -0.5950, demo3 = -0.6611, demo4 = -0.5917.
Hypothesis Tests
Null Hypothesis: The means of A and B are the same
Alternative Hypothesis: The means of A and B are not the same
t.test(AM2demo1$hours_watched, BM2demo1$hours_watched, alternative = "two.sided", var.equal = FALSE) # demographic = 1
t.test(AM2demo2$hours_watched, BM2demo2$hours_watched, alternative = "two.sided", var.equal = FALSE) # demographic = 2
t.test(AM2demo3$hours_watched, BM2demo3$hours_watched, alternative = "two.sided", var.equal = FALSE) # demographic = 3
t.test(AM2demo4$hours_watched, BM2demo4$hours_watched, alternative = "two.sided", var.equal = FALSE) # demographic = 4
The t-tests show that only demographic 3 is not significant at the 5% level of significance. For demographics 1, 2 and 4 the null hypothesis is rejected, and it is concluded that the mean hours_watched differs significantly between A and B in those demographics.
Correction of bias in A/B
Use the effects calculated from 18/7 to 31/7, and the proportions from the whole population from 1/7 to 31/7 to estimate the weighted average of the final effect.
Effect = difference in mean hours_watched (B - A) from 18/7 to 31/7
Ademo1<- mean(AM2demo1$hours_watched)
Ademo2<- mean(AM2demo2$hours_watched)
Ademo3<- mean(AM2demo3$hours_watched)
Ademo4<- mean(AM2demo4$hours_watched)
Bdemo1<- mean(BM2demo1$hours_watched)
Bdemo2<- mean(BM2demo2$hours_watched)
Bdemo3<- mean(BM2demo3$hours_watched)
Bdemo4<- mean(BM2demo4$hours_watched)
# Effect is defined as mean of hrw B - mean of hrw A
demo1E1 <- Bdemo1 - Ademo1 #Effect of demo 1
demo2E2 <- Bdemo2 - Ademo2 #Effect of demo 2
demo3E3 <- Bdemo3 - Ademo3 #Effect of demo 3
demo4E4 <- Bdemo4 - Ademo4 #Effect of demo 4
Count the proportion of demographic group based on the complete dataset from 1/7 to 31/7.
check_w_df <- stream_df %>%
select(demographic) %>%
group_by(demographic) %>%
mutate(n_w = n()) %>%
distinct()
Wn_total <- nrow(stream_df) # total number of observations in the full dataset (1/7 to 31/7), used as the denominator for the proportions
Weighted Average of the Final Effect = sum of (p x E) over the demographic groups
check_w_df$p_w <- check_w_df$n_w / Wn_total
df2 <- arrange(check_w_df, demographic) #sort demographic in order
Effect <- c(demo1E1, demo2E2, demo3E3, demo4E4)
Prop <- df2$p_w
WAFinalE <- round(sum(Effect * Prop), digits=4) # Weighted Average final effect
The weighted average of the final effect of hours_watched due to the new recommendation engine is estimated to be 0.5975 hours (about 36 minutes).