Overview

The world has entered a new age of increasingly frequent and intense periods of turbulence. We cannot help but watch events unfold in real time as we try our best to gauge the shockwaves that will impact the global economy. Some sectors and organisations have been devastated (tourism) but others have flourished. Within the video streaming services, this could not be any truer.

Not so long ago, streaming services disrupted the entertainment industry. Now these digital disruptors are being disrupted themselves. Competition between the streaming titans is heating up and the competition for user engagement through increasingly well targeted content has never been higher. An increase in user engagement means more hours of streaming and plenty of opportunities for advertising revenue and therefore company profit.

Streaming company Why Not Watch? (WNW) has built up a healthy customer base with good content offerings but is continually refining their recommendation engine to provide better recommendations to customers. Better recommendations improve user engagement and importantly increase the average hours watched per user per day, a key metric used to price ads for 3rd party marketing companies.

The executives at WNW want to know if the new recommendation engine algorithm is worth rolling out to all their subscribers. They have asked you to analyse the results from a recent change they made in their recommendation engine and present the results to the executive team.

The executives are also interested in any other insights you may learn from the sample data (available to download on the Assessment 3 page in Canvas). In particular, they are curious about the following: • Is there any bias in the data collection? • How could any bias be corrected? • What improvements could be made to future A/B tests?

Introduction

As the streaming services market matures, the need to thoroughly test new technologies in the platforms, whilst safeguarding the customer base, increases, in order to provide differentiation.

This has never been easier, given the amount of data generated on a daily basis, which has placed an important focus in the data departments of these companies, in order to provide guidance in decision making.

In this instance, a new recommender system has been tested live in the platform, in order to gain data about its efficiency, what could seem as a straight comparison, requires a deeper look into the control group, in order to understand the validity of the results, which this report will provide.

Problem Statement

The main questions we endeavour to investigate are:

is there a significant difference in the mean hours watched between the groups versus?
do subgroups’ sizes, related to particular variables, have an effect in the outcome for the test?
if not, what would be their required size and scope of a new test?
what other findings this test provides us?
what correlations can we find in the analysis?

Setup

Setting up the knitr options for the report.

Installing and loading the packages you need to produce the report below.

The sample data

The sample data shows the effect of an A/B test conducted to measure the effectiveness of a change to the recommendation engine used on some subscribers, but not others. The change to the recommendation went live at 1-minute past midnight on the 18th of July.

Those customers who were unknowingly using the new recommendation engine to suggest what to watch next are labelled as group B, while group A was used as a control group.

Loading the data

streaming_data <- read.csv(
  here::here("/Users/samuelklettnavarro/PY4E/MATH2406_codes/data_MATH2406", 
       "streaming_data.csv"))
streamdf <- streaming_data 
# making a copy in case something goes wrong down the track
head(streamdf)

We know data set has the following fields: • date in format dd/mm (independent variable, IV), which we will transform into a time format. • gender of the customer (IV), this variable could be important to check if the new recommendation engine actually provides better results for either sex. • age of the customer (IV), same as for the gender, we could drill down and see how the new system performs with different age groups. • social metric, which is a combined metric based on previous viewing habits. This variable carries on a historic value, so we could look at the impact that the new system has on similar metrics, it is not clear how this variable is calculated, so we will need to check whether or not this correlates with other independent variables. • number of months since the customer signed up (IV), this metric might be important in comparing the change on the different type, based on platform use time, of user. • demographic number, this is a representation of the customer age, which we can use to compare possible differences in system performance for all the age brackets, or it could be a combination of other factors needing further studies to declare it independent. • group (A/B) where A is the control and B is the treated group, obviously this is the variable of that we will be using to perform most of the comparisons, it could be created from the date column, though it is easier to have it readily available. • number of hours watched in that day is the variable whose dependency to the other variables we are going to study, in order to rate the performance of the new system.

Let’s have a look at the structure of this dataset.

str(streamdf)

## 'data.frame':    1000 obs. of  8 variables:
##  $ date             : chr  "1-Jul" "1-Jul" "1-Jul" "1-Jul" ...
##  $ gender           : chr  "F" "F" "F" "M" ...
##  $ age              : int  28 32 39 52 25 51 53 42 41 20 ...
##  $ social_metric    : int  5 7 4 10 1 0 5 6 8 7 ...
##  $ time_since_signup: num  19.3 11.5 4.3 9.5 19.5 22.6 4.2 8.5 16.9 23 ...
##  $ demographic      : int  1 1 3 4 2 4 3 4 4 2 ...
##  $ group            : chr  "A" "A" "A" "A" ...
##  $ hours_watched    : num  4.08 2.99 5.74 4.13 4.68 3.4 3.07 2.77 2.24 5.39 ...

Scanning the set for missing values.

colSums(is.na(streamdf))

##              date            gender               age     social_metric 
##                 0                 0                 0                 0 
## time_since_signup       demographic             group     hours_watched 
##                 0                 0                 0                 0

We can see that there are not obvious missing values, as all the variables contain the same number for entries.

Data manipulation and single variable discussion

Looking at the variable gender, before we transform it into a factor.

gender <- unique(streamdf$gender)
str(gender)

##  chr [1:2] "F" "M"

There are only two self explanatory values, so we can look at transforming it into a factor, to facilitate the statistical analysis.

Looking at the variable social_metric, before we transform it into a factor.

socialm <- unique(streamdf$social_metric)
str(socialm)

##  int [1:11] 5 7 4 10 1 0 6 8 9 3 ...

in this case we can see 11 values, though there is no clear explanations as to what each one of them means; however it could be understood that there is an order to these values, so a new column in the form of smf will be created in parallel, in case we need the numeric values for other uses.

Looking at the variable demographic, before we transform it into a factor.

demog <- unique(streamdf$demographic)
str(demog)

##  int [1:4] 1 3 4 2

In order to check and label the age groups, we need to see what they correspond with, in the age column.

demog1 <- streamdf %>%  dplyr::filter(demographic == 1)
min(demog1$age)

## [1] 18

max(demog1$age)

## [1] 35

unique(demog1$gender)

## [1] "F"

As we can see, the demographic groups are divided by age and gender,

demog3 <- streamdf %>%  dplyr::filter(demographic == 3)
min(demog3$age)

## [1] 36

max(demog3$age)

## [1] 55

unique(demog3$gender)

## [1] "F"

with group: - 1 being Females between 18 and 35 - 2 are Males between 18 and 35 - 3 are Females, 36 years or older - 4 Males 36 years or older. Let’s keep and factor these groups and add a new ordered factor column with just the “younger” and “older” tags, for easier comparison further down the track, although further studies could be made with smaller age groups.

And finally, the variable group, before we transform it into a factor.

controlg <- unique(streamdf$group)
str(controlg)

##  chr [1:2] "A" "B"

Which we are going to turn into a factor column.

streamdf$gender %<>% factor(., 
                            levels = gender)
streamdf$smf <- streamdf$social_metric %>% factor(., 
                                                  levels = c(0:10),
                                                  ordered = TRUE)
streamdf$demographic %<>% factor(., 
                            levels = c(1:4),
                            labels = c("F 35 or less", "M 35 or less", "F 36+", "M 36+"))
streamdf$young_old <- ifelse(streamdf$age < 36, "younger", "older")
streamdf$young_old %<>% factor(., c("younger", "older"), ordered = TRUE)
streamdf$group %<>% factor(., 
                            levels = controlg)
str(streamdf)

## 'data.frame':    1000 obs. of  10 variables:
##  $ date             : chr  "1-Jul" "1-Jul" "1-Jul" "1-Jul" ...
##  $ gender           : Factor w/ 2 levels "F","M": 1 1 1 2 2 2 1 2 2 2 ...
##  $ age              : int  28 32 39 52 25 51 53 42 41 20 ...
##  $ social_metric    : int  5 7 4 10 1 0 5 6 8 7 ...
##  $ time_since_signup: num  19.3 11.5 4.3 9.5 19.5 22.6 4.2 8.5 16.9 23 ...
##  $ demographic      : Factor w/ 4 levels "F 35 or less",..: 1 1 3 4 2 4 3 4 4 2 ...
##  $ group            : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
##  $ hours_watched    : num  4.08 2.99 5.74 4.13 4.68 3.4 3.07 2.77 2.24 5.39 ...
##  $ smf              : Ord.factor w/ 11 levels "0"<"1"<"2"<"3"<..: 6 8 5 11 2 1 6 7 9 8 ...
##  $ young_old        : Ord.factor w/ 2 levels "younger"<"older": 1 1 2 2 1 2 2 2 2 1 ...

Let’s turn our attention to the date series and format it correctly, so that we can gain some insights from this, as well as, perhaps divide the set into two blocks, before and after the cut off date.

datevar <- unique(streamdf$date)
datevar

##  [1] "1-Jul"  "2-Jul"  "3-Jul"  "4-Jul"  "5-Jul"  "6-Jul"  "7-Jul"  "8-Jul" 
##  [9] "9-Jul"  "10-Jul" "11-Jul" "12-Jul" "13-Jul" "14-Jul" "15-Jul" "16-Jul"
## [17] "17-Jul" "18-Jul" "19-Jul" "20-Jul" "21-Jul" "22-Jul" "23-Jul" "24-Jul"
## [25] "25-Jul" "26-Jul" "27-Jul" "28-Jul" "29-Jul" "30-Jul" "31-Jul"

As we can see, there is data prior the 18th of July, thus we can look at the trend change. Given that the test was run only in the month of July, we can transform from text to date format, using the only the date number.

streamdf$dayf <- as.Date(streamdf$date, '%d-%b')
#streamdf$date %<>% format(., '%d-%b')
str(streamdf$dayf)

##  Date[1:1000], format: "2022-07-01" "2022-07-01" "2022-07-01" "2022-07-01" "2022-07-01" ...

After several trials, the original format was kept and a new dayf column was added, with a datetime format. Checking the data before performing an analysis.

qplot(streamdf$dayf, streamdf$hours_watched, alpha = 0.3)

there is no obvious trend visible here.

Looking at the scale of numeric variables. First we select the data we need for this operation.

streamdfnum <- streamdf %>% 
  dplyr::select(., c("age", "social_metric", "time_since_signup", "hours_watched"))
head(streamdfnum, 2)

With a quick summary:

streamdfnum %>% summary()

##       age        social_metric    time_since_signup hours_watched  
##  Min.   :18.00   Min.   : 0.000   Min.   : 0.00     Min.   :0.500  
##  1st Qu.:28.00   1st Qu.: 2.000   1st Qu.: 5.70     1st Qu.:3.530  
##  Median :36.00   Median : 5.000   Median :11.80     Median :4.415  
##  Mean   :36.49   Mean   : 4.911   Mean   :11.97     Mean   :4.393  
##  3rd Qu.:46.00   3rd Qu.: 8.000   3rd Qu.:18.70     3rd Qu.:5.322  
##  Max.   :55.00   Max.   :10.000   Max.   :24.00     Max.   :8.300

We can see that the age and social_metric variables are as expected, though we should look at their distributions. First for the age histogram,

hist(streamdfnum$age, breaks = 30, main = "Histogram of ages", xlab = "Age", col = "darkgrey")

which we could say it resembles an uniform distribution, as we could expect from the variable, although it seems to have slightly lower numbers within the older cohort. The social_metric histogram,

hist(streamdfnum$social_metric, main = "Histogram of social metrics", 
     xlab = "Social metric", col = "darkgrey")

it shows another uniform distribution, with the extreme values having larger deviation from the norm, particularly the over representation of the lower cohort, being roughly double the norm. Now let’s look at the time_since_signup histogram,

hist(streamdfnum$time_since_signup, main = "Histogram of customer loyalty", 
     xlab = "Time since singup (in months)", col = "darkgrey")

which once again, seems to display a uniform distribution, which is what it would be expected from a company with a steady customer base. If we look at the outliers for this variable,

boxsignup <- streamdfnum$time_since_signup %>% 
  boxplot(main = "Time since signup boxplot", ylab = "", col = "grey")

there are none on display.

Repeating the above for hours_watched.

hist(streamdfnum$hours_watched, main = "Histogram of hours watched", 
     xlab = "Hours watched", col = "darkgrey")

In this case, we find a normal distribution, much as it could be expected. If we look for outliers:

boxhourswatched <- streamdfnum$hours_watched %>% 
  boxplot(main = "Hours watched boxplot", ylab = "", col = "grey")

We can observe some of them, so let’s find out exactly how many.

boxhourswatched$out

## [1] 0.79 0.50 0.50 0.80 8.30

As we can see, there are 5 outliers most of them in the lower usage part, no action will be taken at this stage, until we see how they affect correlations, or if they affect our control group in higher numbers than to the rest of the population.

Testing mean differences between the groups.

Checking the overall numbers for the group variable, based a subset dataframe from the test start date.

cutoff1 <- c("18-Jul", "19-Jul","20-Jul","21-Jul","22-Jul","23-Jul","24-Jul","25-Jul","26-Jul","27-Jul","28-Jul","29-Jul","30-Jul","31-Jul")
# somehow subsetting with the date format has proven to be diffcult
streamdfAB <- streamdf[streamdf$date %in% cutoff1, -1]
head(streamdfAB, 2)

if we have a look at the summary of this subset.

summary(streamdfAB)

##  gender       age        social_metric    time_since_signup       demographic 
##  F:192   Min.   :18.00   Min.   : 0.000   Min.   : 0.00     F 35 or less: 87  
##  M:260   1st Qu.:27.00   1st Qu.: 2.000   1st Qu.: 5.50     M 35 or less:128  
##          Median :36.00   Median : 5.000   Median :11.10     F 36+       :105  
##          Mean   :36.53   Mean   : 4.991   Mean   :11.66     M 36+       :132  
##          3rd Qu.:46.00   3rd Qu.: 8.000   3rd Qu.:18.62                       
##          Max.   :55.00   Max.   :10.000   Max.   :24.00                       
##                                                                               
##  group   hours_watched        smf        young_old        dayf           
##  A:332   Min.   :0.500   8      : 56   younger:215   Min.   :2022-07-18  
##  B:120   1st Qu.:3.664   2      : 54   older  :237   1st Qu.:2022-07-21  
##          Median :4.543   6      : 50                 Median :2022-07-24  
##          Mean   :4.510   1      : 45                 Mean   :2022-07-24  
##          3rd Qu.:5.410   3      : 43                 3rd Qu.:2022-07-28  
##          Max.   :8.300   9      : 40                 Max.   :2022-07-31  
##                          (Other):164

Overall, we can see that there is a larger proportion of Males, whereas for the rest of the variables, except demographic which is dependent from gender, the values seem to be more evenly spread. Group B is just over one third the size of A overall

In order to compare means for the hours_watched , which is the main obkective of the experiment, we are going to run an ANOVA test for both groups, which requires to test a number of assumptions.

One of them, that we know will impact the test result is that there should be an equal number of observations, in this case clearly incorrect. Even in this case, we can still run the ANOVA test for both groups, if we below assumptions are proven correct; keeping in mind that we have reduced the statistical power, which is the probability that a test will detect some effect when there actually is one, and thus we might have to gain confidence in the result with another methods.

Checking if the distributions are still normal,

histAB <- ggplot(streamdfAB) + 
  geom_histogram(aes(hours_watched, fill = group)) + 
  facet_grid(~group) + 
    labs(x = "Hours watched", 
       y = '', 
       title = "A/B histogram of hours watched")
histAB

still displaying normality, we can also see that the group B produces a slight skewness to the left, though with lower count numbers, reflecting the size of the group.

Both seem to follow normal distributions, though the showcase the stark differences in sample size.

boxABhw <- ggplot(streamdfAB) + 
  geom_boxplot(aes(x=group, y=hours_watched, fill = group)) + 
    labs(x = "", 
       y = 'Hours watched', 
       title = "A/B boxplot of hours watched")
boxABhw

We can test the normality of distribution for each group with the Shapiro-Wilk test, where the Null Hypothesis says data is drawn from a normal distribution, versus the Alternative Hypothesis says it comes from some other type of distribution.

shapiro.test(streamdfAB$hours_watched[streamdfAB$group == "A"])

## 
##  Shapiro-Wilk normality test
## 
## data:  streamdfAB$hours_watched[streamdfAB$group == "A"]
## W = 0.99603, p-value = 0.5716

shapiro.test(streamdfAB$hours_watched[streamdfAB$group == "B"])

## 
##  Shapiro-Wilk normality test
## 
## data:  streamdfAB$hours_watched[streamdfAB$group == "B"]
## W = 0.98985, p-value = 0.5208

Given that the p-value is greater than 0.05 in all cases, there is insufficient evidence to conclude that the data are drawn from some other than a normal distribution.

We also know from the data collection stage that the groups are independent. check the homogeneity of variance with Levene’s test. Here, the Null hypothesis (H0) is that the variance among the groups is equal and the alternative hypothesis (HA) the contrary.

leveneTest(hours_watched ~ group, data = streamdfAB, center = mean)

The p-value of the test is 0.938, which is more than our significance level of 0.05, thus, we cannot reject the null hypothesis and conclude that the variance among the groups is equal.

Running ANOVA, using the function aov(), where the Null hypothesis (H0) is that the means of the groups is equal and the alternative hypothesis (HA) being the contrary.

anovaquick <- aov(hours_watched ~ group, data = streamdfAB)
summary(anovaquick)

##              Df Sum Sq Mean Sq F value  Pr(>F)   
## group         1   14.7  14.741   7.907 0.00514 **
## Residuals   450  838.9   1.864                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the output we can see that the scenario is statistically significant at the 0.05 significance level, with the p-value = 0.00514.

In other words, there is a statistically significant difference between the watched_hours that results from the groups being exposed to different recommender systems.

If we run a Kruskal-Wallis Test [1], as if the variances were not equal, which is considered the non-parametric equivalent to a one-way ANOVA.

kruskal.test(hours_watched ~ group, data = streamdfAB)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  hours_watched by group
## Kruskal-Wallis chi-squared = 8.2644, df = 1, p-value = 0.004043

The result reaffirms the previous one slighlty decreasing the p-value to 0.004.

From this, we can say that there is a significant statistical difference in the amount of hours watched by customers, based solely in the performance between the new and old recommender systems.

Searching for bias in the data.

First we subset and summarise for group A:

streamdfA <- streamdfAB[streamdfAB$group =="A", ]
#head(streamdfA, 2)
summary(streamdfA)

##  gender       age        social_metric    time_since_signup       demographic
##  F:163   Min.   :18.00   Min.   : 0.000   Min.   : 0.100    F 35 or less:74  
##  M:169   1st Qu.:26.00   1st Qu.: 2.000   1st Qu.: 5.575    M 35 or less:96  
##          Median :35.00   Median : 5.000   Median :11.050    F 36+       :89  
##          Mean   :35.66   Mean   : 4.907   Mean   :11.726    M 36+       :73  
##          3rd Qu.:45.00   3rd Qu.: 8.000   3rd Qu.:18.500                     
##          Max.   :55.00   Max.   :10.000   Max.   :24.000                     
##                                                                              
##  group   hours_watched        smf        young_old        dayf           
##  A:332   Min.   :0.500   2      : 41   younger:170   Min.   :2022-07-18  
##  B:  0   1st Qu.:3.572   8      : 41   older  :162   1st Qu.:2022-07-21  
##          Median :4.360   1      : 34                 Median :2022-07-25  
##          Mean   :4.402   6      : 34                 Mean   :2022-07-24  
##          3rd Qu.:5.272   3      : 32                 3rd Qu.:2022-07-28  
##          Max.   :8.300   9      : 30                 Max.   :2022-07-31  
##                          (Other):120

From this we can see that the values in most of the variables are proportionally spread, so we will check possible mean divergences, if we find variances within the control group.

Now for B:

streamdfB <- streamdfAB[streamdfAB$group =="B", ]
#head(streamdfA, 2)
summary(streamdfB)

##  gender      age        social_metric    time_since_signup       demographic
##  F:29   Min.   :18.00   Min.   : 0.000   Min.   : 0.00     F 35 or less:13  
##  M:91   1st Qu.:31.00   1st Qu.: 3.000   1st Qu.: 5.15     M 35 or less:32  
##         Median :39.50   Median : 5.000   Median :11.35     F 36+       :16  
##         Mean   :38.94   Mean   : 5.225   Mean   :11.47     M 36+       :59  
##         3rd Qu.:47.00   3rd Qu.: 8.000   3rd Qu.:18.98                      
##         Max.   :55.00   Max.   :10.000   Max.   :23.70                      
##                                                                             
##  group   hours_watched        smf       young_old       dayf           
##  A:  0   Min.   :1.525   6      :16   younger:45   Min.   :2022-07-18  
##  B:120   1st Qu.:3.939   8      :15   older  :75   1st Qu.:2022-07-21  
##          Median :4.860   4      :14                Median :2022-07-24  
##          Mean   :4.811   2      :13                Mean   :2022-07-24  
##          3rd Qu.:5.723   1      :11                3rd Qu.:2022-07-28  
##          Max.   :7.930   3      :11                Max.   :2022-07-31  
##                          (Other):40

In this case, there are a couple of differences in gender, with “Males” being three times the observations of “Females”, plus the age proportion being 50% more in the older cohort, which differs from the overall set, where we could see a peak in the lower range in the uniform distribution.

Checking for gender bias

So let’s see if these proportions have an effect, by testing the mean values, based in those variables. Checking if the distributions are still normal,

histBgen <- ggplot(streamdfB) + 
  geom_histogram(aes(hours_watched, fill = gender)) + 
  facet_grid(~gender)
histBgen

Both seem to follow normal distributions, though the showcase the stark differences in sample size.

boxBgen <- ggplot(streamdfB) + 
  geom_boxplot(aes(x=gender, y=hours_watched, fill = gender))
boxBgen

shapiro.test(streamdfB$hours_watched[streamdfB$gender == "F"])

## 
##  Shapiro-Wilk normality test
## 
## data:  streamdfB$hours_watched[streamdfB$gender == "F"]
## W = 0.95871, p-value = 0.3054

shapiro.test(streamdfB$hours_watched[streamdfB$gender == "M"])

## 
##  Shapiro-Wilk normality test
## 
## data:  streamdfB$hours_watched[streamdfB$gender == "M"]
## W = 0.99224, p-value = 0.8761

Given that the p-values are greater than 0.05 in all cases, there is insufficient evidence to conclude that the data are drawn from some other than a normal distribution; although we can see the difference in the sample size with the lower value for “Females”.

Checking the homogeneity of variance with Levene’s test, where the Null hypothesis (H0) is that the variance among the groups is equal and the alternative hypothesis (HA) the contrary.

leveneTest(hours_watched ~ gender, data = streamdfB, center = mean)

The p-value is more than our significance level of 0.05, thus, we cannot reject the null hypothesis and conclude that the variance among the groups is equal.

Running ANOVA, using the function aov(), where the Null hypothesis (H0) is that the means of the groups is equal and the alternative hypothesis (HA) being the contrary.

anovaBgender <- aov(hours_watched ~ gender, data = streamdfB)
summary(anovaBgender)

##              Df Sum Sq Mean Sq F value Pr(>F)
## gender        1   0.33  0.3326   0.187  0.666
## Residuals   118 209.91  1.7789

The p-value is more than our significance level of 0.05, thus, we cannot reject the null hypothesis and conclude that the means among the groups is equal.

So we can say that, although there is a difference in proportions within the gender variable, there is no induced bias in the way the new algorithm works.

Checking for age group variance.

Now, if we perform the same checks for young_old, given that the proportion of older customers is 50% higher. Checking if the distributions are still normal,

histByo <- ggplot(streamdfB) + 
  geom_histogram(aes(hours_watched, fill = young_old)) + 
  facet_grid(~young_old)
histByo

boxByo <- ggplot(streamdfB) + 
  geom_boxplot(aes(x=young_old, y=hours_watched, fill = young_old)) 
boxByo

Making sure about this normality with the Shapiro-Wilk test, where the Null Hypothesis says data is drawn from a normal distribution, versus the Alternative Hypothesis says it comes from some other type of distribution.

shapiro.test(streamdfB$hours_watched[streamdfB$young_old == "younger"])

## 
##  Shapiro-Wilk normality test
## 
## data:  streamdfB$hours_watched[streamdfB$young_old == "younger"]
## W = 0.99069, p-value = 0.9729

shapiro.test(streamdfB$hours_watched[streamdfB$young_old == "older"])

## 
##  Shapiro-Wilk normality test
## 
## data:  streamdfB$hours_watched[streamdfB$young_old == "older"]
## W = 0.97931, p-value = 0.2576

Given that the p-values are greater than 0.05 in all cases, there is insufficient evidence to conclude that the data are drawn from some other than a normal distribution.

Checking the homogeneity of variance with Levene’s test, where the Null hypothesis (H0) is that the variance among the groups is equal and the alternative hypothesis (HA) the contrary.

leveneTest(hours_watched ~ young_old, data = streamdfB, center = mean)

The p-value is more than our significance level of 0.05, thus, we cannot reject the null hypothesis and conclude that the variance among the groups is equal, although it can be worth checking further, and against the group A values to compare this effect, given that the value is close to our significance level.

Running ANOVA, where the Null hypothesis (H0) is that the means of the groups is equal and the alternative hypothesis (HA) being the contrary. \[H_0: \mu_1 = \mu_2 \]

\[H_A: \mu_1 \ne \mu_2\]

anovaByo <- aov(hours_watched ~ young_old, data = streamdfB)
summary(anovaByo)

##              Df Sum Sq Mean Sq F value   Pr(>F)    
## young_old     1  59.19   59.19   46.24 4.57e-10 ***
## Residuals   118 151.05    1.28                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the output we can see that the scenario is statistically significant at the 0.05 significance level, with the p-value = 4.57e-10.

If we run a Kruskal-Wallis Test [1], as if the variances were not equal, given how close the p-value was in the Leneve Test.

kruskal.test(hours_watched ~ young_old, data = streamdfB)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  hours_watched by young_old
## Kruskal-Wallis chi-squared = 33.236, df = 1, p-value = 8.164e-09

This confirms the ANOVA result. In other words, there is a statistically significant difference between the watched_hours that results from their age.

Checking if we encounter the same situation with the group A, where the age proportions are more balanced. Checking if the distributions are still normal,

histAyo <- ggplot(streamdfA) + 
  geom_histogram(aes(hours_watched, fill = young_old)) + 
  facet_grid(~young_old)
histAyo

boxAyo <- ggplot(streamdfA) + 
  geom_boxplot(aes(x=young_old, y=hours_watched, fill = young_old))
boxAyo

shapiro.test(streamdfA$hours_watched[streamdfB$young_old == "younger"])

## 
##  Shapiro-Wilk normality test
## 
## data:  streamdfA$hours_watched[streamdfB$young_old == "younger"]
## W = 0.99376, p-value = 0.8667

shapiro.test(streamdfA$hours_watched[streamdfB$young_old == "older"])

## 
##  Shapiro-Wilk normality test
## 
## data:  streamdfA$hours_watched[streamdfB$young_old == "older"]
## W = 0.99343, p-value = 0.4788

Given that the p-values are greater than 0.05 in all cases, there is insufficient evidence to conclude that the data are drawn from some other than a normal distribution.

Checking the homogeneity of variance with Levene’s test, where the Null hypothesis (H0) is that the variance among the groups is equal and the alternative hypothesis (HA) the contrary.

leveneTest(hours_watched ~ young_old, data = streamdfA, center = mean)

The p-value is more than our significance level of 0.05, thus, we cannot reject the null hypothesis and conclude that the variance among the groups is equal. Running ANOVA, where the Null hypothesis (H0) is that the means of the groups is equal and the alternative hypothesis (HA) being the contrary.

anovaAyo <- aov(hours_watched ~ young_old, data = streamdfA)
summary(anovaAyo)

##              Df Sum Sq Mean Sq F value Pr(>F)    
## young_old     1  194.8  194.78   148.2 <2e-16 ***
## Residuals   330  433.8    1.31                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the output we can see that the scenario is statistically significant at the 0.05 significance level, with the p-value < 2e-16.

In other words, there is a statistically significant difference between the watched_hours that results from the groups taking into consideration their age, even though the effect does not change under different recommender systems.

Executives’ questions

• Is there any bias in the data collection?

From the previous test, run with the variables that have noticeable discrepancies in their proportions, we could see that there is no clear bias in the collection, better said that the data gathered in different proportions has no effect in testing the algorithm’s performance.

• How could any bias be corrected?

If the proportions of the different factors affected the test, a way to correct this could be to utilise similar size samples, or to find out what is the minimun group size for each of the variables, and allocate new tests in compliance with this quotas.

• What improvements could be made to future A/B tests?

One of the important comparative measurements could be to have individual customer data, instead of from a general pool, for the usage prior being introduced to the new algorithm, particularly regarding our control group, that way we could also compare performance, within the group, as well as the overall trend change.
Another interesting factor to consider could be the variability between the weekdays and weekends or public holidays, which we have to discern from the chart, but it could be given with a factor; as these particular days should see a natural increase in the hours watched.
Further studies could be done if we introduced a variable for special events, like public holidays or new shows or movie launches, which might naturally introduce higher viewing times, not to mention seasonal variabilities.
Increasing robustness of the testing with equal, and large enough, sample size, across variables.

Analysis

After looking at the individual variables, we are going to dive into combinations and trends, to analyse the set.

One of the take away points we have seen is the fact that the younger cohort watches more hours in average.

boxyo <- ggplot(streamdf) + 
  geom_boxplot(aes(x=young_old, y=hours_watched, fill = young_old)) + 
    labs(x = "", 
       y = 'Hours watched', 
       title = "Boxplot of hours watched grouped by age")
boxyo

Even though they have similar variability. As we have the actual ages as a numeric variable, we can check the effect it has on the hours_watched with linear regression.

qplot(streamdf$age, streamdf$hours_watched) + 
    labs(x = "Age", 
       y = 'Hours watched', 
       title = "Hours watched vs age")

If we create a simple linear model.

agelr = lm(hours_watched ~ age, data = streamdf) 

summary(agelr)

## 
## Call:
## lm(formula = hours_watched ~ age, data = streamdf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5696 -0.7474  0.0021  0.7706  2.9391 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.000118   0.123038   56.89   <2e-16 ***
## age         -0.071453   0.003236  -22.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.093 on 998 degrees of freedom
## Multiple R-squared:  0.3282, Adjusted R-squared:  0.3275 
## F-statistic: 487.5 on 1 and 998 DF,  p-value: < 2.2e-16

After seeing that there is a correlation, albeit we cannot say it is strong, as the coefficient is far from the 0.8 mark, we can extract the values we need from the model produced by lm, to create the line.

age0 <- coef(agelr)[1]
age1 <- coef(agelr)[2]
cat(paste("intercept = ", age0, " and slope = ", age1))

## intercept =  7.00011813767946  and slope =  -0.0714526979575582

Plot the fit of the line over the data.

# setup x variable with range of interest
x_agelr <- seq(min(streamdf$age), max(streamdf$age), 1)

# calculate the fitting function yfit based on the coefficients from the model
y_agelr <- age0 + age1 * x_agelr

ggagelm <- ggplot() + 
  geom_point(aes(x = streamdf$age, y = streamdf$hours_watched)) + 
  geom_line(aes(x = x_agelr, y = y_agelr), colour = 'red') + 
    labs(x = "Age", 
       y = 'Hours watched', 
       title = "Hours watched vs age with linear model")
ggagelm

Calculating the residuals to the straight line fit.

age_df <- streamdf[,c(3,8)]

age_df$y_hat <- age0 + age1 * age_df$age

# calculate the residual
age_df$error <- age_df$hours_watched - age_df$y_hat

ageres <- ggplot() + 
  geom_point(aes(x = age_df$age, y = age_df$error)) + 
    labs(x = "Age", 
       y = 'Error', 
       title = "Residuals plot age and error")
ageres

Difficult to see in this plot if the residuals are centered. We can also check that they are normally distributed.

qqnorm(age_df$error)
qqline(age_df$error, col = 2)

It fits pretty well the theoretical line, except for the top end quantiles.

F-testing of the adequacy of the linear model

Now determine the number of degrees of freedom associated with SST and how they are split between the error and regression model.

age_SST <- 1000 - 1 # total number of degrees of freedom

# One independent variable in the model, therefore df associated with SSR is 1 
age_SSR <- 1

# degrees of freedom for error term
age_SSE <- age_SST - age_SSR
cat(paste("total number of degrees of freedom = ", age_SST, 
          " and df associated with SSR = ", age_SSR, " , therefore degrees of freedom for error term = ", age_SSE))

## total number of degrees of freedom =  999  and df associated with SSR =  1  , therefore degrees of freedom for error term =  998

Then use the F test to test the null hypothesis, which is that the model does not provide an adequate fit to the data.

age_mean <- mean(age_df$hours_watched)

# calculate the sum of squares terms
SSRage <- sum( (age_df$y_hat - age_mean)^2 )
SSEage <- sum( (age_df$hours_watched - age_df$y_hat)^2 )

# calculate the F statistic
f_score_age <- (SSRage/age_SSR) / (SSEage/age_SSE)

# convert to p value
p_value_age <- pf(f_score_age, age_SSR, age_SSE, lower.tail = FALSE)
print(paste('F =', f_score_age))

## [1] "F = 487.451763253224"

print(paste('p-value =', p_value_age))

## [1] "p-value = 2.83182641298255e-88"

# compare to the in-built function in R
summary(agelr)

## 
## Call:
## lm(formula = hours_watched ~ age, data = streamdf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5696 -0.7474  0.0021  0.7706  2.9391 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.000118   0.123038   56.89   <2e-16 ***
## age         -0.071453   0.003236  -22.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.093 on 998 degrees of freedom
## Multiple R-squared:  0.3282, Adjusted R-squared:  0.3275 
## F-statistic: 487.5 on 1 and 998 DF,  p-value: < 2.2e-16

Thus we can be confident that there is a statistically significant simple linear relationship.

Goodness of fit (R squared)

Calculate the SSR/SST

# Read off Multiple R-squared in the summary or calculate as follows
SSTage <- SSRage + SSEage
R_squared_age <- SSRage/SSTage
R_squared_age

## [1] 0.3281505

This tells us that 33% of the variation is the y variable is explained by the variation in the x variable, which indicates that the simple linear regression model provides a good fit to the data.

Looking at wider trends in the data.

First, we should focus on the overall trend for the hours watched; for this, we need to subset and sum the hours watched per day.

hwtrend <- streamdf %>% 
  dplyr::select(., c("dayf", "hours_watched"))

hwtrendstats <- hwtrend %>% group_by(dayf) %>% 
    dplyr::summarise(total_hw = sum(hours_watched),
            avg_hw = mean(hours_watched),
            sd_hw = sd(hours_watched)
  )
head(hwtrendstats, 2)

With the overall trend being.

hwsmoothtot <- ggplot(hwtrend, aes(dayf, hours_watched)) + 
  geom_smooth(method = lm, formula = y ~ splines::bs(x, 3), binwidth=30, colour = "red") + 
  ylab("Frequency") + 
  xlab("Date") + # Year and Month #ideally but the transformation was not working 
  theme_bw() + 
  theme(axis.text.x = element_text(angle=90) ) + 
  ggtitle("Total hours watched time series")
hwsmoothtot

There is a naturally increasing amount of hours watched, but if we look more in detail, with a slight inflection point before the 18th of July, if we plot the totals per day. If we want to make sure that this is the case.

hwavg <- ggplot(hwtrendstats) + 
  geom_smooth(aes(dayf, avg_hw), binwidth=31, colour = "red", se = FALSE) + 
  geom_smooth(aes(dayf, avg_hw + sd_hw), binwidth=31, colour = "darkgrey", se = FALSE) + 
  geom_smooth(aes(dayf, avg_hw - sd_hw), binwidth=31, colour = "darkgrey", se = FALSE) + 
  geom_line(aes(dayf, avg_hw), colour = "blue") + 
  geom_vline(xintercept = as.Date("2022-07-18"), 
             linetype="dashed", color = "blue", size=1.5) + 
  ylab("Frequency") + 
  xlab("Date") + # Year and Month #ideally but the transformation was not working 
  theme_bw() + 
  theme(axis.text.x = element_text(angle=90) ) + 
  ggtitle("Average hours watched time series")
hwavg

When we add one standard deviation, we can see the trend clearly, although less sharply than if we neglected those boundaries.

hwavg2 <- ggplot(hwtrendstats) + 
  geom_smooth(aes(dayf, avg_hw), binwidth=31, colour = "red", se = FALSE) + 
  geom_line(aes(dayf, avg_hw), colour = "blue") + 
  geom_vline(xintercept = as.Date("2022-07-18"), 
             linetype="dashed", color = "blue", size=1.5) + 
  ylab("Frequency") + 
  xlab("Date") + # Year and Month #ideally but the transformation was not working 
  theme_bw() + 
  theme(axis.text.x = element_text(angle=90) ) + 
  ggtitle("Average hours watched time series")
hwavg2

There seems to be an event on the 8th or 9th of July that generated higher than viewing time, which disturbed the differences in the overall trend.

Using a correlation matrix to find more dependencies.

Plotting the corrplot we can see other relationships between numeric variables.

res <- cor(streamdfnum)
corrplot(res, method = 'number') # colorful number

Besides the age, we can see a small correlation between the social_metric and the hours_watched, if we plot the values.

qplot(streamdfnum$social_metric, streamdfnum$hours_watched)

And then create a simple linear model.

socslr = lm(hours_watched ~ social_metric, data = streamdfnum) 

summary(socslr)

## 
## Call:
## lm(formula = hours_watched ~ social_metric, data = streamdfnum)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7986 -0.8364  0.0264  0.8577  3.6902 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.88366    0.07881   49.28  < 2e-16 ***
## social_metric  0.10373    0.01370    7.57 8.48e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.297 on 998 degrees of freedom
## Multiple R-squared:  0.0543, Adjusted R-squared:  0.05335 
## F-statistic:  57.3 on 1 and 998 DF,  p-value: 8.481e-14

There is a weak correlation between the hours_watched and the social_metric, though given the number of factors, and also the fact that we have no deep knowledge as to how these values are derived or assigned, there will be no further exploration of this relationship.

Reference

[1] Zach (2021). *How to Perform an ANOVA with Unequal Sample Sizes*. Viewed June 14, 2022, from statology.org:< https://www.statology.org/anova-unequal-sample-size/ >

Assessment 3: Business case and presentation

MATH2406 Applied Analytics

Samuel Klett Navarro (s3039156)

Last updated: 16 June, 2022

Overview

Introduction

Problem Statement

Setup

The sample data

Loading the data

Data manipulation and single variable discussion

Testing mean differences between the groups.

Searching for bias in the data.

Checking for gender bias

Checking for age group variance.

Executives’ questions

Analysis

Goodness of fit (R squared)

Calculate the SSR/SST

Looking at wider trends in the data.

Using a correlation matrix to find more dependencies.

Reference