Introduction

This project is based on data from an online movie streaming company who have just conducted an A/B test on their users to analyze whether the new recommendation system they trialed can result in an increase in the amount of hours watched.

Competition between the streaming titans is heating up and the competition for user engagement through increasingly well targeted content has never been higher. An increase in user engagement means more hours of streaming and plenty of opportunities for advertising revenue and therefore company profit.

The streaming company has built up a healthy customer base with good content offerings but is continually refining their recommendation engine to provide better recommendations to customers. Better recommendations improve user engagement and importantly increase the average hours watched per user per day, a key metric used to price ads for 3rd party marketing companies. The company executives want to know if the new recommendation engine algorithm is worth rolling out to all their subscribers. They want to analyse the results from a recent change they made in their recommendation engine and present the findings to the executive team.

They are also interested in any other insights you may learn from the sample data.

In particular, they are curious about the following:

• Is there any bias in the data collection?

• How could any bias be corrected?

• What improvements could be made to future A/B tests?

The sample data shows the effect of an A/B test conducted to measure the effectiveness of a change to the recommendation engine used on some subscribers, but not others. The change to the recommendation went live at 1-minute past midnight on the 18th of July. Those customers who were unknowingly using the new recommendation engine to suggest what to watch next are labelled as group B, while group A was used as a control group.

options(warn= -1)
library("PerformanceAnalytics")
library(ggplot2)
library(dplyr)
library(cowplot)
library(ggthemes)
library(reshape)
library(readr)
library(plotly)
library(corrplot)
library(RColorBrewer)
library(gplots)
df <- read_csv("streaming_data.csv")
head(df)

## # A tibble: 6 x 8
##   date  gender   age social_metric time_since_sign~ demographic group
##   <chr> <chr>  <dbl>         <dbl>            <dbl>       <dbl> <chr>
## 1 01/07 F         28             5             19.3           1 A    
## 2 01/07 F         32             7             11.5           1 A    
## 3 01/07 F         39             4              4.3           3 A    
## 4 01/07 M         52            10              9.5           4 A    
## 5 01/07 M         25             1             19.5           2 A    
## 6 01/07 M         51             0             22.6           4 A    
## # ... with 1 more variable: hours_watched <dbl>

dim(df)

## [1] 1000    8

Information about the data set

The data contains 8 columns and 1000 rows of data. The data set has the following fields (columns):

• date in format dd/mm

• gender of the customer

• age of the customer

• social metric, which is a combined metric based on previous viewing habits

• number of months since the customer signed up

• demographic number

• group (A/B) where A is the control and B is the treated group

• number of hours watched in that day

#Lets create a frequency table to assess the number of people in each group (A or B)
table(df$group)

## 
##   A   B 
## 880 120

# Group A has 880 people and Group B has 120 people
880/1000 #Calculate what percentage of the total participants are in group A

## [1] 0.88

120/1000 #Calculate what percentage of the total participants are in group B

## [1] 0.12

#Making a new data frame vector for the count and proportion of Group A and Group B
count.data <- data.frame(
  Group = c("A", "B"),
  n = c(880,120),
  prop = c(88, 12)
)
count.data

##   Group   n prop
## 1     A 880   88
## 2     B 120   12

# Add label position
count.data <- count.data %>%
  arrange(desc(Group)) %>%
  mutate(lab.ypos = cumsum(prop) - 0.5*prop)
count.data

##   Group   n prop lab.ypos
## 1     B 120   12        6
## 2     A 880   88       56

# Building a pie chart
# Picking colors
mycols <- c("#0073C2FF", "#CD534CFF")

#Bar plot of counts in each group
GroupsBarPlot = ggplot(data=df, aes(group)) +
  geom_bar(fill=mycols)+
  ylab("Number of participants")+
  xlab("Group")+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
  ggtitle("Number of participants by group")+
  geom_text(stat='count',aes(label=..count..), vjust=2.1, nudge_y=0.9)+
  theme(plot.title = element_text(hjust=0.5))


#Making the pie chart
# Add in title and maybe a legend
GroupsPiePlot = ggplot(count.data, aes(x = "", y = prop, fill = Group)) +
  geom_bar(width = 1, stat = "identity", color = "white") +
  coord_polar("y", start = 0)+
  geom_text(aes(y = lab.ypos, label = paste(prop,"%")), color = "white")+
  scale_fill_manual(values = mycols) +
  theme_void()+
  ggtitle("Percentage breakdown of participants\nby group")+
  theme(plot.title = element_text(hjust=0.5))

# Plot both charts on the same grid (Graph 1)
plot_grid(GroupsBarPlot, GroupsPiePlot,nrow = 1, ncol=2)

Graph 1 Findings:

We can see from graph 1 that Group A (the control group) contains 88% of the participants and Group B (the treated group) contains 12% of the participants. Typically for an A/B test you want more people in your treatment group in order to try and maximize profits. Having a higher percentage of participants in the treated group than the control group is the first thing that should be changed for future A/B tests.

Lets now investigate the gender breakdown of the participants:

#Lets repeat this entire process for gender
table(df$gender)

## 
##   F   M 
## 429 571

# There are 429 Females and 571 Males in the dataset
429/1000

## [1] 0.429

571/1000

## [1] 0.571

#Making a new data frame vector for the count and proportion of Group A and Group B
Gender.data <- data.frame(
  Gender = c("Females", "Males"),
  n = c(429,571),
  prop = c(42.9, 57.1)
)
Gender.data

##    Gender   n prop
## 1 Females 429 42.9
## 2   Males 571 57.1

# Add label position
Gender.data <- Gender.data %>%
  arrange(desc(Gender)) %>%
  mutate(lab.ypos = cumsum(prop) - 0.5*prop)
Gender.data

##    Gender   n prop lab.ypos
## 1   Males 571 57.1    28.55
## 2 Females 429 42.9    78.55

# Building a pie chart
# Picking colors
mycols <- c("#0073C2FF", "#CD534CFF")

#Bar plot of counts in each group
GenderBarPlot = ggplot(data=df, aes(gender, fill=gender)) +
  geom_bar(fill=mycols) + #Make a bar chart
  labs(x= "Gender", y= "Number of People", size=1) + #Add axis labels
  geom_text(aes(label= ..count..), stat="count", position=position_dodge(width=0.9), vjust=-0.25) + #Label the bars
  theme(axis.text.x = element_text(size = 8), panel.border = element_rect(linetype = "dashed", fill = NA), 
        panel.background = element_rect(fill='white'), plot.title = element_text(size= 12, hjust=.5)) + 
  #Customize the theme elements
  ggtitle("Number of Males and Females") #Add a title


#Making the pie chart
GenderPiePlot = ggplot(Gender.data, aes(x = "", y = prop, fill = Gender)) +
  geom_bar(width = 1, stat = "identity", color = "white") +
  coord_polar("y", start = 0)+ #Change the shape to a pie
  geom_text(aes(y = lab.ypos, label = prop), color = "white")+ #Add the label we created earlier
  scale_fill_manual(values = mycols) +
  theme_void() + #Void the theme
  theme(plot.title = element_text(size= 12, hjust=-0.4), axis.text.y = element_text(size = 6)) + #Change title size and position
  ggtitle("Percentage breakdown of Males and Females") #Add title

# Plot 2
# Plot both charts on the same grid
plot_grid(GenderBarPlot, GenderPiePlot,nrow = 1, ncol=2)

Graph 2 Findings:

We can see from graph 2 that there are more males participants than females, however the percentages are fairly close to being 50/50.

Lets now investigate the gender breakdown per group:

#Make a bar chart for both genders by group
#Make a pie chart for both genders by group

table(df$gender, df$group)

##    
##       A   B
##   F 400  29
##   M 480  91

#We can see there is a large gender imbalance in the treatment group (Group B), there are only 29 females yet there are 91 males
400/1000

## [1] 0.4

29/1000

## [1] 0.029

480/1000

## [1] 0.48

91/1000

## [1] 0.091

#Making a new data frame vector for the count and proportion of Group A and Group B
GenGroup.data <- data.frame(
  GenderGroup = c("Male Group A", "Female Group A", "Male Group B", "Female Group B"  ),
  n = c(480, 400, 91, 29),
  prop = c(48, 40, 9.1, 2.9)
)
GenGroup.data

##      GenderGroup   n prop
## 1   Male Group A 480 48.0
## 2 Female Group A 400 40.0
## 3   Male Group B  91  9.1
## 4 Female Group B  29  2.9

# Add label position
GenGroup.data <- GenGroup.data %>%
  arrange(desc(GenderGroup)) %>%
  mutate(lab.ypos = cumsum(prop) - 0.5*prop)
GenGroup.data

##      GenderGroup   n prop lab.ypos
## 1   Male Group B  91  9.1     4.55
## 2   Male Group A 480 48.0    33.10
## 3 Female Group B  29  2.9    58.55
## 4 Female Group A 400 40.0    80.00

#Vector of colours to change to
GenGroupBarCols <- c("#CD534CFF", "#868686FF", "#EFC000FF", "#0073C2FF")


#Bar chart of all four gender groups
GenGroupBar = ggplot(data=GenGroup.data, aes(x=GenderGroup, y=n)) +
  geom_col(fill=GenGroupBarCols) + #Make a column chart
  labs(x = "Gender Group", y= "Number of People", size=1) + #Label axis'
  geom_text(size=3.5,aes(label=n), position=position_dodge(width=0.9), vjust=-0.25) + #Label the bars
  theme(axis.text.x = element_text(size = 8, angle=45, vjust=1.2, hjust=1.1), panel.border = element_rect(linetype = "dashed", fill = NA), 
        panel.background = element_rect(fill='white'), plot.title = element_text(size= 12, hjust=.5)) + #customize theme
  ggtitle("Number of People by Gender in each Group") #Add title


#Vector of colours to change to
GenGroupCols <- c("#0073C2FF", "#EFC000FF", "#868686FF", "#CD534CFF")

# Pie chart of all four gender groups
GenGroupPiePlot = ggplot(GenGroup.data, aes(x = "", y = prop, fill = GenderGroup)) +
  geom_bar(width = 1, stat = "identity", color = "white") +
  coord_polar("y", start = 0)+ #Create the pie shape
  geom_text(aes(y = lab.ypos, label = prop), color = "black", size=4)+ #Use labels we created earlier
  scale_fill_manual(values = GenGroupCols) + #Customize colours of the pie
  theme_void() + #Void theme
  theme(plot.title = element_text(size= 12, hjust=-0.4), axis.text.y = element_text(hjust=0.5,size = 4.5)) + #Customize theme elements
  ggtitle("Percentage breakdown of Gender Groups") #Add title

# Plot both charts on the same grid
#This is the two plots of all four gender groups
# Plot 3
plot_grid(GenGroupBar, GenGroupPiePlot,nrow = 1, ncol=2)

Graph 3 Findings:

We can see that the number of males in group A and females in group A is comparable, which is good. However more importantly we can see there are more than three times the number of males in group B (the treatment group) than there are females in group B. This means that group B is heavily biased towards observing how men respond to the new recommendation algorithm. Additionally, identifying any major gender differences in response to the new recommendation algorithm will be difficult to detect due to the small number of females in group B. This is the first bias in the data

Compare the spread of hours watched between Groups A & B

GroupA <- filter(df, group=='A')
GroupB <- filter(df, group=='B')

summary(GroupA$hours_watched)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.500   3.487   4.355   4.336   5.250   8.300

summary(GroupB$hours_watched)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.525   3.939   4.860   4.811   5.723   7.930

We can see that on average Group B watches roughly an extra half hour more than group A.

Examine the relationship between group & hours watched, by age

# Plot 4.0
ggplot(data=df, aes(x=age,y=hours_watched, colour=group))+
  geom_point(aes(x=age,y=hours_watched))+
  geom_smooth(method="lm") + #Add linear regression line
  labs(y="Number of hours watched", x="User Age (in years)") + #Label axis'
  ggtitle("Number of Hours Watched by User Age per Group") + #Add title
  scale_y_continuous(breaks=seq(0, 9, 0.5), sec.axis = dup_axis()) + #Set y axis tick marks from 0 to 9 every 0.5
  scale_x_continuous(breaks=seq(0, 56, 2)) + # x axis Tick marks from 0-56, every 2
  theme(panel.background = element_rect(fill="#f4edca",color="pink"), #Customize theme elements
        plot.background = element_rect("#46dbdf"),
        plot.title = element_text(hjust= 0.5), 
        panel.border = element_rect(linetype = "dashed", fill = NA))

## `geom_smooth()` using formula 'y ~ x'

Graph 4 Findings:

Here we get our first look at the effects of the A/B test and we can see that on average, Group B (the treated group) watches 45 minutes more than Group A (the control group), which is the desired outcome. The graph also shows us that the number of hours watched is negatively correlated to user age. That is, on average, with an increase in user age comes a decrease in the amount of hours watched, irrespective of group. We can also notice that there are no 22 year olds in Group B, which is how we know the subject groups have not been stratified by age. Now we want to know, does this matter?

# QQ Plot of the population
PopQQPlot <- ggplot(df, aes(sample=age)) +
  stat_qq() + #Add QQ stat
  stat_qq_line(color = "red") + #Make QQPlot line red
  ggtitle("Population age distribution")+ #Add title
  theme(plot.title = element_text(size=11, hjust=.5), panel.border = element_rect(linetype = "dashed", fill=NA), plot.background = element_rect("cornflowerblue"))
##Centre title

AQQPlot <- ggplot(GroupA, aes(sample=age)) +
  stat_qq() +
  stat_qq_line(color = "red") + #Make the QQ Line red
  ggtitle("Group A age distribution")+ #Add title
  theme(plot.title = element_text(size=11, hjust=.5), panel.border = element_rect(linetype = "dashed", fill=NA), plot.background = element_rect("cornflowerblue"))

BQQPlot <- ggplot(GroupB, aes(sample=age)) +
  stat_qq() +
  stat_qq_line(color = "red") + #Make the QQ Line red
  ggtitle("Group B age distribution")+ #Add title
  theme(plot.title = element_text(size=11, hjust=.5), panel.border = element_rect(linetype = "dashed", fill=NA), plot.background = element_rect("cornflowerblue"))
#Centre title

#plot the QQ Plots on the same grid
plot_grid(PopQQPlot, AQQPlot, BQQPlot, nrow = 1, ncol=3)

Graph 5 Findings:

The above Quantile-Quantile plot (graph 5) shows us that groups A and B follow the same distribution as the population, so the answer to our question, does it matter that there are no 22 year olds in Group B, is no, it does not matter as the age distributions of Group A and Group B closely represent the age distribution for the population. Meaning that the increase in the number of hours watched in Group B is not due to any differences in the ages of the two groups (for example if Group B were much younger than Group A, this would impact the relevance of the difference in the amount of hours watched).

Calculating the descriptive statistics for participants grouped by Age & Test Group

AgeSummary = df %>% group_by(age, group) %>%
  summarize(avg_hours = mean(hours_watched), #Mean avg hours for each age  and group
            SD = sd(hours_watched), #Standard deviation for each age and group
            Min = min(hours_watched), #Minimum age for each age and group
            Q1 = quantile(hours_watched,probs =.25), #Quartile 1 for each age and group
            Median = median(hours_watched), #Median for each age and group
            Q3 = quantile(hours_watched,probs = .75), #Quartile 3 for each age and group
            Max = max(hours_watched), #Max age for each age and group
            IQR = IQR(hours_watched), .groups = "keep") #IQR for each age and group. 
#To suppress the warning we add groups = "keep"
AgeSummary #See what the variable looks like

## # A tibble: 75 x 10
## # Groups:   age, group [75]
##      age group avg_hours       SD   Min    Q1 Median    Q3   Max     IQR
##    <dbl> <chr>     <dbl>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
##  1    18 A          5.66  0.936    3.73  5.18   5.80  6.32  6.98 1.13   
##  2    18 B          6.04  1.18     5.21  5.63   6.04  6.46  6.88 0.835  
##  3    19 A          5.60  1.18     3.43  4.95   5.29  6.32  8.3  1.36   
##  4    19 B          6.16  0.00707  6.15  6.15   6.16  6.16  6.16 0.00500
##  5    20 A          5.43  1.19     2.06  4.74   5.64  6.15  7.45 1.41   
##  6    20 B          3.94 NA        3.94  3.94   3.94  3.94  3.94 0      
##  7    21 A          4.91  1.31     1.93  4.28   5.12  5.46  7.4  1.18   
##  8    21 B          6.40  0.969    5.72  6.06   6.40  6.75  7.09 0.685  
##  9    22 A          5.16  0.944    3.44  4.64   4.96  5.76  7.67 1.12   
## 10    23 A          5.46  0.997    3.52  4.74   5.64  5.98  7.52 1.23   
## # ... with 65 more rows

# We've already determined that both groups follow a normal distribution for age

#Convert the dates to just a number
DateAsNumber <- df$date
DateNumber <- substr(DateAsNumber, start=1, stop=2)
df['DateNumber'] <- DateNumber

#Make the column numeric
df$DateNumber<- as.numeric(df$DateNumber)

#Plot 6.0
#This graph shows the difference over the month between group A and B
ggplot(data=df, aes(x=DateNumber, y=hours_watched, colour=group))+
  geom_point(aes(x=DateNumber, y=hours_watched))+ #Make a scatterplot
  geom_smooth(method="lm") + #Add linear regression line
  scale_y_continuous(breaks=seq(0, 9, 0.5), sec.axis = dup_axis()) + #Set y axis tick marks from 0 to 9 every 0.5
  scale_x_continuous(breaks=seq(0, 32, 2)) + #Set the x axis values from 0 to 32 every 2
  labs(y="Number of hours watched", x="Date of the month for July") + #Label the x and y axis'
  ggtitle("Number of Hours Watched by Date of the month (July)") +  #Give a title
  theme(panel.background = element_rect(fill="#f4edca",color="pink"), #Customize the colours
        plot.background = element_rect("#FF6666"),
        plot.title = element_text(hjust= 0.5), #Centre the title, give a dashed border 
        panel.border = element_rect(linetype = "dashed", fill = NA))

## `geom_smooth()` using formula 'y ~ x'

Graph 6 Findings:

This graph (graph 6) highlights another important design flaw in the A/B test; as we can see from our regression lines the treatment group (group B) only contains data from participants on and after the 18th of July. Meaning any natural fluctuations throughout the course of a month cannot be fully observed in the B group. We can also see that users in both groups tend to watch more at the end of the month, the control group (group A) tends to watch about 15 minutes more per day at the end of the month compared to the start of the month. This is about a 5% increase. We can also see that the treatment group (group B) watches an extra half an hour per day more than the control group. This equates to more than a 10% difference.

Create a frequency table of people after the 18th to see how many females there were

After18th <- filter(df, DateNumber>=18)
table(After18th$gender, After18th$DateNumber)

##    
##     18 19 20 21 22 23 24 25 26 27 28 29 30 31
##   F 18 19  9 11 13 12 11 14 11 17 15 16 12 14
##   M 14 13 24 21 19 20 22 18 21 15 18 16 20 19

There are 192 females from the 18th onwards and 260 males from the 18th onwards.We know there are only 29 females in group B, so we now know the participants have definitely not been stratified by gender, even after the 18th of the month.

ABAgeVec <-table(After18th$age, After18th$group)
ABAgeVec

##     
##       A  B
##   18  6  2
##   19 14  2
##   20 17  1
##   21  9  2
##   22  6  0
##   23 12  2
##   24  7  2
##   25  9  2
##   26  7  5
##   27  8  3
##   28 11  3
##   29  6  1
##   30  8  2
##   31 15  4
##   32  8  2
##   33  8  3
##   34 13  5
##   35  6  4
##   36  8  5
##   37  7  4
##   38 10  4
##   39  6  2
##   40  8  1
##   41  9  4
##   42  7  2
##   43  8  2
##   44  7  8
##   45 11  7
##   46  7  5
##   47  7  4
##   48 10  2
##   49  9  3
##   50  9  4
##   51  5  3
##   52 12  4
##   53  7  6
##   54 11  2
##   55  4  3

We can notice that there are no 22 year olds in group B, this is an important point when figuring out how subjects were stratified into their groups. They were clearly not stratified by age as there are six 22 year old’s in group A after the 18th (when the study group began).

Examining the data by the Demographic variable

table(df$demographic)

## 
##   1   2   3   4 
## 216 268 213 303

table(df$gender, df$demographic)

##    
##       1   2   3   4
##   F 216   0 213   0
##   M   0 268   0 303

We can see that there are a comparable number of participants in each demographic but we can see there are no females in demographic 2 or 4 and no males in demographic 1 or 3.

table(df$group, df$demographic)

##    
##       1   2   3   4
##   A 203 236 197 244
##   B  13  32  16  59

We can also see that participants are not evenly split and thus have not been stratified by demographic.

Examining the data by the Social Metric variable

table(df$gender, df$social_metric)

##    
##      0  1  2  3  4  5  6  7  8  9 10
##   F 24 52 40 52 46 34 42 36 41 44 18
##   M 35 57 63 44 64 51 47 50 72 53 35

table(After18th$gender, After18th$social_metric)

##    
##      0  1  2  3  4  5  6  7  8  9 10
##   F 13 22 17 23 14 19 24 14 19 19  8
##   M 13 23 37 20 24 19 26 18 37 21 22

We can see that social metric is fairly evenly split and was likely a block used to stratify subjects. We can also see that after the 18th (when the study began) is still fairly evenly distributed by social metric groups.

Calculating the descriptive statistics

SocialMetricSummary = df %>% group_by(social_metric, group) %>%
  summarize(avg_hours = mean(hours_watched), #Mean avg hours for each social metric and group
            SD = sd(hours_watched), #Standard deviation for each social metric and group
            Min = min(hours_watched), #Minimum age for each social metric and group
            Q1 = quantile(hours_watched,probs =.25), #Quartile 1 for each social metric and group
            Median = median(hours_watched), #Median for each social metric and group
            Q3 = quantile(hours_watched,probs = .75), #Quartile 3 for each social metric and group
            Max = max(hours_watched), #Max age for each social metric and group
            IQR = IQR(hours_watched), .groups = "keep") #IQR for each social metric and group. 
#To suppress the warning we add groups = "keep"
SocialMetricSummary #See what the variable looks like

## # A tibble: 22 x 10
## # Groups:   social_metric, group [22]
##    social_metric group avg_hours    SD   Min    Q1 Median    Q3   Max   IQR
##            <dbl> <chr>     <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
##  1             0 A          3.74  1.28  0.79  3.27   3.57  4.59  6.45  1.32
##  2             0 B          3.36  1.36  2.52  2.58   2.64  3.78  4.92  1.20
##  3             1 A          3.84  1.21  0.8   3.26   3.88  4.58  6.98  1.32
##  4             1 B          4.71  1.35  2.80  3.8    4.43  5.47  6.88  1.67
##  5             2 A          4.22  1.47  0.5   3.22   4.31  5.20  7.67  1.97
##  6             2 B          4.33  1.27  1.52  3.70   4.54  5.23  5.98  1.53
##  7             3 A          4.28  1.29  1.43  3.49   4.33  5.14  7.09  1.65
##  8             3 B          4.04  1.45  1.59  3.02   3.94  5.36  6.33  2.33
##  9             4 A          4.29  1.29  0.5   3.49   4.38  5.28  7.09  1.80
## 10             4 B          4.12  1.57  2.12  3.00   3.82  5.10  7.61  2.10
## # ... with 12 more rows

Now lets plot the means

#Graph 7
ggplot(data=SocialMetricSummary, aes(x= social_metric, y=avg_hours)) +
  geom_point(aes(x= as.factor(social_metric), y=avg_hours, color=group), size=2) +
  geom_text(label=round(SocialMetricSummary$avg_hours,digits=2), size=2.5, nudge_x = 1.25) + #Label each point
  scale_y_continuous(breaks=seq(0, 10, .5), # Tick marks from 0 to 10 every 0.5
                     sec.axis = dup_axis()) + #Make duplicate y axis' 
  labs(y="Number of Hours Watched", x="Social Metric") + #Label axis'
  ggtitle("Mean Number of Hours Watched per Group for each Social Metric") + #Add title
  theme(panel.background = element_rect(fill="#f4edca",color="pink"), #Customize theme elements
        plot.background = element_rect("#56B4E9"),
        plot.title = element_text(hjust= 0.5), 
        panel.border = element_rect(linetype = "dashed", fill = NA))

Graph 7 Findings:

These results are very interesting, graph 7 shows us that the treatment did not increase hours watched for all social metric groups; it increased hours watched for 8 out of the 11 groups, the treatment group watched less hours for social metrics 0, 3 and 4.

Lets interrogate these results further:

Now lets run a significance test on the means from the graph above. Our null hypothesis (H0) is that the mean hours watched for Group B is greater than the mean hours watched of Group A.

t.test(SocialMetricSummary$avg_hours~SocialMetricSummary$group, alt="two.sided", conf=0.95, var.eq=F, paired=F)

## 
##  Welch Two Sample t-test
## 
## data:  SocialMetricSummary$avg_hours by SocialMetricSummary$group
## t = -1.6565, df = 14.146, p-value = 0.1196
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.8974471  0.1148458
## sample estimates:
## mean in group A mean in group B 
##        4.337245        4.728546

Based on the p-value we cannot accept the null hypothesis, the difference between means was not statistically significant enough.

ANOVA

We will now conduct an Analysis of Variance (ANOVA) on the differences among means.

The null nypothesis (H0) is that the means of all groups are equal

aov(df$hours_watched~df$group)

## Call:
##    aov(formula = df$hours_watched ~ df$group)
## 
## Terms:
##                  df$group Residuals
## Sum of Squares    23.8009 1751.6242
## Deg. of Freedom         1       998
## 
## Residual standard error: 1.324815
## Estimated effects may be unbalanced

summary(aov(df$hours_watched~df$group))

##              Df Sum Sq Mean Sq F value   Pr(>F)    
## df$group      1   23.8  23.801   13.56 0.000243 ***
## Residuals   998 1751.6   1.755                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p value is <0.001, therefore we must accept the null hypothesis.

Lets calculate the variances for each group, by Social Metric:

#Calculate the variance for Group A
var(SocialMetricSummary$avg_hours[SocialMetricSummary$group=="A"])

## [1] 0.109472

#Calculate the variance for Group B
var(SocialMetricSummary$avg_hours[SocialMetricSummary$group=="B"])

## [1] 0.5043152

We can see that Group B has five times the variance of Group A.

What is the sample size for each social metric?

# Plot 8
ggplot(data=df, aes(as.factor(social_metric), fill=group)) +
  geom_bar(position="dodge") + #Make the bars not stacked
  scale_y_continuous(breaks=seq(0, 150, 5), sec.axis = dup_axis()) + # Y axis Ticks from 0-150, every 5
  labs(y="Number of People", x="Social Metric") + #Add axis labels
  geom_text(aes(label= ..count..), stat="count", position=position_dodge(width=0.9), vjust=-0.25) + #Label bars
  ggtitle("Number of People per Group for each Social Metric") + #Add title
  theme(panel.background = element_rect(fill="#f4edca",color="pink"), #Customize theme elements, 
        plot.background = element_rect("#59BEC4"), #customize colours
        plot.title = element_text(hjust= 0.5), #centre title, create dashed border
        panel.border = element_rect(linetype = "dashed", fill = NA))

Graph 8 Findings:

We can see from graph 8 that group B has a low sample size for all social metric groups, especially for social metric 0 which was one of the three social metric groups where group B actually had lower hours watched than the control group.

Lets figure out what the minimum sample size needed for statistical significance is

We will use the following equation to calculate the required minimum sample size: \[ n_{ss} = z_{alpha}^2 \left( \frac{ \sigma }{E} \right)^2 \]

In order to calculate a minimum sample size per group \[n_{ss}\]

We need to define a confidence interval \[z_{alpha}^2\]
We need to estimate what the standard deviation is for hours watched \[{\sigma}\]
And we need to estimate what the effect size between groups A and B will be \[{E}\]

It’s important to observe from the above equation that the larger the effect size, the lower sample size needed per group.

How many participants are needed in each group to achieve statistical significance?

summary(GroupA$hours_watched)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.500   3.487   4.355   4.336   5.250   8.300

summary(GroupB$hours_watched)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.525   3.939   4.860   4.811   5.723   7.930

sd(df$hours_watched) #The standard deviation is 1.333118

## [1] 1.333118

#Mean Hours watched of group B divided by mean hours watched of control group - group A
4.811 / 4.336

## [1] 1.109548

# Group B watches 10.9% more hours than Group A

We can see the standard deviation is 1.333118 and that the estimated effect size (the difference in hours watched between group A and group B) is 10.9%. Taking a 95% confidence level we can calculate the minimum sample size as

# Minimum sample size calculation
z_alpha <- 1.96
effect_est <- 0.109
sd_est <- 1.33
n_ss <- ceiling((z_alpha * sd_est / effect_est)^2)
print(paste('Min sample size', n_ss))

## [1] "Min sample size 572"

The minimum sample size is 572 for EACH group, meaning we do not have enough people in group B to achieve statistically significant results.

Now lets calculate what the minimum sample size would be for each social metric, using the same equation as above:

The greatest effect size (remember it’s in terms of percentage) is social metric 1. Therefore we will use social metric 1 to calculate the minimum sample size, with the knowledge that all other social metric groups would need a larger minimum sample size to achieve statistical significance (meaning other social metric groups will need more participants than this).

#Social metric #1 minimum sample size using the population SD
4.71 / 3.84

## [1] 1.226562

# 1.226562

z_alpha3 <- 1.96
effect_est3 <- 0.226562
sd_est3 <- 1.33
n_ss3 <- ceiling((z_alpha3 * sd_est3 / effect_est3)^2)
print(paste('Min sample size', n_ss3))

## [1] "Min sample size 133"

We can see the minimum sample size per group for Social Metric 1 is 133 participants, all other social metrics will need more participants than this.

Probing for relationships in the data:

# subset only numerical columns
M = cor( df[,c(3:5,8,9)], method = "pearson" )
# Make the correlation plot. Plot 8
corrplot(M, type="upper", # To make diagonal matrix 
         order="hclust", # to order based on similarity
         col=brewer.pal(n=8, name="RdYlBu")) # For the colour scale

Graph 9 Findings:

The above plot (graph 9) shows us the Pearson’s correlation co-efficient between numeric groups in our data set. Pearson’s correlation co-efficient shows us the relationship (or lack of) between two numeric variables, it shows us the strength and the direction of the relationship. For example a score of +1 would indicate a 100% positive relationship; as variable X increases, so does variable Y. A -1 score would indicate a 100% negative relationship; for example when variable X increases, variable Y decreases. We can see that there is a moderate to strong negative relationship between hours watched and age, which we have already observed in graph 4. We can see that there is a weak positive relationship between the date of the month and hours watched, which we also previously observed in graph 6. We can also see there is a moderate positive relationship between social metric and hours watched, social metric is a categorical variable rather than a discrete or continuous variable. Typically Pearons correlation is only used for discrete and continuous variables, not categorical variables, however as there are 11 categories of social metric we can still get a good distribution, enough to see that hours watched may be affected by the participants social metric. Lets perform a Two Way ANOVA to test if there is a statistically significant interaction between group and social metric.

Two way ANOVA

anova2 <- aov(hours_watched ~ group + social_metric, data=df)
summary(anova2)

##                Df Sum Sq Mean Sq F value   Pr(>F)    
## group           1   23.8   23.80   14.31 0.000165 ***
## social_metric   1   92.9   92.87   55.82 1.73e-13 ***
## Residuals     997 1658.8    1.66                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results of the two way ANOVA (P-value: <0.001) indicate a non-statistically-significant difference between variances of groups. Meaning that there is no statistically significant interaction between group and social metric.

Now lets finish evaluating the numerical variables.

#Plot 10
#Make chart to view the distribution, regression and pearsons correlation between numerical categories
chart.Correlation(df[,c(3:5,8,9)], histogram=TRUE, pch=19)

Figure 10 Findings:

We can see a number of important things from figure 10.

The pearsons correlation co-efficient of age and hours watched is -0.57, indicating a moderate to strong negative relationship, as we previously mentioned.
The pearsons correlation co-efficient of date and hours watched is only 0.085, indicating there is almost no relationship.
Hours watched is virtually a perfect normal distribution
We’ve already touched on hours watched by social metric
We’ve already touched on the distribution of participants ages
Date number (the date) is virtually perfectly evenly distributed, lets just quickly confirm that:

table(df$DateNumber)

## 
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
## 32 32 32 33 32 32 32 33 32 32 32 33 32 32 32 33 32 32 32 33 32 32 32 33 32 32 
## 27 28 29 30 31 
## 32 33 32 32 33

Now lets make a heat map of Chi-Square Test p-values:

# Plot Heatmap
numCat.Vars = 5 # How many categorical variables in the dataset
idxs = c(1,2,4,6,7)
colnames(df)[idxs]

## [1] "date"          "gender"        "social_metric" "demographic"  
## [5] "group"

Mtx = matrix(0, nrow = numCat.Vars, ncol= numCat.Vars)
colnames(Mtx) = colnames(df)[idxs]
rownames(Mtx) =  colnames(df)[idxs]

# Create matrix of p values
# Dont worry about the warnings
for(i in 1:numCat.Vars){
  x = df[,idxs[i]] 
  for(j in 1:numCat.Vars){
    y = df[,idxs[j]]
    # Perform Chi square stat and get the p-value
    Mtx[i,j] = chisq.test( table( data.frame(x,y)  )  )$p.value
  }
}
# 

#Heat map of Chi-square Test p-values, Figure 11
heatmap.2(Mtx, scale ="none", trace ="none", 
          col = brewer.pal(n=8, name="RdYlBu"), #Customize colours, add title, adjust cex for row and column
          cexRow=0.75, cexCol = 0.75) 
          title(main = "Heatmap of Chi-square Test p-values",cex.main=0.95)

Figure 11 Findings:

This heat map shows the results of the Chi-Square Test of Independence, which determines whether there is an association between categorical variables. The red squares in this heat map indicate there is a relationship between two categorical variables.

As alluded to earlier, there is a relationship between date and what group the participant was in (group A or group B)
Similarly there is a relationship between gender and date
There is a relationship between gender and demographic
There is no relationship between group and social metric

Data Analytics Project 2

Chris Berry

03/02/2021