Libraries

Load Streaming data

The following chunk will load the pre-cleaned Stream data.

# load stream dataset

'data.frame': 1000 obs. of 8 variables:
 $ date             : chr "01-Jul" "01-Jul" "01-Jul" "01-Jul" ...
 $ gender           : num 0 0 0 1 1 1 0 1 1 1 ...
 $ age              : int 28 32 39 52 25 51 53 42 41 20 ...
 $ social_metric    : int 5 7 4 10 1 0 5 6 8 7 ...
 $ time_since_signup: num 19.3 11.5 4.3 9.5 19.5 22.6 4.2 8.5 16.9 23 ...
 $ demographic      : int 1 1 3 4 2 4 3 4 4 2 ...
 $ group            : chr "A" "A" "A" "A" ...
 $ hours_watched    : num 4.08 2.99 5.74 4.13 4.68 3.4 3.07 2.77 2.24 5.39 ...

     date               gender           age        social_metric
 Length:1000        Min.   :0.000   Min.   :18.00   Min.   : 0.000
 Class :character   1st Qu.:0.000   1st Qu.:28.00   1st Qu.: 2.000
 Mode  :character   Median :1.000   Median :36.00   Median : 5.000
                    Mean   :0.571   Mean   :36.49   Mean   : 4.911
                    3rd Qu.:1.000   3rd Qu.:46.00   3rd Qu.: 8.000
                    Max.   :1.000   Max.   :55.00   Max.   :10.000

 time_since_signup  demographic       group           hours_watched
 Min.   : 0.00     Min.   :1.000   Length:1000        Min.   :0.500
 1st Qu.: 5.70     1st Qu.:2.000   Class :character   1st Qu.:3.530
 Median :11.80     Median :3.000   Mode  :character   Median :4.415
 Mean   :11.97     Mean   :2.603                      Mean   :4.393
 3rd Qu.:18.70     3rd Qu.:4.000                      3rd Qu.:5.322
 Max.   :24.00     Max.   :4.000                      Max.   :8.300

[1] 0

[1] 0

'data.frame': 880 obs. of 6 variables:
 $ gender           : num 0 0 0 1 1 1 0 1 1 1 ...
 $ age              : int 28 32 39 52 25 51 53 42 41 20 ...
 $ social_metric    : int 5 7 4 10 1 0 5 6 8 7 ...
 $ time_since_signup: num 19.3 11.5 4.3 9.5 19.5 22.6 4.2 8.5 16.9 23 ...
 $ demographic      : int 1 1 3 4 2 4 3 4 4 2 ...
 $ hours_watched    : num 4.08 2.99 5.74 4.13 4.68 3.4 3.07 2.77 2.24 5.39 ...


<img src="A321222v3_files/figure-html/Scatter matrix and correlation-1.png" width="672" />

Call:
lm(formula = hours_watched ~ age + social_metric, data = A)

Residuals:
    Min      1Q  Median      3Q     Max
-3.6244 -0.6361 -0.0271  0.6988  2.8773

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    6.535941   0.137147  47.657  < 2e-16 ***
age           -0.072279   0.003262 -22.157  < 2e-16 ***
social_metric  0.084869   0.011619   7.305 6.25e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.037 on 877 degrees of freedom
Multiple R-squared:  0.3883,    Adjusted R-squared:  0.3869
F-statistic: 278.3 on 2 and 877 DF,  p-value: < 2.2e-16
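The model above regresses hours_watched on age and social_metric. A minimal sketch of the same kind of fit is shown below. It runs on simulated data rather than the real subset `A`, so the data frame `sim` and the coefficients used to generate it are assumptions for illustration only.

```r
# Simulated stand-in for the real data; all values are illustrative.
set.seed(1)
n <- 200
sim <- data.frame(
  age = sample(18:55, n, replace = TRUE),
  social_metric = sample(0:10, n, replace = TRUE)
)
sim$hours_watched <- 6.5 - 0.07 * sim$age + 0.08 * sim$social_metric +
  rnorm(n, sd = 1)

# Fit the two-predictor linear model and inspect the summary.
fit <- lm(hours_watched ~ age + social_metric, data = sim)
print(summary(fit))
coefs <- coef(fit)
```

Replacing `sim` with the real data frame should give a summary in the same layout as the output above.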

# Checking assumptions of Linear Models

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-3.5142 -0.7242 -0.0030  0.7474  3.0046

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.980111   0.126542   55.16   <2e-16 ***
x           -0.073137   0.003356  -21.79   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.067 on 878 degrees of freedom
Multiple R-squared:  0.3511,    Adjusted R-squared:  0.3503
F-statistic: 475 on 1 and 878 DF,  p-value: < 2.2e-16




Plot the fit over the data


gg <- ggplot()
gg <- gg + geom_point(aes(x = x, y = y, colour ='hours_watched'))
gg <- gg + geom_line(aes(x = xfit1, y = yfit1), colour = 'black')
gg <- gg + labs(x = 'age', y = 'hours_watched')
gg

The hours_watched vs age graph of the actual observations and the fitted line (fit1) shows that the errors are spread evenly around the fitted line across the age range. This is consistent with the assumption of constant variance.
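A residuals-versus-fitted plot is a common complement to the visual check described above. The sketch below uses simulated data (the variables `x` and `y` here are stand-ins, not the report's data) and adds a crude numeric comparison of residual spread in the lower and upper halves of the fitted values. The half-split ratio is an illustrative device, not a formal test.

```r
# Simulated data with constant error variance (illustrative only).
set.seed(2)
x <- runif(300, 18, 55)
y <- 7 - 0.07 * x + rnorm(300, sd = 1)
fit <- lm(y ~ x)

# Residuals against fitted values: a patternless horizontal band
# is what constant variance looks like. Drawn only in interactive use.
if (interactive()) {
  plot(fitted(fit), resid(fit), xlab = "Fitted", ylab = "Residual")
  abline(h = 0, lty = 2)
}

# Crude numeric check: ratio of residual standard deviations in the
# lower and upper halves of the fitted values (close to 1 is good).
lower <- fitted(fit) <= median(fitted(fit))
sd_ratio <- sd(resid(fit)[lower]) / sd(resid(fit)[!lower])
print(sd_ratio)
```

A ratio far from 1, or a funnel shape in the plot, would suggest the variance changes with the fitted value.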

# check normal distribution of the error term
<img src="A321222v3_files/figure-html/unnamed-chunk-5-1.png" width="672" /><img src="A321222v3_files/figure-html/unnamed-chunk-5-2.png" width="672" />
The Q-Q plot of the hours_watched residuals (actual minus expected) shows that the residuals are approximately normally distributed, as the model assumes.
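The normality check can also be sketched with base R alone. The example below uses simulated data (an assumption, not the report's residuals), draws the Q-Q plot with `qqnorm` and `qqline`, and adds a Shapiro-Wilk test as a numeric companion to the plot. The Shapiro-Wilk test is an addition for illustration and is not part of the original analysis.

```r
# Simulated regression with normal errors (illustrative only).
set.seed(3)
x <- runif(300, 18, 55)
y <- 7 - 0.07 * x + rnorm(300, sd = 1)
e <- resid(lm(y ~ x))

# Q-Q plot: points close to the reference line indicate normality.
if (interactive()) {
  qqnorm(e)
  qqline(e)
}

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw <- shapiro.test(e)
print(sw)
```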

```{r check-histogram-of-residual}
gg <- ggplot()
gg <- gg + geom_histogram(aes(x = A$e1), bins = 10, colour = "black")
gg <- gg + labs(title = "Residual1 of hours_watched (Actual minus Expected)", x = "Residual1")
gg
```

The histogram supports the conclusion that the hours_watched residuals (actual minus expected) are approximately normally distributed.

The scatter matrix showed a high correlation (0.76) between age and demographic, so only age was included in the multiple regression (MR). Leaving demographic out reduces the multicollinearity issue and supports treating the predictors as independent of one another.
## 
## Call:
## lm(formula = y2 ~ x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7544 -0.8255  0.0273  0.8349  3.7632 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.87785    0.08295  46.747  < 2e-16 ***
## x2           0.09414    0.01449   6.495 1.39e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.294 on 878 degrees of freedom
## Multiple R-squared:  0.04585,    Adjusted R-squared:  0.04476 
## F-statistic: 42.19 on 1 and 878 DF,  p-value: 1.386e-10

Plot the fit over the data

gg <- ggplot()
gg <- gg + geom_point(aes(x = x2, y = y2, colour = 'hours_watched'))
gg <- gg + geom_line(aes(x = xfit2, y = yfit2), colour = 'black')
gg <- gg + labs(x = 'social_metric', y = 'hours_watched')
gg

The hours_watched vs social_metric graph of the actual observations and the fitted line (fit2) shows that the errors are spread evenly around the fitted line. This is consistent with the assumption of constant variance.

The Q-Q plot of the hours_watched residuals (actual minus expected) shows that the residuals are approximately normally distributed, as the model assumes.

```{r check-histogram-of-residual-2}
gg <- ggplot()
gg <- gg + geom_histogram(aes(x = A$e2), bins = 10, colour = "black")
gg <- gg + labs(title = "Residual2 of hours_watched (Actual minus Expected)", x = "Residual2")
gg
```

The histogram supports the conclusion that the hours_watched residuals (actual minus expected) are approximately normally distributed.

The scatter matrix showed a high correlation (0.76) between age and demographic, so only age was included in the multiple regression (MR). Leaving demographic out reduces the multicollinearity issue and supports treating the predictors as independent of one another.
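One way to quantify the multicollinearity discussed above is the variance inflation factor (VIF), which for a given predictor equals 1 / (1 - R²) from regressing that predictor on the others. The sketch below computes it by hand on simulated data. The `demographic` variable here is a hypothetical continuous score built to correlate with age, not the report's four-level demographic code.

```r
# Simulated predictors (illustrative only).
set.seed(4)
n <- 500
age <- sample(18:55, n, replace = TRUE)
# Hypothetical demographic score constructed to correlate with age.
demographic <- age / 10 + rnorm(n, sd = 0.8)
social_metric <- sample(0:10, n, replace = TRUE)

print(cor(age, demographic))

# VIF for age when demographic is kept in the model ...
r2_full <- summary(lm(age ~ demographic + social_metric))$r.squared
vif_age <- 1 / (1 - r2_full)

# ... and when demographic is dropped.
r2_reduced <- summary(lm(age ~ social_metric))$r.squared
vif_age_reduced <- 1 / (1 - r2_reduced)

print(c(with_demographic = vif_age, without_demographic = vif_age_reduced))
```

A VIF near 1 after dropping the correlated predictor indicates that the remaining predictors carry little shared information.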

    date gender age social_metric time_since_signup demographic group hours_watched
1 01-Jul      0  28             5              19.3           1     A          4.08
2 01-Jul      0  32             7              11.5           1     A          2.99
3 01-Jul      0  39             4               4.3           3     A          5.74
4 01-Jul      1  52            10               9.5           4     A          4.13
5 01-Jul      1  25             1              19.5           2     A          4.68
6 01-Jul      1  51             0              22.6           4     A          3.40

       date gender age social_metric time_since_signup demographic group hours_watched
995  31-Jul      0  23             3               5.1           1     A         5.820
996  31-Jul      1  47             8              14.1           4     B         4.135
997  31-Jul      1  42             1              20.2           4     A         3.350
998  31-Jul      1  28             2               3.0           2     A         6.290
999  31-Jul      1  25             2               9.1           2     A         2.670
1000 31-Jul      1  28             4              16.1           2     A         3.790

    date gender age social_metric time_since_signup demographic group hours_watched
1 01-Jul      0  28             5              19.3           1     A          4.08
2 01-Jul      0  32             7              11.5           1     A          2.99
3 01-Jul      0  39             4               4.3           3     A          5.74
4 01-Jul      1  52            10               9.5           4     A          4.13
5 01-Jul      1  25             1              19.5           2     A          4.68
6 01-Jul      1  51             0              22.6           4     A          3.40

      date gender age social_metric time_since_signup demographic group hours_watched
543 17-Jul      0  42             0               9.4           3     A          2.98
544 17-Jul      1  30             9              17.1           2     A          5.89
545 17-Jul      0  25             8               7.0           1     A          5.92
546 17-Jul      0  22             4              20.1           1     A          3.99
547 17-Jul      1  55             9              19.4           4     A          3.22
548 17-Jul      1  32             5               4.4           2     A          6.30

[1] 4.296259

[1] 1.290134

    date gender age social_metric time_since_signup demographic group hours_watched
1 18-Jul      1  45             7               2.4           4     A          4.00
2 18-Jul      0  50             2               6.6           3     A          3.66
3 18-Jul      1  39             1              18.7           4     A          3.58
4 18-Jul      0  18             9              10.5           1     A          6.64
5 18-Jul      0  52             2               5.3           3     A          3.36
6 18-Jul      0  47             6              13.8           3     A          2.83

      date gender age social_metric time_since_signup demographic group hours_watched
327 31-Jul      0  26             9               7.5           1     A          6.03
328 31-Jul      0  23             3               5.1           1     A          5.82
329 31-Jul      1  42             1              20.2           4     A          3.35
330 31-Jul      1  28             2               3.0           2     A          6.29
331 31-Jul      1  25             2               9.1           2     A          2.67
332 31-Jul      1  28             4              16.1           2     A          3.79

[1] 4.401928

[1] 1.378099

    date gender age social_metric time_since_signup demographic group hours_watched
1 18-Jul      0  39             5              14.8           3     B         3.740
2 18-Jul      1  45             0               2.2           4     B         2.635
3 18-Jul      0  28             8               1.4           1     B         5.980
4 18-Jul      1  53             4               8.2           4     B         2.975
5 18-Jul      1  45             8               9.1           4     B         3.965
6 19-Jul      0  31             5               0.6           1     B         6.900

      date gender age social_metric time_since_signup demographic group hours_watched
115 31-Jul      0  33             3              20.0           1     B         5.540
116 31-Jul      1  36             4              10.0           4     B         2.165
117 31-Jul      1  23             4              11.9           2     B         7.610
118 31-Jul      1  34             2               8.7           2     B         3.730
119 31-Jul      1  41             2              20.4           4     B         1.525
120 31-Jul      1  47             8              14.1           4     B         4.135

[1] 4.810875

[1] 1.32919

# Minimum Sample size

[1] "Min sample size 41"
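The inputs behind the reported minimum sample size are not shown, so the following is only a sketch of one common approach: `power.t.test` from base R's stats package, solved for the per-group sample size of a two-sample t-test. The effect size, standard deviation, significance level and power below are hypothetical placeholders and need not reproduce the figure above.

```r
# Hypothetical inputs: detect a 0.8-hour difference in mean
# hours_watched, assuming a standard deviation of about 1.3,
# at a 5% significance level with 80% power.
res <- power.t.test(delta = 0.8, sd = 1.3, sig.level = 0.05, power = 0.8,
                    type = "two.sample", alternative = "two.sided")

# Round up, since a sample size must be a whole number of users.
n_min <- ceiling(res$n)
print(paste("Min sample size per group", n_min))
```

Smaller target differences or larger standard deviations push the required sample size up quickly, because the size scales with the square of sd / delta.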

The dataset was checked previously and it was confirmed that it is clean and ready for use.

Check the streaming behaviour between older people and younger people for the whole dataset

gender

# Check age distribution

Effect of changing the percentage that will be set aside for the control group.

Smaller percentages increase the size of the treatment group, but decrease the potential number of groups to investigate later.

As an example a 10% allocation to the control group is shown below.

## [1] "Number of groups for females: 1"

A clear split in this data occurs where hours_watched is 7.

A midway age split for the cluster above the 7-hour line is about 28, while for the lower cluster it is about 45.

## [1] "Ademographic 1 % = 22.29  Ademographic 2 % = 28.92  Ademographic 3 % = 26.81  Ademographic 4 % = 21.99"
## [1] "Bdemographic 1 % = 10.83  Bdemographic 2 % = 26.67  Bdemographic 3 % = 13.33  Bdemographic 4 % = 49.17"
## Check group balances

# WNW set up the A/B groups; double-check that the A/B groups represent the total population.

The hypothesis tests will be done on each demographic subgroup because groups A and B are biased samples that do not represent the whole population.
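The imbalance in the demographic percentages printed above can be checked formally with a chi-square test of independence. The counts below are reconstructed from those percentages, assuming group sizes of 332 (A) and 120 (B), which are the last row indices in the printed tails of the two groups. Treat both the reconstruction and the choice of test as an illustrative sketch rather than the report's own method.

```r
# Demographic counts per group, reconstructed from the reported
# percentages under the assumed group sizes (A = 332, B = 120).
counts <- rbind(
  A = c(demo1 = 74, demo2 = 96, demo3 = 89, demo4 = 73),
  B = c(demo1 = 13, demo2 = 32, demo3 = 16, demo4 = 59)
)

# Row percentages, for comparison with the printed values.
print(round(100 * prop.table(counts, margin = 1), 2))

# Chi-square test: a small p-value means the demographic mix
# differs between groups A and B.
bal <- chisq.test(counts)
print(bal)
```

A significant result here supports running the hypothesis tests within each demographic subgroup instead of on the pooled groups.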

## [1] 0.3361421
## [1] 0.1182893
## [1] -0.1009064
## [1] -0.1447543
## 
##  Welch Two Sample t-test
## 
## data:  AM2demo1$hours_watched and BM2demo1$hours_watched
## t = -2.4398, df = 22.641, p-value = 0.02296
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.00921595 -0.08263436
## sample estimates:
## mean of x mean of y 
##  5.199459  5.745385
## 
##  Welch Two Sample t-test
## 
## data:  AM2demo2$hours_watched and BM2demo2$hours_watched
## t = -2.5703, df = 58.333, p-value = 0.01274
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.0583304 -0.1316696
## sample estimates:
## mean of x mean of y 
##   5.11125   5.70625
## 
##  Welch Two Sample t-test
## 
## data:  AM2demo3$hours_watched and BM2demo3$hours_watched
## t = -1.8148, df = 19.896, p-value = 0.08466
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.42117216  0.09902328
## sample estimates:
## mean of x mean of y 
##  3.559551  4.220625
## 
##  Welch Two Sample t-test
## 
## data:  AM2demo4$hours_watched and BM2demo4$hours_watched
## t = -3.0385, df = 116.3, p-value = 0.002936
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.9774378 -0.2060333
## sample estimates:
## mean of x mean of y 
##  3.687671  4.279407

Effect = difference in mean hours_watched (group B minus group A) per demographic, using data from 18 July onwards.

##    Name    Effect
## 1 demo1 0.5459252
## 2 demo2 0.5950000
## 3 demo3 0.6610744
## 4 demo4 0.5917355
## [1] "WA final effect of hours_watched is 0.5975"
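The per-demographic effects in the table can be recomputed from the subgroup means reported in the Welch t-test output. The sketch below does that and then combines them with `weighted.mean`. The weights used for the reported final effect are not shown, so the equal weights here are a hypothetical placeholder and will not reproduce that figure exactly.

```r
# Subgroup means copied from the Welch t-test output (x = A, y = B).
mean_A <- c(demo1 = 5.199459, demo2 = 5.11125,
            demo3 = 3.559551, demo4 = 3.687671)
mean_B <- c(demo1 = 5.745385, demo2 = 5.70625,
            demo3 = 4.220625, demo4 = 4.279407)

# Effect per demographic: mean of group B minus mean of group A.
effect <- mean_B - mean_A
print(data.frame(Name = names(effect), Effect = unname(effect)))

# Hypothetical equal weights; replace with each demographic's share
# of the target population to obtain a population-weighted effect.
w <- c(demo1 = 0.25, demo2 = 0.25, demo3 = 0.25, demo4 = 0.25)
wa_effect <- weighted.mean(effect, w)
print(round(wa_effect, 4))
```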