The following chunk will load the pre-cleaned Stream data.
# load stream dataset
<img src="A321222v3_files/figure-html/Scatter matrix and correlation-1.png" width="672" />
# Checking assumptions of Linear Models
Plot the fit over the data
gg <- ggplot()
gg <- gg + geom_point(aes(x = x, y = y, colour ='hours_watched'))
gg <- gg + geom_line(aes(x = xfit1, y = yfit1), colour = 'black')
gg <- gg + labs(x = 'age', y = 'hours_watched')
gg
# check normal distribution of the error term
<img src="A321222v3_files/figure-html/unnamed-chunk-5-1.png" width="672" /><img src="A321222v3_files/figure-html/unnamed-chunk-5-2.png" width="672" />
# The Q-Q plot of the residuals of hours_watched (actual minus expected) shows that the residuals are approximately normally distributed, as the model assumes.
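A Q-Q plot of this kind can be produced from the residuals of a fitted `lm` object. The following is a minimal sketch on simulated data; the data frame `sim`, its coefficients, and the seed are assumptions for illustration and are not the Stream dataset.

```r
library(ggplot2)

# Simulated stand-in data (hypothetical; not the Stream dataset)
set.seed(1)
sim <- data.frame(age = runif(200, 18, 60))
sim$hours_watched <- 6 - 0.05 * sim$age + rnorm(200)

# Fit a simple linear model and extract residuals (actual minus fitted)
fit <- lm(hours_watched ~ age, data = sim)
sim$e <- residuals(fit)

# Q-Q plot: points lying close to the reference line suggest approximate normality
qq <- ggplot(sim, aes(sample = e)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Q-Q plot of residuals",
       x = "Theoretical quantiles", y = "Sample quantiles")
qq
```

Systematic curvature away from the line, especially in the tails, would be a sign that the normality assumption is questionable.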
```{r-check histogram of residual}
gg <- ggplot()
gg <- gg + geom_histogram(aes(x = A$e1), bins = 10, colour = "black")
gg <- gg + labs(title = "Residual1 of hours_watched (Actual minus Expected)", x = "Residual1")
gg
```
# The histogram confirms that the residual of hours_watched (Actual minus Expected) is normally distributed.
# The independence assumption is addressed as follows: the scatter matrix showed a high correlation (0.76) between age and demographic, so age was chosen for the multiple regression (MR), which reduces the multicollinearity issue.
##
## Call:
## lm(formula = y2 ~ x2)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.7544 -0.8255  0.0273  0.8349  3.7632
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.87785    0.08295  46.747  < 2e-16 ***
## x2           0.09414    0.01449   6.495 1.39e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.294 on 878 degrees of freedom
## Multiple R-squared: 0.04585, Adjusted R-squared: 0.04476
## F-statistic: 42.19 on 1 and 878 DF, p-value: 1.386e-10
Plot the fit over the data
gg <- ggplot()
gg <- gg + geom_point(aes(x = x2, y = y2, colour = 'hours_watched'))
gg <- gg + geom_line(aes(x = xfit2, y = yfit2), colour = 'black')
gg <- gg + labs(x = 'social_metric', y = 'hours_watched')
gg
# The hours_watched vs social_metric graph of the actual observations and the fitted line (fit1) shows that the errors are evenly distributed around the fitted line. Thus the assumption of constant variance is satisfied.
# The Q-Q plot of the residuals of hours_watched (actual minus expected) shows that the residuals are approximately normally distributed, as the model assumes.
```{r-check histogram of residual}
gg <- ggplot()
gg <- gg + geom_histogram(aes(x = A$e2), bins = 10, colour = "black")
gg <- gg + labs(title = "Residual2 of hours_watched (Actual minus Expected)", x = "Residual2")
gg
```
# Minimum Sample size
# Check age distribution
Allocating a smaller percentage to the control group increases the size of the treatment group, but reduces the number of groups that can be investigated later.
As an example, a 10% allocation to the control group is shown below.
## [1] "Number of groups for females: 1"
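One simple way to make an allocation of this size is a random draw without replacement. The code used in the original analysis is not shown here, so the following is only a sketch; the data frame `users`, its `id` column, and the seed are hypothetical.

```r
# Hypothetical user table; the real data set is not reproduced here
set.seed(42)
users <- data.frame(id = 1:1000)

control_share <- 0.10                            # 10% allocated to the control group
n_control <- round(control_share * nrow(users))

# Draw control ids at random without replacement; everyone else is treatment
control_ids <- sample(users$id, size = n_control)
users$group <- ifelse(users$id %in% control_ids, "control", "treatment")

table(users$group)
```

Lowering `control_share` leaves more users in the treatment group, at the cost of fewer control observations to divide among subgroups later.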
A clear split in this data occurs where hours_watched is 7.
A midway age split for the cluster above the 7-hour line is about 28, while for the lower cluster it is about 45.
## [1] "Ademographic 1 % = 22.29 Ademographic 2 % = 28.92 Ademographic 3 % = 26.81 Ademographic 4 % = 21.99"
## [1] "Bdemographic 1 % = 10.83 Bdemographic 2 % = 26.67 Bdemographic 3 % = 13.33 Bdemographic 4 % = 49.17"
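Percentages like these can be computed with `table()` and `prop.table()`. A minimal sketch on made-up data; the data frame `ab`, its `group` and `demographic` columns, and the values are assumptions for illustration only.

```r
# Made-up example data (hypothetical; not the Stream dataset)
ab <- data.frame(
  group       = c("A", "A", "A", "A", "B", "B", "B", "B"),
  demographic = c(1, 2, 3, 4, 4, 4, 2, 1)
)

# Row-wise proportions: share of each demographic within each group, in percent
pct <- round(100 * prop.table(table(ab$group, ab$demographic), margin = 1), 2)
pct
```

Large differences between the rows for A and B indicate that the two groups do not have the same demographic mix.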
## Check group balances
# WNW set up the A/B groups; double-check that the A/B groups represent the total population.
The hypothesis tests will be run on each subgroup because groups A and B are biased samples that do not represent the whole population.
## [1] 0.3361421
## [1] 0.1182893
## [1] -0.1009064
## [1] -0.1447543
##
## Welch Two Sample t-test
##
## data: AM2demo1$hours_watched and BM2demo1$hours_watched
## t = -2.4398, df = 22.641, p-value = 0.02296
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.00921595 -0.08263436
## sample estimates:
## mean of x mean of y
## 5.199459 5.745385
##
## Welch Two Sample t-test
##
## data: AM2demo2$hours_watched and BM2demo2$hours_watched
## t = -2.5703, df = 58.333, p-value = 0.01274
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.0583304 -0.1316696
## sample estimates:
## mean of x mean of y
## 5.11125 5.70625
##
## Welch Two Sample t-test
##
## data: AM2demo3$hours_watched and BM2demo3$hours_watched
## t = -1.8148, df = 19.896, p-value = 0.08466
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.42117216 0.09902328
## sample estimates:
## mean of x mean of y
## 3.559551 4.220625
##
## Welch Two Sample t-test
##
## data: AM2demo4$hours_watched and BM2demo4$hours_watched
## t = -3.0385, df = 116.3, p-value = 0.002936
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.9774378 -0.2060333
## sample estimates:
## mean of x mean of y
## 3.687671 4.279407
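The output above follows the pattern of one Welch two-sample t-test per demographic subgroup. A minimal sketch of that pattern on simulated data; the data frame `d`, the assumed uplift of 0.6 hours, and the seed are illustrative assumptions and do not reproduce the results above.

```r
# Simulated stand-in data (hypothetical; not the Stream dataset)
set.seed(7)
d <- data.frame(
  group       = rep(c("A", "B"), each = 200),
  demographic = rep(1:4, times = 100)
)
# Group B gets an assumed uplift of 0.6 hours, purely for illustration
d$hours_watched <- 5 + 0.6 * (d$group == "B") + rnorm(nrow(d))

# One Welch t-test (the t.test default, var.equal = FALSE) per demographic
tests <- lapply(split(d, d$demographic), function(s) {
  t.test(s$hours_watched[s$group == "A"], s$hours_watched[s$group == "B"])
})

# Collect the effect (mean of B minus mean of A) and the p-value per subgroup
res <- data.frame(
  Name    = paste0("demo", names(tests)),
  Effect  = sapply(tests, function(t) unname(t$estimate[2] - t$estimate[1])),
  p_value = sapply(tests, function(t) t$p.value)
)
res
```

Testing within each demographic compares like with like, which matters when the demographic mix differs between the two groups.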
##    Name    Effect
## 1 demo1 0.5459252
## 2 demo2 0.5950000
## 3 demo3 0.6610744
## 4 demo4 0.5917355
## [1] "WA final effect of hours_watched is 0.5975"
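A weighted average of per-subgroup effects can be computed with `weighted.mean()`. The sketch below uses the four effects printed above, but the weights are placeholders (equal weights): the weighting actually used to obtain 0.5975 is not shown in this section, so the result here is not expected to match it.

```r
effects <- c(demo1 = 0.5459252, demo2 = 0.5950000,
             demo3 = 0.6610744, demo4 = 0.5917355)

# Placeholder weights (assumption): equal weight per demographic.
# In practice these would be each subgroup's share of the target population.
w <- c(0.25, 0.25, 0.25, 0.25)

wa_effect <- weighted.mean(effects, w)
wa_effect
```

Replacing `w` with the population share of each demographic gives an overall effect that corrects for the unequal demographic mix of the samples.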
$ social_metric : int 5 7 4 10 1 0 5 6 8 7 …