C. Donovan
We've looked at confidence intervals for means, differences of means, proportions:
We construct the interval such that 95% of such intervals (from repeated samples) would capture the true parameter
We can measure this by integrating the relevant parts of the null distribution, i.e. the probability beyond the \( t_0 \) from our sample
A classical test sets a significance level a priori (e.g. 0.05). If our \( p \)-value is below this, we reject. Equivalently, if our \( t_0 \) is outside the region implied by the significance threshold, reject \( H_0 \).
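As a minimal sketch (the \( t_0 \) and degrees of freedom below are made up for illustration), the two-sided \( p \)-value is just the tail area of the null \( t \) distribution:

```r
# Hypothetical observed statistic and degrees of freedom (illustration only)
t0 <- 2.1
df <- 28

# Two-sided p-value: probability beyond |t0| in either tail of the null t distribution
p_value <- 2 * pt(abs(t0), df = df, lower.tail = FALSE)
p_value  # reject H0 at the 5% level if this falls below 0.05
```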
Moving away from significance thresholds, the \( p \)-value might be considered the strength of evidence against the null hypothesis
Approximate \( p \)-value | Translation |
---|---|
\( >0.12 \) | No evidence against \( H_0 \) |
\( 0.10 \) | Weak evidence against \( H_0 \) |
\( 0.05 \) | Some evidence against \( H_0 \) |
\( 0.01 \) | Strong evidence against \( H_0 \) |
\( \leq 0.001 \) | Very strong evidence against \( H_0 \) |
Interpretation of \( p \)-values – Wild and Seber, p379.
Which brings us to…
For example, the World Health Organisation now lists “processed meats” as a group 1 carcinogen (http://www.who.int/features/qa/cancer-red-meat/en/)
[Images: "This is sausage" / "This is cancer"]
A study aid, for remembering statistical vs practical significance. Partial plug for vegetarianism.
We've looked at the machinery; however, \( t \)-tests occur a lot in practice
These are based on a simple model, and there is a standard process followed (similar to later models)
The standard process:
- Identify a possible \( t \)-test when comparing means
- Test
- Conclude
You've done the test - it assumes things for validity. Mainly:
- The noise/errors are approximately Normally distributed
- The noise variance is the same in each group (homogeneity of variance)
- The observations are independent of one another
These assumptions are common to other models (we're in the class of linear models: week 4)
We want to compare the birthweights of babies from smoking and non-smoking mothers
head(babyData)
bwt gestation parity age height weight smoke
1 120 284 0 27 62 100 0
2 113 282 0 33 64 135 0
3 128 279 0 28 64 115 1
4 123 999 0 36 69 190 0
5 108 282 0 23 67 125 1
6 136 286 0 25 62 93 0
Again, the point estimates:
babyData %>% filter(smoke!=9) %>% group_by(smoke) %>% summarise(mean = mean(bwt), SD = sd(bwt), n = n())
# A tibble: 2 x 4
smoke mean SD n
<int> <dbl> <dbl> <int>
1 0 123. 17.4 742
2 1 114. 18.1 484
NB from previous description, “Smoking status of mother: 0=not now, 1=yes now, 9=unknown”
smokingMothers <- babyData %>% filter(smoke != 9) %>% mutate(smoke = factor(smoke))
p <- ggplot(data = smokingMothers) + geom_histogram(aes(bwt, ..density.., fill = smoke), alpha = 0.5)
p
Now we would go directly to the test
t.test(bwt ~ smoke, data = smokingMothers)
Welch Two Sample t-test
data: bwt by smoke
t = 8.5813, df = 1003.2, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
6.89385 10.98148
sample estimates:
mean in group 0 mean in group 1
123.0472 114.1095
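As a check on the machinery, the Welch statistic can be reproduced by hand from the group summaries printed earlier (those are rounded, so the last digits may differ slightly):

```r
# Rebuild the Welch t statistic from the printed group summaries
m <- c(123.0472, 114.1095)   # group means (non-smokers, smokers)
s <- c(17.4, 18.1)           # group SDs
n <- c(742, 484)             # group sizes

v  <- s^2 / n                           # squared SE of each mean
t0 <- (m[1] - m[2]) / sqrt(sum(v))      # ~8.58, as reported
df <- sum(v)^2 / sum(v^2 / (n - 1))     # Welch-Satterthwaite df, ~1003
c(t0 = t0, df = df)
```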
We conclude there is a significant difference between the groups in terms of underlying mean birth-weights: the \( p \)-value is tiny and the 95% CI for the difference (about 6.9 to 11.0) excludes zero.
How're the assumptions?
First a useful qualitative view - the QQ-norm plot
# what's the noise look like? i.e. what's left after the signal is extracted?
# The 'signal' is just group means - doin' it ugly:
nonsmokers <- smokingMothers %>% filter(smoke == 0) %>% mutate(noise = bwt - mean(bwt))
smokers <- smokingMothers %>% filter(smoke == 1) %>% mutate(noise = bwt - mean(bwt))
allNoise <- c(smokers$noise, nonsmokers$noise)
qqnorm(allNoise)
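A straight reference line (my addition, not in the original code) makes departures easier to judge by eye:

```r
# Add a reference line through the quartiles; roughly linear points suggest Normal noise
qqline(allNoise)
```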
A formal test of the same
shapiro.test(allNoise)
Shapiro-Wilk normality test
data: allNoise
W = 0.99438, p-value = 0.0001468
This is difficult to assess, particularly here: with over 1200 observations the test can detect departures from Normality far too small to matter in practice (note \( W \) is very close to 1), so the small \( p \)-value is not alarming on its own.
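To see why (my illustration, not from the lecture), consider the same mildly heavy-tailed population at two sample sizes; for a fixed departure from Normality the Shapiro-Wilk \( p \)-value shrinks as \( n \) grows:

```r
# Mildly heavy-tailed data (t distribution with 5 df), small vs large samples
set.seed(1)
shapiro.test(rt(50, df = 5))$p.value     # often unremarkable at small n
shapiro.test(rt(1226, df = 5))$p.value   # typically tiny at the n we have here
```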
Homogeneity of variances (of errors/noise)
# could just calculate the groups' bwt SDs:
smokingMothers %>% select(bwt, smoke) %>% group_by(smoke) %>% summarise(sd = sd(bwt))
# A tibble: 2 x 2
smoke sd
<fct> <dbl>
1 0 17.4
2 1 18.1
# but we had these to hand already
sd(smokers$noise)/sd(nonsmokers$noise)
[1] 1.040248
var.test(smokers$noise, nonsmokers$noise)
F test to compare two variances
data: smokers$noise and nonsmokers$noise
F = 1.0821, num df = 483, denom df = 741, p-value = 0.336
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.9213317 1.2745363
sample estimates:
ratio of variances
1.082115
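As a sanity check, the reported \( F \) statistic is just the squared version of the SD ratio we computed above:

```r
# F statistic = ratio of sample variances = (ratio of SDs)^2
var(smokers$noise) / var(nonsmokers$noise)   # ~1.082, as reported
1.040248^2                                   # same number from the printed SD ratio
```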
This is not an esoteric list of requirements
These are the same assumptions for many models, because that model for noise is common.
So, we're happy. However, if there are marked violations:
Usually best to choose a non-parametric variant, or just a more appropriate model (e.g. GLMs)
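For the running example, a standard non-parametric variant of the two-sample \( t \)-test is the Wilcoxon (Mann-Whitney) rank-sum test:

```r
# Rank-based alternative: no Normality assumption on the noise
wilcox.test(bwt ~ smoke, data = smokingMothers)
```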
SEs for proportions: we clearly need SEs for our CIs and tests.
For example: A random sample of 1000 people born in New Zealand and 1000 people born in Scotland. Note: A respondent can't belong to both populations.
\[ se(\hat{p}_1-\hat{p}_2)=\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \]
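A small sketch of this case: the sample sizes of 1000 are from the example, but the \( \hat{p} \) values are hypothetical:

```r
# Independent samples: SE of the difference in proportions
p1 <- 0.62; n1 <- 1000   # hypothetical proportion in the NZ-born sample
p2 <- 0.55; n2 <- 1000   # hypothetical proportion in the Scottish-born sample

se_diff <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
(p1 - p2) + c(-1, 1) * 1.96 * se_diff   # approximate 95% CI for the difference
```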
For example: A random sample of Scots are asked who they are going to vote for in the next election. Note: Respondents slot into ONE category and the proportions add to 1.
\[ se(\hat{p}_1-\hat{p}_2)=\sqrt{\frac{\hat{p}_1+\hat{p}_2-(\hat{p}_1-\hat{p}_2)^2}{n}} \]
For example: A random sample of Scots are asked:
Note: Respondents can slot into MORE THAN ONE category.
\[ se(\hat{p}_1-\hat{p}_2)=\sqrt{\frac{Min(\hat{p}_1+\hat{p}_2, \hat{q}_1 +\hat{q}_2)-(\hat{p}_1-\hat{p}_2)^2}{n}} \]
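The other two situations, again with hypothetical numbers (one sample of size \( n \), and \( \hat{q} = 1 - \hat{p} \)):

```r
n  <- 1000
p1 <- 0.40; p2 <- 0.35   # hypothetical sample proportions

# One sample, mutually exclusive categories (proportions sum to 1)
se_excl <- sqrt((p1 + p2 - (p1 - p2)^2) / n)

# One sample, non-exclusive categories (respondents may tick several boxes)
q1 <- 1 - p1; q2 <- 1 - p2
se_nonexcl <- sqrt((min(p1 + p2, q1 + q2) - (p1 - p2)^2) / n)

c(exclusive = se_excl, non_exclusive = se_nonexcl)
```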
(This is the table I failed to find previously)
In short, estimating small or large \( p \) puts us near the 0/1 boundary, so we need larger samples:
Value for \( \hat{p} \) | 0.05 | 0.1 | 0.15 | 0.2 | 0.25 | 0.3 | 0.35 | 0.4 | 0.45 | 0.5 |
---|---|---|---|---|---|---|---|---|---|---|
(equivalently) | 0.95 | 0.9 | 0.85 | 0.8 | 0.75 | 0.7 | 0.65 | 0.6 | 0.55 | 0.5 |
Minimum \( n \) | 960 | 400 | 220 | 125 | 76 | 47 | 23 | 13 | 11 | 10 |
Better yet - use a GLM, like you will encounter in MT5761.
The type 1 error was mentioned - we elaborate a bit here.
A type 1 error is rejecting \( H_0 \) when it is in fact true. Its probability is set by our chosen significance level, which in turn is related to the width of our confidence intervals.
[there is a type 2 error, later]
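A quick simulation (mine, not from the notes) shows the long-run type 1 error rate matching the chosen significance level:

```r
# 10000 two-sample t-tests where H0 is true (both groups from the same Normal)
set.seed(42)
pvals <- replicate(10000, t.test(rnorm(30), rnorm(30))$p.value)
mean(pvals < 0.05)   # proportion of false rejections; close to 0.05
```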
We've covered:
Next: