MT5762 Lecture 10

C. Donovan

Recap of lecture 9

  • We've looked at confidence intervals for means, differences of means, proportions:

    • inference arises from the sampling distribution (the distribution of sample means)
    • its variability is estimated via the SE; the shape arises from the CLT
  • We construct the interval such that 95% of such intervals (from repeated samples) would capture the true parameter

Recap of lecture 9

  • We've looked at hypothesis testing
    • we speculate about a sampling distribution for a parameter, based on some null hypothesis, e.g. that it is zero on average
    • other properties of this distribution are estimated from our sample, e.g. variability (using the SE); the shape arises from the CLT
    • we see how “likely” our sample estimate is given this Null Hypothesis distribution

We can measure this by integrating the relevant parts of the null distribution, i.e. the probability beyond the \( t_0 \) from our sample

Recap of lecture 9

[My laptop has a stylus! practice needed]

  • Integrate beyond \( |t_0| \) to get the \( p \)-value (for a two-tailed test; otherwise integrate below \( t_0 \) or above \( t_0 \))
  • OR… your chosen level of significance implies a “rejection zone”, e.g. 0.05 means \( |t_0|>1.96 \) rejects \( H_0 \) (for two-tailed tests with large \( n \))
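As a concrete illustration, both routes can be computed directly in R (a minimal sketch; the observed statistic and degrees of freedom are invented numbers):

  # hypothetical observed test statistic and degrees of freedom
  t0 <- 2.3
  df <- 100

  # route 1: two-tailed p-value - integrate both tails beyond |t0|
  2 * pt(-abs(t0), df = df)   # about 0.024, below 0.05

  # route 2: the rejection zone for alpha = 0.05 - reject H0 if |t0| exceeds this
  qt(0.975, df = df)          # about 1.98; tends to qnorm(0.975) = 1.96 for large n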

Finding the \(p\)-value

  • We use the distribution of the test statistic to find the \( p \)-value
  • The \( p \)-value is the probability, calculated assuming the null hypothesis is true, that sampling variation alone would produce data at least as discrepant as what we observed
  • A small \( p \) means our data are unlikely under the null hypothesis … suggesting it is not the generating process

A classical test sets a significance level a priori (e.g. 0.05). If our \( p \)-value is below this, we reject. Equivalently, if our \( t_0 \) is outside the region implied by the significance threshold, reject \( H_0 \).

Finding the \(p\)-value

Moving away from significance thresholds, the \( p \)-value might be considered the strength of evidence against the null hypothesis

Approximate \( p \)-value   Translation
\( >0.12 \)                 No evidence against \( H_0 \)
\( 0.10 \)                  Weak evidence against \( H_0 \)
\( 0.05 \)                  Some evidence against \( H_0 \)
\( 0.01 \)                  Strong evidence against \( H_0 \)
\( \leq 0.001 \)            Very strong evidence against \( H_0 \)

Interpretation of \( p \)-values – Wild and Seber, p379.

The \(p\)-value

  • While this is a key statistical tool for inference, it attracts criticism
  • The criticism really ought to be focused on its use - slavish adherence to significance levels is not good
  • Read Leek & Peng, and also Baker (both on Moodle).
  • Read parts of the RSS special issue on “the S-word” (link on Moodle)

Which brings us to….

Practical significance versus statistical significance

  • The word 'significant' can often cause confusion – for hypothesis tests a significant result commonly means the \( p \)-value is less than 0.05, not that the result is substantial and of practical importance.
  • Indeed, we define the significance level as part of our test, e.g. 0.05, so another analysis might set this differently
  • Statistical significance relates to the existence of an effect
  • Practical significance relates to the size of an effect

Practical significance versus statistical significance

For example, the World Health Organisation now lists “processed meats” as a group 1 carcinogen (http://www.who.int/features/qa/cancer-red-meat/en/)

  • The study was huge and the results are highly statistically significant
  • However, what does it mean in practical terms - statistical significance is one thing, but the effect may be small
  • One figure was an 18% increase in bowel cancer risk (but how likely was bowel cancer anyway? 18% more of a small number is not scary)
  • You might simply view the cancer risk as having moved from about 5% to about 6% by eating these modestly - so, your call

Practical significance versus statistical significance

[Two images: “This is sausage”; “This is cancer”]

A study aid, for remembering statistical vs practical significance. Partial plug for vegetarianism.

Practical significance versus statistical significance

  • By happy coincidence, this just popped up in the press again yesterday (30/09/19)
  • This highlights the difference in statistical and practical significance some more

Controversy comes back

Comparing means - classic \(t\)-tests

We've looked at the machinery; however, \( t \)-tests occur a lot in practice

These are based on a simple model, and there is a standard process followed (similar to later models)

The \(t\)-test cookbook

Identify a possible \( t \)-test when comparing:

  • means between two populations (represented by 2 samples) - is the true difference zero?
  • the mean of a population (represented by a sample) to a hypothetical value - is the true mean equal to it?
  • if paired, the population of paired differences (represented by the sample) - are these on average zero?
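Each case maps onto one call to R's t.test() (a sketch - myData and its columns are hypothetical placeholders):

  # two independent samples: is the true difference in means zero?
  t.test(response ~ group, data = myData)

  # one sample against a hypothetical value, e.g. mu = 100
  t.test(myData$response, mu = 100)

  # paired data: test the within-subject differences against zero
  t.test(myData$before, myData$after, paired = TRUE)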

Test

  • Generate hypotheses \( H_0 \) and \( H_A \). How likely is our data under \( H_0 \)?
  • Generate the \( t \)-test statistic - its distribution under \( H_0 \) gives the \( p \)-value.

The \(t\)-test cookbook

Conclude

  • Is the \( p \)-value < \( \alpha \)? (often 0.05; note \( \alpha \) implies a cutoff/critical value of the test statistic):
  • if so, reject \( H_0 \) as implausible.
  • if not, fail to reject \( H_0 \) because it is plausible.
  • Check assumptions: normality, independence, homogeneity of variances (for 2 independent samples)

Assumptions

You've done the test - it assumes things for validity. Mainly:

  • Normality (of errors/noise): examine QQ-Norm plots, Shapiro-Wilk test etc
  • Independence (of errors/noise): difficult to assess - examine the design
  • Homogeneity of variances (of errors/noise): rule-of-thumb, formal tests

These assumptions are common to other models (we're in the class of linear models: week 4)

Recall the baby-weight comparison

We want to compare the birthweights of babies from smoking and non-smoking mothers

  head(babyData)
  bwt gestation parity age height weight smoke
1 120       284      0  27     62    100     0
2 113       282      0  33     64    135     0
3 128       279      0  28     64    115     1
4 123       999      0  36     69    190     0
5 108       282      0  23     67    125     1
6 136       286      0  25     62     93     0

Recall the baby-weight comparison

Again - the point estimates

  babyData %>% filter(smoke!=9) %>% group_by(smoke) %>% summarise(mean = mean(bwt), SD = sd(bwt), n = n())
# A tibble: 2 x 4
  smoke  mean    SD     n
  <int> <dbl> <dbl> <int>
1     0  123.  17.4   742
2     1  114.  18.1   484

NB from previous description, “Smoking status of mother: 0=not now, 1=yes now, 9=unknown”

Recall the baby-weight comparison

  smokingMothers <- babyData %>% filter(smoke != 9) %>% mutate(smoke = factor(smoke))

  p <- ggplot(data = smokingMothers) + geom_histogram(aes(bwt, ..density.., fill = smoke), alpha = 0.5) 

  p

[Figure: overlaid density histograms of bwt, coloured by smoking status]

Recall the baby-weight comparison

Now we would go directly to the test

  t.test(bwt ~ smoke, data = smokingMothers)

    Welch Two Sample t-test

data:  bwt by smoke
t = 8.5813, df = 1003.2, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  6.89385 10.98148
sample estimates:
mean in group 0 mean in group 1 
       123.0472        114.1095 

Recall the baby-weight comparison

We conclude there is a significant difference between the groups in terms of underlying mean birth-weights:

  • \( p \)-value is very small (<2.2e-16), well below any reasonable threshold
  • these samples are very different to those implied by \( H_0 \)
  • the range of “plausible values” for the difference is (95% CI): (6.89, 10.98) - so 0/no difference is not plausible

How're the assumptions?

Assumptions - Normality

First a useful qualitative view - the QQ-norm plot

 # what's the noise look like? i.e. what's left after the signal is extracted?
 # The 'signal' is just group means - doin' it ugly:
 nonsmokers <- smokingMothers %>% filter(smoke == 0) %>% mutate(noise = bwt - mean(bwt)) 
 smokers <- smokingMothers %>% filter(smoke == 1) %>% mutate(noise = bwt - mean(bwt))  
 allNoise <- c(smokers$noise, nonsmokers$noise)
 qqnorm(allNoise)

Assumptions - Normality

First a useful qualitative view - the QQ-norm plot

[Figure: QQ-norm plot of allNoise]

  • If it is straight, then the data (here the noise) are quite Normal
  • This looks great actually. Normality of noise - tick.

Assumptions - Normality

A formal test of the same

  shapiro.test(allNoise)

    Shapiro-Wilk normality test

data:  allNoise
W = 0.99438, p-value = 0.0001468
  • It's a test - so an \( H_0 \) is in play
  • \( H_0 \) - the data are Normal
  • \( p \)-value is tiny (0.00015) - reject \( H_0 \) (sad-face?)
  • Not uncommon with lots of data (power to detect small departures is high) - the QQ-Norm indicates pretty Normal, so I'm happy

Assumptions - Independence

This is difficult to assess, particularly here.

  • We could, say, look at the noise in order of collection. Are there patterns?
  • However, we'll rely on sensible data collection and assume this is OK here

Assumptions - equal group variances

Homogeneity of variances (of errors/noise)

  • Rule of thumb - their SDs don't differ by more than a factor of 2. We're good - very similar
  # could calculate the groups' bwt SDs:
  smokingMothers %>% select(bwt, smoke) %>% group_by(smoke) %>% summarise(sd = sd(bwt))
# A tibble: 2 x 2
  smoke    sd
  <fct> <dbl>
1 0      17.4
2 1      18.1
  # but we had these to hand already
  sd(smokers$noise)/sd(nonsmokers$noise)
[1] 1.040248
  • Formal test: we're good - \( H_0 \): variances equal; we fail to reject.
  var.test(smokers$noise, nonsmokers$noise)

    F test to compare two variances

data:  smokers$noise and nonsmokers$noise
F = 1.0821, num df = 483, denom df = 741, p-value = 0.336
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.9213317 1.2745363
sample estimates:
ratio of variances 
          1.082115 

Assumptions

This is not an esoteric list of requirements

  • They are simply checks that the model for noise is correct
  • We assume the noise values are independent draws from a single Normal distribution, hence
    • A single variance
    • Independence
    • Normal in shape

These are the same assumptions for many models, because that model for noise is common.
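To make that concrete, here is a minimal sketch simulating data under exactly this noise model (the group means and SD are invented numbers):

  set.seed(42)
  group <- rep(c("A", "B"), each = 100)
  mu    <- c(A = 120, B = 114)   # the 'signal': hypothetical group means
  sigma <- 18                    # a single variance for all the noise

  # response = signal + independent draws from one Normal distribution
  y <- mu[group] + rnorm(200, mean = 0, sd = sigma)

  t.test(y ~ group)              # the t-test's assumptions match how y was built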

Violations of assumption?

So, we're happy. However, if there are marked violations:

  • Normality: transformations, non-parametric variants (e.g. Wilcoxon tests), more appropriate model? e.g. different GLM.
  • Independence: if paired, easy; else models that account for dependencies e.g. mixed models, GEEs (see MT5757).
  • Homogeneity of variances: transformations, robust variants, more appropriate model? e.g. different GLM.

Usually best to choose a non-parametric variant, or just a more appropriate model (e.g. GLMs)
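For example, had Normality been badly violated above, the non-parametric analogue of our two-sample \( t \)-test is a one-line swap (a sketch reusing the smokingMothers data):

  # Wilcoxon rank-sum test: compares the groups without assuming Normal noise
  wilcox.test(bwt ~ smoke, data = smokingMothers)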

Cook-book \(t\)-test - continued

  • We've looked at two independent samples
  • However, data are often paired e.g. before and after treatment on the same subject
    • This is a clear violation of independence - the noise would likely have clear dependencies
    • However, if we re-pose these as differences within subject, all is well - it becomes a single sample \( t \)-test (on differences), as sketched below
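A minimal sketch of that equivalence (before/after are invented measurements on the same 10 subjects):

  set.seed(1)
  before <- rnorm(10, mean = 50, sd = 5)
  after  <- before + rnorm(10, mean = 2, sd = 1)   # same subjects: clearly dependent

  # a paired t-test and a one-sample t-test on the differences give identical results
  t.test(before, after, paired = TRUE)
  t.test(before - after, mu = 0)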

Precision in comparing proportions

SEs for proportions: we clearly need SEs for our CIs and tests.

  • Sometimes they are not so obvious - the standard error we use depends on how we obtained our data.
  • Three types of sampling situations are considered here (common in consumer survey data).

Proportions from independent samples

For example: A random sample of 1000 people born in New Zealand and 1000 people born in Scotland. Note: A respondent can't belong to both populations.

\[ se(\hat{p}_1-\hat{p}_2)=\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \]

One sample of size \(n\), several response categories

For example: A random sample of Scots are asked who they are going to vote for in the next election. Note: Respondents slot into ONE category and the proportions add to 1.

\[ se(\hat{p}_1-\hat{p}_2)=\sqrt{\frac{\hat{p}_1+\hat{p}_2-(\hat{p}_1-\hat{p}_2)^2}{n}} \]

One sample of size \(n\), many "Yes/No" items

For example: A random sample of Scots are asked:

  • Do you watch rugby?
  • Do you like beer?
  • Do you like licorice?

Note: Respondents can slot into MORE THAN ONE category.

\[ se(\hat{p}_1-\hat{p}_2)=\sqrt{\frac{Min(\hat{p}_1+\hat{p}_2, \hat{q}_1 +\hat{q}_2)-(\hat{p}_1-\hat{p}_2)^2}{n}} \]
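The three SEs are easy to code directly from the formulas above (a sketch; the \( \hat{p} \) and \( n \) values in the example call are invented):

  # 1. proportions from two independent samples
  seIndep <- function(p1, p2, n1, n2) sqrt(p1*(1 - p1)/n1 + p2*(1 - p2)/n2)

  # 2. one sample, mutually exclusive categories
  seMultinomial <- function(p1, p2, n) sqrt((p1 + p2 - (p1 - p2)^2)/n)

  # 3. one sample, many yes/no items (q1 = 1 - p1, q2 = 1 - p2)
  seYesNo <- function(p1, p2, n) {
    q1 <- 1 - p1; q2 <- 1 - p2
    sqrt((min(p1 + p2, q1 + q2) - (p1 - p2)^2)/n)
  }

  seIndep(0.4, 0.3, 1000, 1000)   # about 0.021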

Recap - normal approximation for proportions

(This is the table I failed to find previously)

In short, estimating small or large \( p \) puts us near the 0/1 boundary, so we need larger samples:

  • this will create a “tighter” sampling distribution, less affected by the boundary
  • relatedly, it allows the sampling distribution a predictable Normal shape
Value for \( \hat{p} \)    0.05  0.10  0.15  0.20  0.25  0.30  0.35  0.40  0.45  0.50
Value for \( \hat{p} \)    0.95  0.90  0.85  0.80  0.75  0.70  0.65  0.60  0.55  0.50
Minimum \( n \)            960   400   220   125   76    47    23    13    11    10

Better yet - use a GLM, like you will encounter in MT5761.
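For a small taste of that (a sketch, not the MT5761 treatment; 40 “successes” out of 60 trials is an invented example):

  # intercept-only binomial GLM for a single proportion
  fit <- glm(cbind(40, 20) ~ 1, family = binomial)

  # Wald CI on the log-odds scale, mapped back to the probability scale -
  # it cannot stray outside (0, 1), unlike the raw Normal approximation
  plogis(confint.default(fit))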

Type I error

The Type I error was mentioned - we elaborate a bit here

This is related to our desired level of significance and width of confidence intervals

[there is a Type II error, covered later]

Type I error

  • Our cutoff value (\( \alpha \)) specifies a priori our accepted chance of incorrectly rejecting \( H_0 \).
  • If 0.05, then we anticipate rejecting \( H_0 \) on the basis of our sample 1 in 20 times when it is true
  • A two-tailed test apportions this chance of being wrong equally above and below \( H_0 \)
  • we are equally likely to be wrong due to extreme samples above or below \( H_0 \)
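A quick simulation makes this tangible (a sketch: many samples drawn from a population where \( H_0 \) really is true):

  set.seed(2019)

  # 10,000 samples with mu = 0, so H0: mu = 0 is true; test each at alpha = 0.05
  pVals <- replicate(10000, t.test(rnorm(30, mean = 0), mu = 0)$p.value)

  mean(pVals < 0.05)   # close to 0.05 - we wrongly reject about 1 time in 20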

Type I error

  • A one-tailed test apportions this chance of being wrong to either above or below \( H_0 \)
  • We are saying it is impossible that the sampled population has a \( \mu \) either less than or greater than that specified under \( H_0 \).
  • A one-tailed test gives a lower test statistic threshold (but in only one direction).
  • A one-tailed test only accepts deviations from \( H_0 \) in one direction as evidence against \( H_0 \).

Recap and look-forwards

We've covered:

  • Practical vs statistical significance
  • Various types of \( t \)-tests
  • Assumptions associated and how to check them
  • Care in SEs for proportions
  • Type I error

Next:

  • Comparing many means simultaneously - ANOVA
  • (hence, the \( F \) distribution)
  • Post-hoc comparisons