MT5762 Lecture 10

C. Donovan

Recap of lecture 9

  • We've looked at confidence intervals for means, differences of means, proportions:

    • inference arises from the sampling distribution (the distribution of sample means)
    • its variability is estimated via the SE; the shape arises from the CLT
  • We construct the interval such that 95% of such intervals (from repeated samples) would capture the true parameter

Recap of lecture 9

  • We've looked at hypothesis testing
    • we speculate about a sampling distribution for a parameter, based on some null hypothesis, e.g. that it is zero on average
    • other properties of this distribution are estimated from our sample, e.g. variability (using the SE); the shape arises from the CLT
    • we see how “likely” our sample estimate is given this Null Hypothesis distribution

We can measure this by integrating the relevant parts of the null distribution, i.e. the probability beyond the \( t_0 \) from our sample

Recap of lecture 9

[My laptop has a stylus! practice needed]

  • Integrate beyond \( |t_0| \) to get the \( p \)-value (for a two-tailed test; otherwise integrate below \( t_0 \) or above \( t_0 \))
  • OR… your chosen level of significance implies a “rejection zone”, e.g. 0.05 means \( |t_0|>1.96 \) rejects \( H_0 \) (for two-tailed tests with large \( n \))
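As a concrete illustration, both routes can be computed directly in R (a minimal sketch; the observed statistic and degrees of freedom are invented numbers):

  # hypothetical observed test statistic and degrees of freedom
  t0 <- 2.3
  df <- 100

  # route 1: two-tailed p-value - integrate both tails beyond |t0|
  2 * pt(-abs(t0), df = df)   # about 0.024, below 0.05

  # route 2: the rejection zone for alpha = 0.05 - reject H0 if |t0| exceeds this
  qt(0.975, df = df)          # about 1.98; tends to qnorm(0.975) = 1.96 for large n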

Finding the \(p\)-value

  • We use the distribution of the test statistic to find the \( p \)-value
  • The \( p \)-value is the probability, calculated assuming the null hypothesis is true, that sampling variation alone would produce data at least as discrepant as what we observed
  • A small \( p \) means our data are unlikely under the null hypothesis … suggesting it is not the generating process

A classical test sets a significance level a priori (e.g. 0.05). If our \( p \)-value is below this, we reject. Equivalently, if our \( t_0 \) is outside the region implied by the significance threshold, reject \( H_0 \).

Finding the \(p\)-value

Moving away from significance thresholds, the \( p \)-value might be considered the strength of evidence against the null hypothesis

Approximate \( p \)-value   Translation
\( >0.12 \)                 No evidence against \( H_0 \)
\( 0.10 \)                  Weak evidence against \( H_0 \)
\( 0.05 \)                  Some evidence against \( H_0 \)
\( 0.01 \)                  Strong evidence against \( H_0 \)
\( \leq 0.001 \)            Very strong evidence against \( H_0 \)

Interpretation of \( p \)-values – Wild and Seber, p379.

The \(p\)-value

  • While this is a key statistical tool for inference, it attracts criticism
  • The criticism really ought to be focused on its use - slavish adherence to significance levels is not good
  • Read Leek & Peng, and also Baker (both on Moodle).
  • Read parts of the RSS special issue on “the S-word” (link on Moodle)

Which brings us to….

Practical significance versus statistical significance

  • The word 'significant' can often cause confusion – for hypothesis tests a significant result commonly means the \( p \)-value is less than 0.05, not that the result is substantial and of practical importance.
  • Indeed, we define the significance level as part of our test, e.g. 0.05, so another analysis might set this differently
  • Statistical significance relates to the existence of an effect
  • Practical significance relates to the size of an effect

Practical significance versus statistical significance

For example, the World Health Organisation now lists “processed meats” as a group 1 carcinogen (http://www.who.int/features/qa/cancer-red-meat/en/)

  • The study was huge and the results are highly statistically significant
  • However, what does it mean in practical terms - statistical significance is one thing, but the effect may be small
  • One figure was an 18% increase in bowel cancer risk (but how likely was bowel cancer anyway? 18% more of a small number is not scary)
  • You might simply view the cancer risk as having moved from about 5% to about 6% by eating these modestly - so, your call

Practical significance versus statistical significance

[Two images: “This is sausage”; “This is cancer”]

A study aid, for remembering statistical vs practical significance. Partial plug for vegetarianism.

Practical significance versus statistical significance

  • By happy coincidence, this just popped up in the press again yesterday (30/09/19)
  • This highlights the difference in statistical and practical significance some more

Controversy comes back

Comparing means - classic \(t\)-tests

We've looked at the machinery; however, \( t \)-tests occur a lot in practice

These are based on a simple model, and there is a standard process followed (similar to later models)

The \(t\)-test cookbook

Identify a possible \( t \)-test when comparing:

  • means between two populations (represented by 2 samples) - is the true difference zero?
  • the mean of a population (represented by a sample) to a hypothetical value - is the true mean equal to it?
  • if paired, the population of paired differences (represented by the sample) - are these on average zero?
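Each case maps onto one call to R's t.test() (a sketch - myData and its columns are hypothetical placeholders):

  # two independent samples: is the true difference in means zero?
  t.test(response ~ group, data = myData)

  # one sample against a hypothetical value, e.g. mu = 100
  t.test(myData$response, mu = 100)

  # paired data: test the within-subject differences against zero
  t.test(myData$before, myData$after, paired = TRUE)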

Test

  • Generate hypotheses \( H_0 \) and \( H_A \). How likely is our data under \( H_0 \)?
  • Generate the \( t \)-test statistic - its distribution under \( H_0 \) gives the \( p \)-value.

The \(t\)-test cookbook

Conclude

  • Is the \( p \)-value < \( \alpha \)? (often 0.05; note \( \alpha \) implies a cutoff/critical value of the test statistic):
  • if so, reject \( H_0 \) as implausible.
  • if not, fail to reject \( H_0 \) because it is plausible.
  • Check assumptions: normality, independence, homogeneity of variances (for 2 independent samples)

Assumptions

You've done the test - it assumes things for validity. Mainly:

  • Normality (of errors/noise): examine QQ-Norm plots, Shapiro-Wilk test etc
  • Independence (of errors/noise): difficult to assess - examine the design
  • Homogeneity of variances (of errors/noise): rule-of-thumb, formal tests

These assumptions are common to other models (we're in the class of linear models: week 4)

Recall the baby-weight comparison

We want to compare the birthweights of babies from smoking and non-smoking mothers

  head(babyData)
  bwt gestation parity age height weight smoke
1 120       284      0  27     62    100     0
2 113       282      0  33     64    135     0
3 128       279      0  28     64    115     1
4 123       999      0  36     69    190     0
5 108       282      0  23     67    125     1
6 136       286      0  25     62     93     0

Recall the baby-weight comparison

Again - the point estimates

  babyData %>% filter(smoke!=9) %>% group_by(smoke) %>% summarise(mean = mean(bwt), SD = sd(bwt), n = n())
# A tibble: 2 x 4
  smoke  mean    SD     n
  <int> <dbl> <dbl> <int>
1     0  123.  17.4   742
2     1  114.  18.1   484

NB from previous description, “Smoking status of mother: 0=not now, 1=yes now, 9=unknown”

Recall the baby-weight comparison

  smokingMothers <- babyData %>% filter(smoke != 9) %>% mutate(smoke = factor(smoke))

  p <- ggplot(data = smokingMothers) + geom_histogram(aes(bwt, ..density.., fill = smoke), alpha = 0.5) 

  p

[Figure: overlaid density histograms of bwt, coloured by smoking status]

Recall the baby-weight comparison

Now we would go directly to the test

  t.test(bwt ~ smoke, data = smokingMothers)

    Welch Two Sample t-test

data:  bwt by smoke
t = 8.5813, df = 1003.2, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  6.89385 10.98148
sample estimates:
mean in group 0 mean in group 1 
       123.0472        114.1095 

Recall the baby-weight comparison

We conclude there is a significant difference between the groups in terms of underlying mean birth-weights:

  • \( p \)-value is very small (<2.2e-16), well below any reasonable threshold
  • these samples are very different to those implied by \( H_0 \)
  • the range of “plausible values” for the difference is (95% CI): (6.89, 10.98) - so 0/no difference is not plausible

How're the assumptions?

Assumptions - Normality

First a useful qualitative view - the QQ-norm plot

 # what's the noise look like? i.e. what's left after the signal is extracted?
 # The 'signal' is just group means - doin' it ugly:
 nonsmokers <- smokingMothers %>% filter(smoke == 0) %>% mutate(noise = bwt - mean(bwt)) 
 smokers <- smokingMothers %>% filter(smoke == 1) %>% mutate(noise = bwt - mean(bwt))  
 allNoise <- c(smokers$noise, nonsmokers$noise)
 qqnorm(allNoise)

Assumptions - Normality

First a useful qualitative view - the QQ-norm plot

[Figure: QQ-norm plot of allNoise]

  • If it is straight, then the data (here the noise) are quite Normal
  • This looks great actually. Normality of noise - tick.

Assumptions - Normality

A formal test of the same

  shapiro.test(allNoise)

    Shapiro-Wilk normality test

data:  allNoise
W = 0.99438, p-value = 0.0001468
  • It's a test - so an \( H_0 \) is in play
  • \( H_0 \) - the data are Normal
  • \( p \)-value is tiny (0.00015) - reject \( H_0 \) (sad-face?)
  • Not uncommon with lots of data (power to detect small departures is high) - the QQ-Norm indicates pretty Normal, so I'm happy

Assumptions - Independence

This is difficult to assess, particularly here.

  • We could, say, look at the noise in order of collection. Are there patterns?
  • However, we'll rely on sensible data collection and assume this is OK here

Assumptions - equal group variances

Homogeneity of variances (of errors/noise)

  • Rule of thumb - their SDs don't differ by more than a factor of 2. We're good - very similar
  # could calculate the groups' bwt SDs:
  smokingMothers %>% select(bwt, smoke) %>% group_by(smoke) %>% summarise(sd = sd(bwt))
# A tibble: 2 x 2
  smoke    sd
  <fct> <dbl>
1 0      17.4
2 1      18.1
  # but we had these to hand already
  sd(smokers$noise)/sd(nonsmokers$noise)
[1] 1.040248
  • Formal test: we're good - \( H_0 \): variances equal; we fail to reject.
  var.test(smokers$noise, nonsmokers$noise)

    F test to compare two variances

data:  smokers$noise and nonsmokers$noise
F = 1.0821, num df = 483, denom df = 741, p-value = 0.336
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.9213317 1.2745363
sample estimates:
ratio of variances 
          1.082115 

Assumptions

This is not an esoteric list of requirements

  • They are simply checks that the model for noise is correct
  • We assume the noise values are independent draws from a single Normal distribution, hence
    • A single variance
    • Independence
    • Normal in shape

These are the same assumptions for many models, because that model for noise is common.
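To make that concrete, here is a minimal sketch simulating data under exactly this noise model (the group means and SD are invented numbers):

  set.seed(42)
  group <- rep(c("A", "B"), each = 100)
  mu    <- c(A = 120, B = 114)   # the 'signal': hypothetical group means
  sigma <- 18                    # a single variance for all the noise

  # response = signal + independent draws from one Normal distribution
  y <- mu[group] + rnorm(200, mean = 0, sd = sigma)

  t.test(y ~ group)              # the t-test's assumptions match how y was built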

Violations of assumption?

So, we're happy. However, if there are marked violations:

  • Normality: transformations, non-parametric variants (e.g. Wilcoxon tests), more appropriate model? e.g. different GLM.
  • Independence: if paired, easy; else models that account for dependencies e.g. mixed models, GEEs (see MT5757).
  • Homogeneity of variances: transformations, robust variants, more appropriate model? e.g. different GLM.

Usually best to choose a non-parametric variant, or just a more appropriate model (e.g. GLMs)
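For example, had Normality been badly violated above, the non-parametric analogue of our two-sample \( t \)-test is a one-line swap (a sketch reusing the smokingMothers data):

  # Wilcoxon rank-sum test: compares the groups without assuming Normal noise
  wilcox.test(bwt ~ smoke, data = smokingMothers)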

Cook-book \(t\)-test - continued

  • We've looked at two independent samples
  • However, data are often paired e.g. before and after treatment on the same subject
    • This is a clear violation of independence - the noise would likely have clear dependencies
    • However, if we re-pose these as differences within subject, all is well - it becomes a single sample \( t \)-test (on differences), as sketched below
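A minimal sketch of that equivalence (before/after are invented measurements on the same 10 subjects):

  set.seed(1)
  before <- rnorm(10, mean = 50, sd = 5)
  after  <- before + rnorm(10, mean = 2, sd = 1)   # same subjects: clearly dependent

  # a paired t-test and a one-sample t-test on the differences give identical results
  t.test(before, after, paired = TRUE)
  t.test(before - after, mu = 0)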

Precision in comparing proportions

SEs for proportions: we clearly need SEs for our CIs and tests.

  • Sometimes they are not so obvious - the standard error we use depends on how we obtained our data.
  • Three types of sampling situations are considered here (common in consumer survey data).

Proportions from independent samples

For example: A random sample of 1000 people born in New Zealand and 1000 people born in Scotland. Note: A respondent can't belong to both populations.

\[ se(\hat{p}_1-\hat{p}_2)=\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \]

One sample of size \(n\), several response categories

For example: A random sample of Scots are asked who they are going to vote for in the next election. Note: Respondents slot into ONE category and the proportions add to 1.

\[ se(\hat{p}_1-\hat{p}_2)=\sqrt{\frac{\hat{p}_1+\hat{p}_2-(\hat{p}_1-\hat{p}_2)^2}{n}} \]

One sample of size \(n\), many "Yes/No" items

For example: A random sample of Scots are asked:

  • Do you watch rugby?
  • Do you like beer?
  • Do you like licorice?

Note: Respondents can slot into MORE THAN ONE category.

\[ se(\hat{p}_1-\hat{p}_2)=\sqrt{\frac{Min(\hat{p}_1+\hat{p}_2, \hat{q}_1 +\hat{q}_2)-(\hat{p}_1-\hat{p}_2)^2}{n}} \]
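The three SEs are easy to code directly from the formulas above (a sketch; the \( \hat{p} \) and \( n \) values in the example call are invented):

  # 1. proportions from two independent samples
  seIndep <- function(p1, p2, n1, n2) sqrt(p1*(1 - p1)/n1 + p2*(1 - p2)/n2)

  # 2. one sample, mutually exclusive categories
  seMultinomial <- function(p1, p2, n) sqrt((p1 + p2 - (p1 - p2)^2)/n)

  # 3. one sample, many yes/no items (q1 = 1 - p1, q2 = 1 - p2)
  seYesNo <- function(p1, p2, n) {
    q1 <- 1 - p1; q2 <- 1 - p2
    sqrt((min(p1 + p2, q1 + q2) - (p1 - p2)^2)/n)
  }

  seIndep(0.4, 0.3, 1000, 1000)   # about 0.021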

Recap - normal approximation for proportions

(This is the table I failed to find previously)

In short, estimating small or large \( p \) puts us near the 0/1 boundary, so we need larger samples:

  • this will create a “tighter” sampling distribution, less affected by the boundary
  • relatedly, it allows the sampling distribution a predictable Normal shape
Value for \( \hat{p} \)    0.05  0.10  0.15  0.20  0.25  0.30  0.35  0.40  0.45  0.50
Value for \( \hat{p} \)    0.95  0.90  0.85  0.80  0.75  0.70  0.65  0.60  0.55  0.50
Minimum \( n \)            960   400   220   125   76    47    23    13    11    10

Better yet - use a GLM, like you will encounter in MT5761.
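For a small taste of that (a sketch, not the MT5761 treatment; 40 “successes” out of 60 trials is an invented example):

  # intercept-only binomial GLM for a single proportion
  fit <- glm(cbind(40, 20) ~ 1, family = binomial)

  # Wald CI on the log-odds scale, mapped back to the probability scale -
  # it cannot stray outside (0, 1), unlike the raw Normal approximation
  plogis(confint.default(fit))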

Type I error

The Type I error was mentioned - we elaborate a bit here

This is related to our desired level of significance and width of confidence intervals

[there is a Type II error, covered later]

Type I error

  • Our cutoff value (\( \alpha \)) specifies a priori our accepted chance of incorrectly rejecting \( H_0 \).
  • If 0.05, then we anticipate rejecting \( H_0 \) on the basis of our sample 1 in 20 times when it is true
  • A two-tailed test apportions this chance of being wrong equally above and below \( H_0 \)
  • we are equally likely to be wrong due to extreme samples above or below \( H_0 \)
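A quick simulation makes this tangible (a sketch: many samples drawn from a population where \( H_0 \) really is true):

  set.seed(2019)

  # 10,000 samples with mu = 0, so H0: mu = 0 is true; test each at alpha = 0.05
  pVals <- replicate(10000, t.test(rnorm(30, mean = 0), mu = 0)$p.value)

  mean(pVals < 0.05)   # close to 0.05 - we wrongly reject about 1 time in 20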

Type I error

  • A one-tailed test apportions this chance of being wrong to either above or below \( H_0 \)
  • We are saying it is impossible that the sampled population has a \( \mu \) either less than or greater than that specified under \( H_0 \).
  • A one-tailed test gives a lower test statistic threshold (but in only one direction).
  • A one-tailed test only accepts deviations from \( H_0 \) in one direction as evidence against \( H_0 \).

Recap and look-forwards

We've covered:

  • Practical vs statistical significance
  • Various types of \( t \)-tests
  • Assumptions associated and how to check them
  • Care in SEs for proportions
  • Type I error

Next:

  • Comparing many means simultaneously - ANOVA
  • (hence, the \( F \) distribution)
  • Post-hoc comparisons