MT5762 Lecture 11

C. Donovan

Comparing means

Extending the \( t \)-tests to more groups

despair.com

Comparing three or more means

Titanium levels in Cannabis leaves

We want to investigate whether the average Titanium level in cannabis leaves differs across three soil types: Blockhouse Bay (bhb), Northland (nth) and Potting mix (pm).

Titanium levels in Cannabis leaves

  p <- ggplot(TiData) + geom_boxplot(aes(Group, Ti), fill = 'purple', alpha = 0.8)

  p

[Figure: boxplots of Ti levels by soil type]

Titanium levels in Cannabis leaves

  • We could use \( t \)-tests to compare the means pairwise: bhb with nth, bhb with pm, and nth with pm.
  • This becomes problematic when comparing many means.
  • Each time we perform a \( t \)-test we run the risk of drawing a false conclusion (a 5% chance of a false positive when testing at the 5% level of significance) - see the sketch below.
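A minimal sketch of how these per-test error rates compound, assuming (for simplicity) that the three pairwise tests are independent:

  k <- 3                # number of pairwise comparisons, each tested at the 5% level

  1 - (1 - 0.05)^k      # probability of at least one false positive across the k tests

With just three comparisons this is already about 14%, not the 5% we intended.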

What do we want to test?

  • We want one test that compares the mean of each group with each of the others.
  • For this, we can use a one-way analysis of variance (one-way ANOVA).
  • It is a one-way analysis since the focus is mean Titanium levels differing with one factor (soil type).

For an ANOVA:

  • The null hypothesis is that the means for all the groups are identical.
  • The alternative hypothesis is that at least one mean is different from one of the others, i.e. differences exist between some of the true means.
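In symbols, writing \( \mu_{bhb} \), \( \mu_{nth} \) and \( \mu_{pm} \) for the true mean Titanium levels:

\[ H_0: \mu_{bhb} = \mu_{nth} = \mu_{pm} \quad \text{vs} \quad H_1: \text{at least one } \mu \text{ differs from the others} \]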

ANOVA in a nutshell

[waves hands, uses board - attending lectures is useful]

ANOVA in a nutshell

So the testing process is similar to before:

  • Set up the hypotheses
  • Calculate a test statistic
  • Determine how likely such a test statistic is if \( H_0 \) is true
  • [By “likely” we mean of that magnitude or more extreme]

Is our data-estimate consistent with the null hypothesis?

Calculate the \( F \)-test statistic for our data, \( f_0 \), using the ratio of the variability between groups (\( s^2_B \)) and the variability within groups (\( s^2_W \)):

\[ f_0=\frac{s^2_B}{s^2_W} \]

Is our data-estimate consistent with the null hypothesis?

  • Variability between groups, \( {s^2_B} \), is a measure comparing the mean of each group, \( \bar{x}_{i} \), with the overall (or grand) mean (\( \bar{x}_{..} \)).
  • \( k \) represents the number of groups and \( n_i \) is the sample size for group \( i \).

\[ s^2_B=\frac{\sum_{i=1}^k n_i(\bar{x}_{i}-\bar{x}_{..})^2}{k-1} \]

Is our data-estimate consistent with the null hypothesis?

  • Variability within groups, \( {s^2_W} \), is a measure using the variance for each group, \( s^2_i \).
  • \( n_{tot} \) is the total number of observations across all groups.

\[ s^2_W=\frac{\sum_{i=1}^k (n_i -1)s^2_i}{n_{tot}-k} \]

  • The \( F \)-test statistic, \( f_0 \), tests \( H_0 \) by comparing the variability of the sample means (numerator) with the variability within the samples (denominator)

Is our data-estimate consistent with the null hypothesis?

  • If the variability between groups is large compared with the variability within groups then the observed \( f_0 \) statistic will be large.
  • Conversely, if the variability between groups is small compared with the variability within groups, then the \( f_0 \) statistic will be small.

Is our data-estimate consistent with the null hypothesis?

Evidence against \( H_0 \) is provided by values of \( f_0 \) which would be unusually large if \( H_0 \) were true

Finding the \(p\)-value

  • The \( F \)-test statistic has an \( F \)-distribution with \( df_1=k-1 \) and \( df_2=n_{tot}-k \).
  • \( k \) is the number of groups and \( n_{tot} \) is the total number of observations.
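  • For the Titanium data, for example, \( k=3 \) soil types and \( n_{tot}=46 \) leaves give \( df_1=2 \) and \( df_2=43 \).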

\( F \)-distributions take on a range of shapes:

Finding the \(p\)-value

  x <- seq(0, 5, length = 200)

  plot(x, df(x, df1 = 1, df2 = 1), lwd = 2, type ='l')

  lines(x, df(x, df1 = 10, df2 = 1), lwd = 2, col = 'blue')

  lines(x, df(x, df1 = 100, df2 = 100), lwd = 2, col = 'purple')

[Figure: \( F \) density curves for several combinations of \( df_1 \) and \( df_2 \)]

Finding the \(p\)-value

  • The \( F \)-test statistic has an \( F \)-distribution with \( df_1=k-1 \) and \( df_2=n_{tot}-k \).
  • \( k \) is the number of groups and \( n_{tot} \) is the total number of observations.
  • We obtain the \( p \)-value in the standard way - find the probability of getting a test statistic at least as discrepant as the observed \( f_0 \) when the null hypothesis is true
  • i.e. we find \( Pr(F \geq f_0) \).

Finding the \(p\)-value

\[ s^2_B=\frac{\sum_{i=1}^k n_i(\bar{x}_{i}-\bar{x}_{..})^2}{k-1} \]

  # requires the dplyr package (for %>%, group_by and summarise)
  gpSummary <- TiData %>% group_by(Group) %>% summarise(mean = mean(Ti), var = var(Ti), N = n())

  groupSS <- sum(gpSummary$N*(gpSummary$mean-mean(TiData$Ti))^2)

  mseBetween <- groupSS/(3-1)

  mseBetween
[1] 59.29798

Finding the \(p\)-value

\[ s^2_W=\frac{\sum_{i=1}^k (n_i -1)s^2_i}{n_{tot}-k} \]

  withinSS <- sum((gpSummary$N-1)*gpSummary$var)

  mseWithin <- withinSS/(nrow(TiData)-3)

  mseWithin
[1] 2.094315

Finding the \(p\)-value

  f_0 <- mseBetween/mseWithin

  pf(f_0, df1 = 2, df2 = nrow(TiData)-3, lower.tail = FALSE)
[1] 1.426962e-08

The \(F\)-distribution

Of course, you'd not do it like that - simply:

  weedANOVA <- aov(Ti ~ Group, data = TiData)

  summary(weedANOVA)
            Df Sum Sq Mean Sq F value   Pr(>F)    
Group        2 118.60   59.30   28.31 1.43e-08 ***
Residuals   43  90.06    2.09                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusions?

  • The \( p \)-value is very small, so there is very strong evidence against the null hypothesis.
  • At least one of the means is significantly different from one of the others.
  • Which means are different? We can build 95% CIs comparing the means of each pair of groups, but each has an error rate of 5%.
  • These errors compound, so we might adjust the error rate of the intervals to obtain an overall error rate of 5%.

Conclusions?

  • We can often use the Tukey option (which computes 'Tukey Honest Significant Differences') to produce confidence intervals with an overall error rate of 5%
  • Adjustments follow later

What have we assumed?

In order for \( F \)-test results to be valid, we need the following assumptions to be met:

  • Independence: The data are sampled independently
  • Normality: The data for each group appears to have come from a Normal distribution
  • Constant spread: The underlying standard deviations for each group appear to be equal

Again - all about the noise: A simple model applies - the noise looks like independent draws from one Normal distribution

What have we assumed?

  • The 'independence' assumption is critical for valid results, while \( F \)-test results are reasonably robust to departures from the 'constant spread' and 'Normality' assumptions.
  • Confidence intervals (using the Tukey option for example) are not robust to departures from the 'constant spread' assumption.
  • A conservative rule of thumb: an \( F \)-test should give reliable results if the largest standard deviation of the groups is no larger than twice the smallest standard deviation of the groups.

Conclusions?

  • We'll assume the values are independent - hopefully the data collection was sensible.

Let's check the others:

Conclusions?

  qqnorm(weedANOVA$residuals)
  qqline(weedANOVA$residuals)
  • Roughly a straight line. Residuals (error/noise/inexplicable stuff) are pretty Normal. Tick

[Figure: Normal Q-Q plot of the ANOVA residuals]

Conclusions?

  shapiro.test(weedANOVA$residuals)

    Shapiro-Wilk normality test

data:  weedANOVA$residuals
W = 0.98038, p-value = 0.6216
  • Big \( p \)-value, fail to reject \( H_0 \): “The data are normal”. Tick

Conclusions?

  TiData %>% group_by(Group) %>% summarise(SD = sd(Ti))
# A tibble: 3 x 2
  Group    SD
  <fct> <dbl>
1 bhb    1.32
2 nth    1.62
3 pm     1.45
  • They're all pretty similar, tick.
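As a quick numerical check of the earlier rule of thumb (a small sketch, assuming TiData and dplyr are loaded):

  sds <- TiData %>% group_by(Group) %>% summarise(SD = sd(Ti))

  max(sds$SD)/min(sds$SD)  # largest group SD relative to the smallest - we want this below 2

Here the ratio is about 1.62/1.32, roughly 1.2, comfortably inside the rule of thumb.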

Post hoc tests/analysis

So, you've found a significant ANOVA result

The immediate question is: what is different from what?

So, we're back to the multiple comparisons problem - this can be addressed by modified testing (adjustments for multiple comparisons)

Post hoc tests/analysis

Recall

  • We were not doing lots of \( t \)-tests due to the inflation of the type-1 error
  • We conduct an overarching test (sometimes called an omnibus test), which means our type-1 error is fixed
    • Non-significant? No problem, we stop. Any individual differences we find are likely to be sampling variation
    • Significant? We want to look at specific comparisons to find the cause: was A > B? C < D? etc.

Adjustments for multiple comparisons

The obvious(?) fix is to alter our significance threshold for each test (make it “harder”), so the collective type-1 error rate is reasonable

  • This overall error rate is the family error rate
  • It is achieved by alterations to the multiple individual tests

There are many such approaches and they are not without controversy (see the end of this lecture)

Adjustments for multiple comparisons

We will consider 3 types of adjustment here (there are many, many more):

  • Simple \( p \)-value adjustment - the Bonferroni adjustment
  • Simple \( p \)-value adjustment - the Sidak adjustment
  • All possible pairs - Tukey's Honest Significant Difference (HSD)

Bonferroni

This one is really easy. We set a threshold \( p \)-value by dividing by the number of comparisons we plan:

\[ \alpha_{adj} = \alpha/k \]

where \( \alpha_{adj} \) is our new threshold, \( \alpha \) is the desired type 1 error collectively (family error rate) and \( k \) is the number of comparisons
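In R the same correction is usually applied by scaling the \( p \)-values up rather than the threshold down. A minimal sketch (not from the lecture; the three \( p \)-values below are hypothetical, and the pairwise tests assume TiData is loaded):

  # Bonferroni-adjust some hypothetical p-values: each is multiplied by the number of comparisons, capped at 1
  p.adjust(c(0.012, 0.030, 0.450), method = "bonferroni")

  # or run all pairwise t-tests on the Ti data with Bonferroni-adjusted p-values
  pairwise.t.test(TiData$Ti, TiData$Group, p.adjust.method = "bonferroni")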

Bonferroni

For example, if we have a control that we plan to compare against each of 3 treatments, we adjust:

\[ \alpha_{adj} = \alpha/k = 0.05/3 = 0.01666667 \]

So we need \( p \)-value < 0.0166 to conclude a significant result.

Observe that the probability of 1 or more false positives is:

1-pbinom(0, 3,  0.05/3) # recall  from binomial lectures
[1] 0.0491713

It gives a family error rate \( \le \alpha \)

Sidak (Šidák) adjustment

We set a threshold \( p \)-value based on the number of comparisons we plan:

\[ \alpha_{adj} = 1 - (1 - \alpha)^{1/k} \]

where \( \alpha_{adj} \) is our new threshold, \( \alpha \) is the desired type 1 error collectively (family error rate) and \( k \) is the number of comparisons
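A one-line sketch of this adjustment in R, using the numbers from the example on the next slide:

  alpha <- 0.05; k <- 3

  1 - (1 - alpha)^(1/k)  # Sidak-adjusted per-test threshold, roughly 0.01695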

Sidak adjustment

For example, if we have a control that we plan to compare against each of 3 treatments, to give an overall type 1 error of 5%:

\[ \alpha_{adj} = 1 - (1 - \alpha)^{1/k} = 1 - (1 - 0.05)^{1/3} = 0.01695243 \]

So we need \( p \)-value < 0.01695 to conclude a significant result.

Observe that the probability of 1 or more false positives is:

1-pbinom(0, 3, 0.01695243) # recall  from binomial lectures
[1] 0.05000001

which recovers the desired 5% family error rate - this is the rationale for the adjustment.

Tukey's Honest Significant Difference (HSD)

This is generally well-regarded (the previous adjustments are often considered too conservative).

  • We'll not look at the details extensively
  • It is similar to creating confidence intervals for differences like we did with \( t \)-distributions, but modifies the SE and multiplier
  • The intervals become wider
  • We can then check to see if zero is contained within these to determine significance with a family rate of \( \alpha \)

Tukey's Honest Significant Difference (HSD)

For example - using our ANOVA previously

  TukeyHSD(weedANOVA)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Ti ~ Group, data = TiData)

$Group
            diff        lwr      upr     p adj
nth-bhb 2.477778  0.9544691 4.001086 0.0008236
pm-bhb  3.750000  2.5402571 4.959743 0.0000000
pm-nth  1.272222 -0.1008697 2.645314 0.0743237
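A quick way to visualise these intervals (not shown in the lecture) is the plot method for TukeyHSD objects:

  plot(TukeyHSD(weedANOVA))  # one CI per pairwise difference; intervals crossing 0 are not significant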

Tukey's Honest Significant Difference (HSD)

So, conclude the ANOVA result was due to:

  • nth being significantly higher than bhb - on average 0.95 to 4.00 units higher (95% confidence)
  • pm being significantly higher than bhb - on average 2.54 to 4.96 units higher (95% confidence)

Whereas pm and nth are not significantly different - on average pm was between 0.10 units lower and 2.65 units higher than nth (95% confidence), so 0 is plausible

Multiple comparison adjustment controversy

For the interested - papers on Moodle:

  • Rothman (1990) - “No Adjustments Are Needed for Multiple Comparisons”
  • Tukey (1991) - “The philosophy of multiple comparisons”

Multiple comparison controversy

In short, we are trading one error for another

  • Type 1 error: the incorrect rejection of \( H_0 \)
    • i.e. a false positive/falsely significant result
  • Type 2 error: the incorrect failure to reject \( H_0 \)
    • i.e. a false negative/falsely insignificant result

Clearly we can control type 1 at the cost of type 2:

  • Set the threshold for significance to be a very small \( p \)-value - we won't make many type 1 errors, but type 2 could be large

Multiple comparison controversy

This relates to a concept of power, which we cover next.

  • The choice of adjusting or not is not trivial - our conclusions can change massively
  • Whether to do it or not is context specific:
    • Sometimes people think you obviously should, because they are really worried about type 1 errors/false positives (e.g. decide a treatment is effective when it isn't)
    • Sometimes it isn't obvious, because type 2 errors/false negatives could be concerning (e.g. exploring new cancer drugs and might miss something promising)

Recap and look-forwards

We've covered:

  • Comparing many means simultaneously - ANOVA
  • (hence, the \( F \) distribution)
  • Assumptions and checking for this model
  • Post-hoc comparisons

Next:

  • Power and type 2 errors
  • A bit of matrix algebra
  • A bit of likelihood
  • Basically, the start of our linear model journey