Comparing Means from Multiple groups using ANOVA

We often want to compare means among multiple groups. In a hypothesis testing scenario, this is basically testing the null hypothesis that all of their population means are equal.

\(H_0: \mu_1 = \mu_2 = ... = \mu_k\) for \(k\) different groups against \(H_A:\) at least one pair of the means are different

Another way of asking this question is to consider whether all of different groups are conceivably drawn from the same underlying population.

To examine this situation and motivate our test, let’s load a library and look at some data. Specifically, we are going to use some data on the flower morphology of three different species of iris. Specifically, let’s look at the width of the sepals, the petal-like structures that irises use to attract their pollinators.

require(mosaic)
data(iris)
histogram(~Sepal.Width, fit="normal", data=iris)

These data clearly look quite normally distributed. However, we need to keep in mind that we are actually looking at three different species. Do the species differ in their sepal widths.

histogram(~Sepal.Width|Species, fit="normal", data=iris, layout=c(1,3))

Consider the overall variation that we find in the first graph and how it sorts out into the three different species. One way to ask whether the species differ is to consider how much variation there is between the species relative to how much variation there is within each species on average. Compare the overall statistics of the irises to the statistics for each species individually.

favstats(~Sepal.Width, data=iris)
##  min  Q1 median  Q3 max     mean        sd   n missing
##    2 2.8      3 3.3 4.4 3.057333 0.4358663 150       0
favstats(Sepal.Width~Species, data=iris)
##      Species min    Q1 median    Q3 max  mean        sd  n missing
## 1     setosa 2.3 3.200    3.4 3.675 4.4 3.428 0.3790644 50       0
## 2 versicolor 2.0 2.525    2.8 3.000 3.4 2.770 0.3137983 50       0
## 3  virginica 2.2 2.800    3.0 3.175 3.8 2.974 0.3224966 50       0

Partitioning Variation

In an Analysis of Variance (ANOVA) our goal is to actually statistically model the variation that we observe. In the simplest terms, we can hypothesize that the total variability that we observe can be broken down into variation between our groups and variation within groups.

DATA = MODEL(GROUP) + ERROR(WITHIN GROUP)

Mathematically, just like when we calculated the variance of a sample, we can describe variation by looking at sums of squared deviations of observations from some mean. It turns out that these sums of squares (\(SS\)) actually add up, such that

\(SS_T = SS_M + SS_E\).

Here \(SS_T\) is the total sum of squares (the sum of the squared deviations from the overall mean), \(SS_M\) is the model (or group) sum of squares (the sum of the deviations of each group mean from the overall mean), and \(SS_E\) is the error sum of squares (the sum of each observation from its group mean). Mathematically:

\(SS_T = \sum_{j=1}^{N}(x_j-\bar{x}_G)^2\)

\(SS_M = \sum_{i=1}^{k}n_i(\bar{x}_i-\bar{x}_G)^2\)

\(SS_E = \sum_{i=1}^{k}(n_i-1)s_i^2\)

Each of these sums of squares also has a degrees of freedom (df) associated with it,

\(DF_T = DF_M + DF_E\) = \((N-1) = (k-1) + (N-k)\)

We can divide each sum of squares by its corresponding degrees of freedom to yeild mean squared deviations or mean squares for each component of the variation.

The total mean square is just the overall variance: \(MS_T = \frac{SS_T}{N-1} = s^2_G\)

The model mean square is the average variation (SS) of the group means from the overall grand mean: \(MS_M = \frac{SS_M}{k-1}\)

The error mean square is the average SS of observations from their group means: \(MS_E = \frac{SS_E}{N-k}\)

So to get back to our original question, we want to compare \(MS_M\) and \(MS_E\). We can do this by taking their ratio, which gives us our test statistic \(F\)

\(F = \frac{MS_M}{MS_E}\)

This statistic has a well defined distribution with two parameters, \(df_1=k-1\) and \(df_2=N-k\).

In our iris example, \(k-1 = 2\) and \(N-k=147\). In this case the F-distribution looks like this

plotDist('f', df1=2, df2=147)

The value of \(F\) gets large when the variation between group means (\(MS_M\)) is large compared to variation within groups (\(MS_E\)).

We can calculate all of these components of variation using the lm() function.

sepal.mod<-lm(Sepal.Width~Species, data=iris)
anova(sepal.mod)
## Analysis of Variance Table
## 
## Response: Sepal.Width
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Species     2 11.345  5.6725   49.16 < 2.2e-16 ***
## Residuals 147 16.962  0.1154                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This “ANOVA table” gives us all of the requisite information, and by comparing our value of \(F=49.16\) to the null distribution above, we can see why the p-value is basically zero.

So what is the result of our hypothesis test?