ANOVA Testing

2026-04-14

Definition

ANOVA: ANalysis of VAriance
tests whether the means of 3 or more groups differ, by comparing the variance between groups with the variance within groups
uses the test statistic F

Conditions

The 4 assumptions that must be met:

each group's variance must be equal (\(\sigma^2_1 = \sigma^2_2 = \sigma^2_3 = \dots = \sigma^2_K\))
all groups are normally distributed
groups are independently sampled
individuals within each group are also independently sampled
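
The first two assumptions can be checked numerically. The following is a minimal sketch, assuming simulated scores for three hypothetical groups (not data from this document) and using Bartlett's test and the Shapiro-Wilk test as one common choice among several. Independence cannot be checked this way; it depends on how the data were collected.

```r
# Simulated scores for three hypothetical groups (illustrative only)
set.seed(1)
scores <- data.frame(
  value = c(rnorm(30, mean = 70, sd = 10),
            rnorm(30, mean = 72, sd = 10),
            rnorm(30, mean = 71, sd = 10)),
  group = factor(rep(c("A", "B", "C"), each = 30))
)

# Equal variances: Bartlett's test (H0: all group variances are equal)
var_check <- bartlett.test(value ~ group, data = scores)

# Normality: Shapiro-Wilk test within each group (H0: the group is normal)
norm_check <- tapply(scores$value, scores$group,
                     function(x) shapiro.test(x)$p.value)

var_check$p.value  # a large p-value gives no evidence against equal variances
norm_check         # one p-value per group
```

Large p-values here mean only that the tests found no evidence against the assumptions; plots of each group are still worth inspecting.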

Hypothesis

Null hypothesis: the group means are all equal \[H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_K\]
Alternative hypothesis: at least one group mean differs from the others \[H_a: \text{at least one } \mu_j \text{ differs}\]

Output

Five output values:
- Degrees of Freedom
- Sum of Squares
- Mean Sum of Squares
- F statistic
- Significance F (p-value)
These values are reported in a table with three rows (Factor, Error, Total)

Degrees of Freedom

denoted as df
Factor: \(df_F = K - 1\), where K is number of groups
Total: \(df_T = n - 1\), where n is the total sample size of all groups
Error: \(df_E = df_T - df_F\)
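
As a quick check of these formulas, here is a small sketch using a hypothetical helper, `anova_df` (a name introduced only for illustration). The inputs K = 5 and n = 154 are inferred from the Df column of the example's ANOVA table (4 + 149 + 1) rather than stated directly in the notes.

```r
# Degrees of freedom for a one-way ANOVA
# K = number of groups, n = total sample size across all groups
anova_df <- function(K, n) {
  df_factor <- K - 1
  df_total  <- n - 1
  df_error  <- df_total - df_factor  # equivalently n - K
  c(factor = df_factor, error = df_error, total = df_total)
}

anova_df(K = 5, n = 154)
# factor = 4, error = 149, total = 153
```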

Sum of Squares

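Factor: \(SS_{Factor} = \sum_{j = 1}^{K} n_j(\bar{x}_j - \bar{x})^2\), where \(\bar{x}_j\) is the mean of group j, \(n_j\) is its sample size, and \(\bar{x}\) is the mean of all observations
Error: \(SS_{Error} = \sum_{j = 1}^{K} \sum_{i = 1}^{n_j} (x_{ij} - \bar{x}_j)^2\), where \(x_{ij}\) is observation i in group j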
Total: \(SS_{Total} = SS_{Factor} + SS_{Error}\)
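
The three sums of squares can be computed by hand to see how they fit together. This is a minimal sketch on toy numbers chosen for easy arithmetic (three hypothetical groups of three values each, not data from the example).

```r
# Toy data: three hypothetical groups (values are illustrative only)
x <- c(4, 5, 6,  7, 8, 9,  1, 2, 3)
g <- factor(rep(c("g1", "g2", "g3"), each = 3))

grand_mean  <- mean(x)               # mean of all observations
group_means <- tapply(x, g, mean)    # one mean per group
group_sizes <- tapply(x, g, length)  # n_j for each group

# Between-group variation: group means around the grand mean
ss_factor <- sum(group_sizes * (group_means - grand_mean)^2)

# Within-group variation: each observation around its own group mean
ss_error <- sum((x - group_means[as.character(g)])^2)

# Total variation: each observation around the grand mean
ss_total <- sum((x - grand_mean)^2)

c(factor = ss_factor, error = ss_error, total = ss_total)
# factor = 54, error = 6, total = 60
```

The total computed directly from the data (60) equals the sum of the factor and error components (54 + 6), which is the identity stated for the Total row.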

F Statistic

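\[F = \frac{MS_{Factor}}{MS_{Error}} = \frac{SS_{Factor}/(K-1)}{SS_{Error}/(n-K)}\]
- each mean sum of squares (MS) is a sum of squares divided by its degrees of freedom
- Significance F is the p-value: the probability of an F at least this large if the null hypothesis is true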
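
To illustrate the formula, the sketch below recomputes F and its p-value from the Sum Sq and Df values printed in the example's ANOVA table in these notes. The p-value is taken as the upper-tail area of the F distribution using base R's `pf`. Because the table's sums of squares are rounded, the results should come out close to, but not necessarily identical to, the reported 0.759 and 0.5536.

```r
# Values copied from the example ANOVA table in these notes
ss_factor <- 2587;   df_factor <- 4
ss_error  <- 126955; df_error  <- 149

ms_factor <- ss_factor / df_factor  # mean square for the factor
ms_error  <- ss_error / df_error    # mean square for the error

f_stat <- ms_factor / ms_error

# Upper-tail area of the F distribution = the p-value
p_value <- pf(f_stat, df_factor, df_error, lower.tail = FALSE)

round(f_stat, 3)
round(p_value, 4)
```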

Example

Suppose we want to determine whether the average scores on the same test differ between classes. We will look at five class sections: Class 1, Class 2, Class 3, Class 4 and Class 5.
The null hypothesis is: \[H_0: \mu_{\text{Class 1}} = \mu_{\text{Class 2}} = \mu_{\text{Class 3}} = \mu_{\text{Class 4}} = \mu_{\text{Class 5}}\]
The alternative hypothesis is: \[H_a: \text{at least one } \mu \text{ is different}\]
- we will be using a significance level of 0.05

Example

# `class` is assumed to be a data frame loaded earlier, with one column of scores per class section
columns <- class[, c("Class.1", "Class.2", "Class.3", "Class.4", "Class.5")]

# Per-class summary statistics, ignoring missing scores
means <- sapply(columns, function(x) mean(x, na.rm = TRUE))
standarddeviation <- sapply(columns, function(x) sd(x, na.rm = TRUE))
samplesize <- sapply(columns, function(x) sum(!is.na(x)))

# Display the summary as a table
library(plotly)
plot_ly(
  type = 'table',
  header = list(values = c("Class", "Means", "Standard Deviation", "Sample Size")),
  cells = list(values = list(
    c("Class.1", "Class.2", "Class.3", "Class.4", "Class.5"),
    round(means, 2),
    round(standarddeviation, 2),
    samplesize
  ))
)

Example

The small differences between the group standard deviations support the equal-variance assumption, and the similar shapes of the graphs suggest that each class is approximately normally distributed.

Example

##              Df Sum Sq Mean Sq F value Pr(>F)
## Class         4   2587  646.68   0.759 0.5536
## Residuals   149 126955  852.05
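
The notes do not show the call that produced this table. A table of this shape could be produced along the following lines; this is a sketch only, assuming a wide data frame with one column per class. Here `class_sim` is a hypothetical stand-in filled with simulated scores, so its numbers will not match the table above.

```r
# Hypothetical stand-in for the `class` data frame: simulated scores,
# one column per class section (NOT the data behind the table above)
set.seed(42)
class_sim <- data.frame(
  Class.1 = rnorm(30, mean = 70, sd = 25),
  Class.2 = rnorm(30, mean = 70, sd = 25),
  Class.3 = rnorm(30, mean = 70, sd = 25),
  Class.4 = rnorm(30, mean = 70, sd = 25),
  Class.5 = rnorm(30, mean = 70, sd = 25)
)

# Reshape from wide (one column per class) to long (one row per score)
long <- stack(class_sim)
names(long) <- c("Score", "Class")

# Fit the one-way ANOVA and print the Df / Sum Sq / Mean Sq / F / p table
fit <- aov(Score ~ Class, data = long)
summary(fit)
```

With five classes of 30 simulated scores each, the Df column reads 4 for Class and 145 for Residuals, following the degrees-of-freedom formulas given earlier.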

Example

Because the p-value (0.5536) is larger than 0.05, we fail to reject the null hypothesis: there is not enough evidence to conclude that the mean test scores differ between the classes.