2026-04-14

Definition

  • ANOVA: ANalysis of VAriance
  • tests whether the means of 3 or more groups differ, by comparing the variance between groups with the variance within groups
  • uses the test statistic F

Conditions

The 4 assumptions that must be met:

  • each group’s variance must be equal: \(\sigma^2_1 = \sigma^2_2 = \sigma^2_3 = \dots = \sigma^2_K\)
  • the data in each group are normally distributed
  • groups are independently sampled
  • within each group, the individuals are also independently sampled
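The equal-variance and normality conditions can be checked numerically as well as by eye. The sketch below is one possible approach, not necessarily the one used in these slides: it runs Bartlett's test and the Shapiro-Wilk test from base R on made-up scores. The data frame `scores` and its values are hypothetical.

```r
# Hypothetical scores for three groups (made-up data, for illustration only)
scores <- data.frame(
  score = c(78, 85, 92, 70, 88, 81, 75, 90, 83, 79, 86, 91),
  group = factor(rep(c("A", "B", "C"), each = 4))
)

# Equal variances: Bartlett's test (H0: all group variances are equal)
bartlett_result <- bartlett.test(score ~ group, data = scores)

# Normality: Shapiro-Wilk test within each group (H0: the data are normal)
shapiro_p <- tapply(scores$score, scores$group,
                    function(x) shapiro.test(x)$p.value)

# Informal check: ratio of the largest to the smallest group standard deviation
group_sd <- tapply(scores$score, scores$group, sd)
sd_ratio <- max(group_sd) / min(group_sd)

print(bartlett_result$p.value)
print(shapiro_p)
print(sd_ratio)
```

Large p-values here mean the tests found no evidence against the assumptions. A common rule of thumb also treats a standard deviation ratio below about 2 as acceptable.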

Hypothesis

  • Null hypothesis: the means of all groups are equal \[H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_K\]
  • Alternative hypothesis: at least one group mean differs from the others \[H_a: \text{at least one } \mu_j \text{ differs}\]

Output

  • Five output values:
    • Degrees of Freedom
    • Sum of Squares
    • Mean Sum of Squares
    • F statistic
    • Significance F (p-value)
  • presented as a table with three rows (Factor, Error, Total)

Degrees of Freedom

  • denoted as df
  • Factor: \(df_F = K - 1\), where K is the number of groups
  • Total: \(df_T = n - 1\), where n is the total sample size of all groups
  • Error: \(df_E = df_T - df_F\)
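
As a worked instance, the ANOVA table in the example reports 4 and 149 degrees of freedom, which is consistent with \(K = 5\) classes and \(n = 154\) scores in total (the value of \(n\) is inferred from the table, not stated directly):

\[df_F = 5 - 1 = 4, \qquad df_T = 154 - 1 = 153, \qquad df_E = 153 - 4 = 149\]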

Sum of Squares

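  • Factor: \(SS_{Factor} = \sum_{j = 1}^{K} n_j(\bar{x}_j - \bar{x})^2\)
  • Error: \(SS_{Error} = \sum_{j = 1}^{K} \sum_{i = 1}^{n_j} (x_{ij} - \bar{x}_j)^2\)
  • Total: \(SS_{Total} = SS_{Factor} + SS_{Error}\)
  • here \(n_j\) is the size of group j, \(\bar{x}_j\) is the mean of group j, \(\bar{x}\) is the mean of all observations, and \(x_{ij}\) is observation i in group j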
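
To make the sums of squares concrete, here is a minimal sketch that computes them by hand in base R. The three groups and their values are made up for illustration. The last line checks that the factor and error sums of squares add up to the total computed directly from the grand mean.

```r
# Hypothetical data: three groups of scores (made up for illustration)
groups <- list(
  A = c(78, 85, 92, 70),
  B = c(88, 81, 75, 90),
  C = c(83, 79, 86, 91)
)

all_values  <- unlist(groups)
grand_mean  <- mean(all_values)         # overall mean of all observations
group_means <- sapply(groups, mean)     # mean of each group
group_sizes <- sapply(groups, length)   # n_j for each group

# SS_Factor: squared distance of each group mean from the grand mean, weighted by n_j
ss_factor <- sum(group_sizes * (group_means - grand_mean)^2)

# SS_Error: squared distance of each observation from its own group mean
ss_error <- sum(sapply(groups, function(x) sum((x - mean(x))^2)))

# SS_Total computed directly, to compare against SS_Factor + SS_Error
ss_total_direct <- sum((all_values - grand_mean)^2)

print(c(factor = ss_factor, error = ss_error, total = ss_factor + ss_error))
print(all.equal(ss_factor + ss_error, ss_total_direct))
```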

F Statistic

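\[F = \frac{SS_{Factor}/df_F}{SS_{Error}/df_E} = \frac{SS_{Factor}/(K-1)}{SS_{Error}/(n-K)}\]

  • Significance F is the p-value: the probability of an F value at least this large when the null hypothesis is true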
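
The F statistic and its p-value can be reproduced from the sums of squares with a few lines of base R. This sketch plugs in the rounded values from the ANOVA table in the example. The total sample size of 154 is an assumption inferred from the residual degrees of freedom (149 + 5).

```r
# Values copied from the ANOVA table in the example (rounded as printed)
ss_factor <- 2587
ss_error  <- 126955
K <- 5      # number of classes
n <- 154    # total sample size, inferred from the residual df (assumption)

# Mean squares: each sum of squares divided by its degrees of freedom
ms_factor <- ss_factor / (K - 1)
ms_error  <- ss_error / (n - K)

# F statistic: ratio of between-group to within-group mean square
f_stat <- ms_factor / ms_error

# p-value: upper-tail probability of the F distribution
p_value <- pf(f_stat, df1 = K - 1, df2 = n - K, lower.tail = FALSE)

print(f_stat)
print(p_value)
```

The results should land close to the F value and p-value reported in the table. Small differences come from rounding in the printed sums of squares.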

Example

Suppose we want to determine whether the average score on the same test differs between classes. We will look at five class sections: Class 1, Class 2, Class 3, Class 4 and Class 5.

The null hypothesis is: \[H_0: \mu_{\text{Class 1}} = \mu_{\text{Class 2}} = \mu_{\text{Class 3}} = \mu_{\text{Class 4}} = \mu_{\text{Class 5}}\]

The alternative hypothesis is: \[H_a: \text{at least one } \mu \text{ is different}\]

  • we will be using a significance level of 0.05

Example

# Select the five class columns from the data frame `class`
columns = class[, c("Class.1", "Class.2", "Class.3", "Class.4", "Class.5")]

# Per-class summary statistics, ignoring missing scores
means = sapply(columns, function(x) mean(x, na.rm = TRUE))
standarddeviation = sapply(columns, function(x) sd(x, na.rm = TRUE))
samplesize = sapply(columns, function(x) sum(!is.na(x)))

# Display the summary as a plotly table
library(plotly)
plot_ly(
  type = 'table',
  header = list(values = c("Class", "Means", "Standard Deviation", "Sample Size")),
  cells = list(values = list(
    names(columns),
    round(means, 2),
    round(standarddeviation, 2),
    samplesize
  ))
)

Example

  • The standard deviations are similar across classes, which supports the equal-variance assumption, and the graphs are similar in shape, so we treat the normality assumption as reasonable.

Example

##               Df Sum Sq Mean Sq F value Pr(>F)
## Class         4   2587  646.68   0.759 0.5536
## Residuals   149 126955  852.05
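
The slides do not show the call that produced this table. A table of this shape is what `summary()` prints for a one-way `aov()` fit, so the sketch below shows one plausible way to get it from wide-format data. The data frame `class_scores` and its values are hypothetical, and `stack()` is just one way to reshape the columns into a single score column plus a class label.

```r
# Hypothetical wide-format data, one column per class (made-up scores)
class_scores <- data.frame(
  Class.1 = c(78, 85, 92, 70),
  Class.2 = c(88, 81, 75, 90),
  Class.3 = c(83, 79, 86, 91)
)

# stack() reshapes to long format: 'values' holds the scores,
# 'ind' holds the name of the column each score came from
long <- stack(class_scores)
names(long) <- c("Score", "Class")

# One-way ANOVA of Score by Class
fit <- aov(Score ~ Class, data = long)

# The first element of the summary is the ANOVA table itself
anova_table <- summary(fit)[[1]]
print(anova_table)
```

With 3 classes of 4 scores each, the Df column reads 2 for Class and 9 for Residuals, matching \(K - 1\) and \(n - K\).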

Example

Because the p-value (0.5536) is larger than the significance level of 0.05, we fail to reject the null hypothesis: there is not enough evidence to conclude that the mean test scores differ between the classes.