March 12, 2021

ANOVA

  • ANOVA is an acronym for “ANalysis Of VAriance”
  • A statistical technique for comparing the mean of a scale-level dependent variable across the categories of a nominal-level variable with 2 or more categories
  • An extension of the \(t\) test and the \(z\) test to more than two groups
  • For this presentation, we will be exploring the usage of a one-way ANOVA test

How a one-way ANOVA test works

  • Assume that we have 3 groups (A, B, C) to compare
  1. Compute the variance within samples (also called the residual variance): the pooled variance of the observations around their own group means.
  2. Compute the variance between samples: take the mean of each group (A, B, C) and measure how much these group means vary around the overall mean.
  3. Compute the F-statistic as the ratio (variance between samples) / (variance within samples).
  • Note: A ratio near or below 1 means the group means differ no more than random variation would explain, so there is no evidence of a difference. A larger ratio is evidence against equal means; whether it is significant is judged from the p-value of the F-distribution with the appropriate degrees of freedom.
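
The three steps above can be sketched by hand in base R. This is an illustration only: R's built-in PlantGrowth data set (one numeric response, one factor with three levels) stands in for groups A, B and C, and the variable names are chosen for this example.

```r
# Stand-in data: PlantGrowth has a numeric response (weight) and a factor
# with three levels (ctrl, trt1, trt2) playing the role of groups A, B, C.
y <- PlantGrowth$weight
g <- PlantGrowth$group

k <- nlevels(g)   # number of groups
n <- length(y)    # total number of observations

group_means <- tapply(y, g, mean)
group_sizes <- tapply(y, g, length)
grand_mean  <- mean(y)

# Step 1: variance within samples (residual variance).
# ave(y, g) returns, for each observation, the mean of its own group.
ss_within <- sum((y - ave(y, g))^2)
ms_within <- ss_within / (n - k)

# Step 2: variance between sample means, weighted by group size.
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ms_between <- ss_between / (k - 1)

# Step 3: F-statistic and its p-value from the F-distribution.
f_stat  <- ms_between / ms_within
p_value <- pf(f_stat, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# Cross-check against R's own one-way ANOVA.
f_aov <- summary(aov(weight ~ group, data = PlantGrowth))[[1]][["F value"]][1]

cat("F by hand:", f_stat, " F from aov:", f_aov, " p-value:", p_value, "\n")
```

The hand-computed F should agree with the value reported by aov, which is a useful sanity check on the three steps.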

Data Exploration Time!

  • Now, let’s see an example!
  • Here we will explore a data set from Kaggle consisting of the marks secured by high school students from the United States in various subjects
  • This data set also includes supplemental information on each student including the educational background of parents, test preparation, lunch type, etc.
  • In this presentation, we will be comparing mean responses for math test scores amongst different groups

A Quick Look at the Student Performance Data Set

##   parental.level.of.education        lunch gender math.score
## 1           bachelor's degree     standard female         72
## 2                some college     standard female         69
## 3             master's degree     standard female         90
## 4          associate's degree free/reduced   male         47
## 5                some college     standard   male         76
## 6          associate's degree     standard female         71

Calculating Mean & SD of Student Math Scores by Levels of Parental Education

## # A tibble: 6 x 4
##   parental.level.of.education count  mean    sd
## * <chr>                       <int> <dbl> <dbl>
## 1 associate's degree            222  67.9  15.1
## 2 bachelor's degree             118  69.4  14.9
## 3 high school                   196  62.1  14.5
## 4 master's degree                59  69.7  15.2
## 5 some college                  226  67.1  14.3
## 6 some high school              179  63.5  15.9
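
A grouped summary like the one above could be produced with dplyr along these lines. This is a sketch under assumptions: the original summary code is not shown, the tibble output only suggests dplyr was used, and the data frame `preview` below rebuilds just the six rows from the preview slide, so its numbers will not match the full-data table.

```r
library(dplyr)

# Only the six preview rows shown earlier; NOT the full Kaggle data set.
preview <- data.frame(
  parental.level.of.education = c("bachelor's degree", "some college",
                                  "master's degree", "associate's degree",
                                  "some college", "associate's degree"),
  lunch      = c("standard", "standard", "standard",
                 "free/reduced", "standard", "standard"),
  gender     = c("female", "female", "female", "male", "male", "female"),
  math.score = c(72, 69, 90, 47, 76, 71)
)

# Count, mean and standard deviation of math scores per education level.
# Groups with a single row get sd = NA, since sd needs at least two values.
summary_tbl <- preview %>%
  group_by(parental.level.of.education) %>%
  summarise(
    count = n(),
    mean  = mean(math.score),
    sd    = sd(math.score)
  )

print(summary_tbl)
```

Applied to the full data frame instead of `preview`, the same pipeline would give one row per level of parental education.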

Visualizing the Data
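
A box plot of a score split by group could be drawn with ggplot2 roughly as follows. This is a sketch, not the original plotting code: `score_boxplot` is a hypothetical helper written for this example, and the built-in PlantGrowth data stands in for the student data.

```r
library(ggplot2)

# Hypothetical helper: box plot of a numeric column split by a grouping column.
# Column names are passed as strings and looked up with the .data pronoun.
score_boxplot <- function(df, group_col, value_col) {
  ggplot(df, aes(x = .data[[group_col]],
                 y = .data[[value_col]],
                 fill = .data[[group_col]])) +
    geom_boxplot() +
    labs(x = group_col, y = value_col) +
    theme(legend.position = "none")  # fill repeats the x axis, so hide the legend
}

# Stand-in example on built-in data.
p <- score_boxplot(PlantGrowth, "group", "weight")
print(p)
```

With the student data loaded as `dfstuper`, the same helper could be called as `score_boxplot(dfstuper, "parental.level.of.education", "math.score")`, or with `"lunch"` or `"gender"` as the grouping column.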

Computing One-Way ANOVA Tests

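  • Look at the p-values (the Pr(>F) column) as well as the significance codes
  • If a p-value is less than the chosen significance level (e.g. 0.05), we can conclude that there are significant differences between the group means for that factor
  • Note: the model below includes four factors at once, so each row of the table tests one factor; aov computes these tests sequentially, in the order the factors are listed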
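# Fit math.score on four categorical predictors (additive model, no interactions)
anova <- aov(math.score ~ parental.level.of.education + lunch + gender + 
               race.ethnicity, data = dfstuper)
# Print the ANOVA table: one row per factor, plus the residuals
print(summary(anova))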
##                              Df Sum Sq Mean Sq F value   Pr(>F)    
## parental.level.of.education   5   7296    1459   8.099 1.68e-07 ***
## lunch                         1  28724   28724 159.434  < 2e-16 ***
## gender                        1   6601    6601  36.640 2.01e-09 ***
## race.ethnicity                4   9070    2268  12.586 5.42e-10 ***
## Residuals                   988 177998     180                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
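
To see how the columns of this table relate to each other, the F values and p-values can be recomputed from the printed Df and Sum Sq columns. The sketch below uses only the rounded numbers shown in the output above, so results agree with the table only up to rounding; the data frame `tab` and its column names are made up for this example.

```r
# Df and Sum Sq copied from the printed ANOVA table (rounded values).
tab <- data.frame(
  term   = c("parental.level.of.education", "lunch", "gender", "race.ethnicity"),
  df     = c(5, 1, 1, 4),
  sum_sq = c(7296, 28724, 6601, 9070)
)
resid_df <- 988
resid_ss <- 177998

# Mean Sq = Sum Sq / Df, for each factor and for the residuals.
ms_resid    <- resid_ss / resid_df
tab$mean_sq <- tab$sum_sq / tab$df

# F value = factor Mean Sq / residual Mean Sq.
tab$f_value <- tab$mean_sq / ms_resid

# Pr(>F) = upper tail of the F-distribution with (Df, residual Df) degrees of freedom.
tab$p_value <- pf(tab$f_value, tab$df, resid_df, lower.tail = FALSE)

print(tab)
```

Each F value is therefore the "variance between" for that factor divided by the same residual "variance within", which is the ratio described earlier in the presentation.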

Visualizing Math Scores by Lunch Type!

Visualizing Math Scores by Gender!

Concluding Thoughts

  • R makes it extremely easy to do a one-way ANOVA test using the aov function
  • Box plots are a great way to visualize the data when doing ANOVA and can easily be created using R packages like ggplot2 or plotly
  • If we were to take this analysis one step further, R also has functions for multiple pairwise comparisons between group means, such as Tukey's Honest Significant Difference test (TukeyHSD)
  • THANK YOU for your time and feel free to ask any questions :)
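
For anyone who wants to try the pairwise follow-up mentioned above, a minimal sketch with TukeyHSD looks like this. It uses R's built-in PlantGrowth data as a stand-in rather than the student data, and the object names are chosen for this example.

```r
# One-way ANOVA on stand-in data: weight by group (ctrl, trt1, trt2).
fit <- aov(weight ~ group, data = PlantGrowth)

# Tukey's Honest Significant Difference: all pairwise comparisons of group
# means, with confidence intervals and p-values adjusted for multiplicity.
tuk <- TukeyHSD(fit)
print(tuk)

# The result is a list with one matrix per factor; each row is one pair of
# groups, with columns diff, lwr, upr and "p adj".
pairs <- tuk$group
```

For the student data, the equivalent would be a one-way fit such as `aov(math.score ~ parental.level.of.education, data = dfstuper)` passed to TukeyHSD.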