- image taken from: What is analysis of variance (ANOVA), 2024.
03/31/2024
To give some working examples on ANOVA testing, I will be pulling in some data on pizza statistics (Datafiniti & Shion, 2019). The goal will be to see if there is any statistical significance on pizza prices across the United States.
Key Information about the Dataset:
Variables examined:
menus.amountMax: Price of the pizzacity: City name where the pizza was servedprovince: State abbreviation where the pizza was servedBefore doing any ANOVA testing, I want to provide an overview of the prices across each state. For that, I will show the code for a simple boxplot on this slide. On the next slide, I will show the actual graph.
ggplot(pizza_data_clean, aes(x = province, y = menus.amountMax)) +
geom_boxplot() + labs(title = "Boxplot of Pizza Prices by State",
x = "State (Province)",
y = "Maximum Menu Price (USD)") +
theme_minimal() + theme(axis.text.x = element_text(angle = 90,
vjust = 0.5),
plot.title = element_text(hjust = 0.5,
face = "bold"))
ANOVA (Analysis of Variance) testing is a statistical method used to compare means across multiple groups, to see if there are statistically significant differences between them (PennState Eberly College of Science, n.d.). ANOVA testing works by comparing the within-group variance to between-group variance. The F-Statistic, below, is used to compare this variance.
\(F = \frac{MS_{between}}{MS_{within}}\)
The next slide will use ANOVA testing to see if where you are within the United States with affect the price of your pizza.
## Df Sum Sq Mean Sq F value Pr(>F) ## province 43 31648 736 18.39 <2e-16 *** ## Residuals 9903 396360 40 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The code above is the ANOVA summary from the dataset. The F-Value being equal to 18.39 is much higher than if the price were spread randomly (in which case, the expected F-Value would be ~1.0). Additionally, since the p-value < 0.001 (in this case, <2e-16), the null hypothesis (i.e., that there is no significant price difference between states) can be confidently rejected. Thus, this ANOVA test strongly indicates that location does influence pizza price.
Since a significant variance was detected between states on the previous slide, now it is time to dive a little deeper with Post-Hoc testing (i.e., testing done after a significance test to determine where the differences are) (McClenaghan, E., 2023). For this example, I will use Tukey’s Honestly Significant Different (HSD) Test. The HSD formula is shown below (Zaiontz, C., n.d.).
\(HSD = q\sqrt{\frac{MS_{within}}{n}}\)
q = studentized range distrubitionMS_within = mean square within groupsn = sample size per group.Datafiniti & Shion (2019). Pizza Restaurants and the Pizza They Sell (Version 2) [Data Set]. Kaggle. https://www.kaggle.com/datasets/datafiniti/pizza-restaurants-and-the-pizza-they-sell
McClenaghan, E. (2023). Post-Hoc Tests in Statistical Analysis. Technology Networks: Neuroscience News & Research. https://www.technologynetworks.com/neuroscience/articles/post-hoc-tests-in-statistical-analysis-371174
PennState Eberly College of Science. (n.d.). The ANOVA Table. https://online.stat.psu.edu/stat415/lesson/13/13.2
What is analysis of variance (ANOVA)? [Data Analysis]. (2024). Spotfire. https://www.spotfire.com/glossary/what-is-analysis-of-variance-anova
Zaiontz, C. (n.d.). Tukey HSD (Honestly Significant Difference). Real Statistics Using Excel. https://real-statistics.com/one-way-analysis-of-variance-anova/unplanned-comparisons/tukey-hsd/