03/31/2024

Introduction to ANOVA and Pizza

This presentation will explore the statistical ANOVA test, while exploring a data set on pizza price across the United States.
  • image taken from: What is analysis of variance (ANOVA), 2024.

Data for This Presentation

To give some working examples on ANOVA testing, I will be pulling in some data on pizza statistics (Datafiniti & Shion, 2019). The goal will be to see if there is any statistical significance on pizza prices across the United States.

Key Information about the Dataset:

  • 9,947 observations after removing any price over $40 (this removed 53 observations, which was the majority of the outliers)

Variables examined:

  • menus.amountMax: Price of the pizza
  • city: City name where the pizza was served
  • province: State abbreviation where the pizza was served

Visualizing Pizza Prices Across States

Before doing any ANOVA testing, I want to provide an overview of the prices across each state. For that, I will show the code for a simple boxplot on this slide. On the next slide, I will show the actual graph.

ggplot(pizza_data_clean, aes(x = province, y = menus.amountMax)) + 
  geom_boxplot() + labs(title = "Boxplot of Pizza Prices by State", 
                        x = "State (Province)", 
                        y = "Maximum Menu Price (USD)") + 
  theme_minimal() + theme(axis.text.x = element_text(angle = 90, 
                                                     vjust = 0.5),
                          plot.title = element_text(hjust = 0.5, 
                                                    face = "bold"))

Boxplot of Pizza Prices by State

Understanding ANOVA

ANOVA (Analysis of Variance) testing is a statistical method used to compare means across multiple groups, to see if there are statistically significant differences between them (PennState Eberly College of Science, n.d.). ANOVA testing works by comparing the within-group variance to between-group variance. The F-Statistic, below, is used to compare this variance.

\(F = \frac{MS_{between}}{MS_{within}}\)

The next slide will use ANOVA testing to see if where you are within the United States with affect the price of your pizza.

ANOVA and Pizza Prices

##               Df Sum Sq Mean Sq F value Pr(>F)    
## province      43  31648     736   18.39 <2e-16 ***
## Residuals   9903 396360      40                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The code above is the ANOVA summary from the dataset. The F-Value being equal to 18.39 is much higher than if the price were spread randomly (in which case, the expected F-Value would be ~1.0). Additionally, since the p-value < 0.001 (in this case, <2e-16), the null hypothesis (i.e., that there is no significant price difference between states) can be confidently rejected. Thus, this ANOVA test strongly indicates that location does influence pizza price.

Post-Hoc Analysis

Since a significant variance was detected between states on the previous slide, now it is time to dive a little deeper with Post-Hoc testing (i.e., testing done after a significance test to determine where the differences are) (McClenaghan, E., 2023). For this example, I will use Tukey’s Honestly Significant Different (HSD) Test. The HSD formula is shown below (Zaiontz, C., n.d.).

\(HSD = q\sqrt{\frac{MS_{within}}{n}}\)

  • q = studentized range distrubition
  • MS_within = mean square within groups
  • n = sample size per group.

HSD Post-Hoc Graph

USA Map of Average Pizza Prices

Summary:

Key Findings:

  • ANOVA testing can check if there are significant differences in a variable (like pizza price) across multiple other variables (like states in the USA).
  • Post-Hoc testing (like HSD) can then be used after finding a significant difference, to narrow down exactly where the significant difference was.

Next Steps:

  • Further research could examine pizza price in addition to other variables, such as whether or not the pizza is in a rural or urban area, or if pizza price correlates with overal cost of living.

References