How Do New York Open Statistical Programming Meetup Members Compare to the Average Barstool Pizza Rater?

The data set is called “pizza_jared” because Jared Lander, of Lander Analytics, hosts the monthly New York Open Statistical Programming Meetup. Pizza is served at every meetup and we are reminded to rate the pizza using an online app at the end. I’ll be calling it the New York Open Statistical Programming Meetup data set.

Conflict of Interest disclosure: I attend NYOSP meetups and have rated pizzas

Load

Load the data from the Tidy Tuesday Repository. I’m only interested in the pizza_barstool and pizza_jared data sets because it is a scientific fact that there is absolutely no pizza in the United States outside of New York City worth risking a papercut by opening the box.

pizza_jared <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-01/pizza_jared.csv") %>% 
  # mutate(poll_date = as.yearmon(lubridate::ymd_hms(as.POSIXct(time, origin = "1970-01-01") ))) %>% 
  mutate(num_ans = as.numeric(forcats::fct_rev(fct_inorder(answer))))
## Parsed with column specification:
## cols(
##   polla_qid = col_double(),
##   answer = col_character(),
##   votes = col_double(),
##   pollq_id = col_double(),
##   question = col_character(),
##   place = col_character(),
##   time = col_double(),
##   total_votes = col_double(),
##   percent = col_double()
## )
pizza_barstool <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-01/pizza_barstool.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   name = col_character(),
##   address1 = col_character(),
##   city = col_character(),
##   country = col_character()
## )
## See spec(...) for full column specifications.

Then Bake

Filter the larger Barstool data to include only those pizzerias that are present in the Jared set. Due to common pizzeria names, I filterd out non-New York cities.

The Jared pizza poll uses a 6 point likert scale Excellent->Good->Average->Poor->Fair->Never Again which captures the response as a string, so we need to first transform the factors to a numerical scale.

There are several score averages in the Barstool data set. I’m using the average of all ratings. Barstool uses a 0 to 10 scale.

I normalized the data by centering and scaling each data set’s review scores.

pizza_jared<-pizza_jared %>% 
  # manually set the levels from worst to best, transform them to numbers so that the better rating has a higher numerical score
  mutate(num_ans = as.numeric(factor(pizza_jared$answer, levels = c("Never Again", "Poor", "Fair", "Average", "Good", "Excellent"))))

pizzas<-pizza_barstool %>% filter(name %in% pizza_jared$place) %>% 
  select(name, avg_score = review_stats_all_average_score, price_level, city) %>% 
  filter(!city %in% c("Miami","Santa Monica"))

avg_jared <- pizza_jared %>% select(place, num_ans,answer,votes) %>% 
  mutate(score=num_ans * votes) %>% 
  group_by(place) %>% 
  summarise(tot_score = sum(score),
            tot_votes = sum(votes)) %>% 
  ungroup() %>% 
  filter(tot_votes !=0) %>% 
  mutate(jared_score = tot_score/tot_votes) %>% 
  left_join(pizzas, by=c("place" = "name")) %>% 
  filter(!is.na(avg_score)) %>% 
  #scale each set of score averages
  mutate(scale_jared = unlist(as.list(scale(jared_score))), 
         scale_avg = unlist(as.list(scale(avg_score)))) %>% 
  unnest() %>% 
  mutate(image="pz.png")

#create a long data set for the density plot
pizza_long<-avg_jared %>% 
  ungroup() %>% 
  select(place, scale_jared, scale_avg) %>% 
  gather(key=rater, value=score, 2:3)

Density Comparison Between NYOSP and Barstool

Jared’s meetup, NYSOP, displays a slightly stronger bimodal distribution than the Barstool folks.

  ggplot(pizza_long)+
  geom_density(aes(x=score,fill=rater), alpha=0.3)+
  scale_fill_manual(labels=c("Barstool", "New York Open\nData Science Meetup"),values=c(scale_avg = "darkgreen", scale_jared = "red"))+
  labs(title = "New York Open Statistical Programming Meetup Pizza Eaters Have\nSlightly Stronger Opinions Than the Average Barstool Rater", fill = "Rater",
       caption="data: https://github.com/rfordatascience/tidytuesday\nJared Lander\nBarstool Sports
       viz: @WireMonkey
       #TidyTuesday Week 40, 2019")+
  theme(legend.position = "bottom", panel.background = element_blank())

Point Comparison and Outliers

As we noted earlier, there is a stronger bimodal distribution for NYOSP ratings, but they (we) tend toward higher ratings. Perhaps this is selection bias by Jared, who is trying to serve NYC’s data elite only the finest pies.

I used analysis of variance to discover the three pizza joints with the highest variance in rating between the two data sets. I extracted the data by pulling the absolute value of the standardized residuals and and filtering the data set to get the pizzeria names.

The green line shows a normal regression, the red line is the actual linear regression.

Since this data set was suggested during Luda Janda’s ggplot presentation, I felt compelled to add some bling to my plot.

library(ggimage)
## Warning: package 'ggimage' was built under R version 3.6.1
#get the outliers where variance is greatest
pizza_aov<-aov(scale_avg~scale_jared, avg_jared)
r<-rstandard(pizza_aov)  ## get standardised residuals
order(abs(r), decreasing = TRUE)[1:3]
## [1]  1 18 23
labels<-avg_jared[order(abs(r), decreasing = TRUE)[1:3], ]

pizza_guy<-png::readPNG("pz_chef.png")
avg_jared %>% 
  mutate(image="pz.png") %>% 
ggplot(., aes(x=scale_jared, y=scale_avg))+
  scale_x_continuous(limits=c(-2, 2))+
  scale_y_continuous(limits=c(-2, 2))+
  geom_abline(slope = 1, color="darkgreen", size=2)+
  annotation_raster(pizza_guy, ymin=-2, ymax=-.5, xmin=1,xmax = 2)+
  labs(title="Points of Contention", subtitle = "comparing ratings of pizza joints by NYOSP\nand the average Barstool rater, greatest difference of opinion labeled",
       x="New York Open Statistical Programming Meetup", y="Barstool Raters",
       caption="data: https://github.com/rfordatascience/tidytuesday\nJared Lander\nBarstool Sports
       viz: @WireMonkey
       #TidyTuesday Week 40, 2019")+
  ggimage::geom_image(aes(image=image), size=.075)+
  geom_text(data=labels, aes(x=scale_jared, y=scale_avg, label=place), nudge_y = .25)+
  geom_smooth(method="lm", color="red")+
  theme(panel.grid = element_blank(),
        panel.background = element_blank())