The data set is called “pizza_jared” because Jared Lander, of Lander Analytics, hosts the monthly New York Open Statistical Programming Meetup. Pizza is served at every meetup and we are reminded to rate the pizza using an online app at the end. I’ll be calling it the New York Open Statistical Programming Meetup data set.
Conflict of Interest disclosure: I attend NYOSP meetups and have rated pizzas
Load the data from the Tidy Tuesday Repository. I’m only interested in the pizza_barstool and pizza_jared data sets because it is a scientific fact that there is absolutely no pizza in the United States outside of New York City worth risking a papercut by opening the box.
pizza_jared <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-01/pizza_jared.csv") %>%
# mutate(poll_date = as.yearmon(lubridate::ymd_hms(as.POSIXct(time, origin = "1970-01-01") ))) %>%
mutate(num_ans = as.numeric(forcats::fct_rev(fct_inorder(answer))))
## Parsed with column specification:
## cols(
## polla_qid = col_double(),
## answer = col_character(),
## votes = col_double(),
## pollq_id = col_double(),
## question = col_character(),
## place = col_character(),
## time = col_double(),
## total_votes = col_double(),
## percent = col_double()
## )
pizza_barstool <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-01/pizza_barstool.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## name = col_character(),
## address1 = col_character(),
## city = col_character(),
## country = col_character()
## )
## See spec(...) for full column specifications.
Filter the larger Barstool data to include only those pizzerias that are present in the Jared set. Due to common pizzeria names, I filterd out non-New York cities.
The Jared pizza poll uses a 6 point likert scale Excellent->Good->Average->Poor->Fair->Never Again which captures the response as a string, so we need to first transform the factors to a numerical scale.
There are several score averages in the Barstool data set. I’m using the average of all ratings. Barstool uses a 0 to 10 scale.
I normalized the data by centering and scaling each data set’s review scores.
pizza_jared<-pizza_jared %>%
# manually set the levels from worst to best, transform them to numbers so that the better rating has a higher numerical score
mutate(num_ans = as.numeric(factor(pizza_jared$answer, levels = c("Never Again", "Poor", "Fair", "Average", "Good", "Excellent"))))
pizzas<-pizza_barstool %>% filter(name %in% pizza_jared$place) %>%
select(name, avg_score = review_stats_all_average_score, price_level, city) %>%
filter(!city %in% c("Miami","Santa Monica"))
avg_jared <- pizza_jared %>% select(place, num_ans,answer,votes) %>%
mutate(score=num_ans * votes) %>%
group_by(place) %>%
summarise(tot_score = sum(score),
tot_votes = sum(votes)) %>%
ungroup() %>%
filter(tot_votes !=0) %>%
mutate(jared_score = tot_score/tot_votes) %>%
left_join(pizzas, by=c("place" = "name")) %>%
filter(!is.na(avg_score)) %>%
#scale each set of score averages
mutate(scale_jared = unlist(as.list(scale(jared_score))),
scale_avg = unlist(as.list(scale(avg_score)))) %>%
unnest() %>%
mutate(image="pz.png")
#create a long data set for the density plot
pizza_long<-avg_jared %>%
ungroup() %>%
select(place, scale_jared, scale_avg) %>%
gather(key=rater, value=score, 2:3)
Jared’s meetup, NYSOP, displays a slightly stronger bimodal distribution than the Barstool folks.
ggplot(pizza_long)+
geom_density(aes(x=score,fill=rater), alpha=0.3)+
scale_fill_manual(labels=c("Barstool", "New York Open\nData Science Meetup"),values=c(scale_avg = "darkgreen", scale_jared = "red"))+
labs(title = "New York Open Statistical Programming Meetup Pizza Eaters Have\nSlightly Stronger Opinions Than the Average Barstool Rater", fill = "Rater",
caption="data: https://github.com/rfordatascience/tidytuesday\nJared Lander\nBarstool Sports
viz: @WireMonkey
#TidyTuesday Week 40, 2019")+
theme(legend.position = "bottom", panel.background = element_blank())
As we noted earlier, there is a stronger bimodal distribution for NYOSP ratings, but they (we) tend toward higher ratings. Perhaps this is selection bias by Jared, who is trying to serve NYC’s data elite only the finest pies.
I used analysis of variance to discover the three pizza joints with the highest variance in rating between the two data sets. I extracted the data by pulling the absolute value of the standardized residuals and and filtering the data set to get the pizzeria names.
The green line shows a normal regression, the red line is the actual linear regression.
Since this data set was suggested during Luda Janda’s ggplot presentation, I felt compelled to add some bling to my plot.
library(ggimage)
## Warning: package 'ggimage' was built under R version 3.6.1
#get the outliers where variance is greatest
pizza_aov<-aov(scale_avg~scale_jared, avg_jared)
r<-rstandard(pizza_aov) ## get standardised residuals
order(abs(r), decreasing = TRUE)[1:3]
## [1] 1 18 23
labels<-avg_jared[order(abs(r), decreasing = TRUE)[1:3], ]
pizza_guy<-png::readPNG("pz_chef.png")
avg_jared %>%
mutate(image="pz.png") %>%
ggplot(., aes(x=scale_jared, y=scale_avg))+
scale_x_continuous(limits=c(-2, 2))+
scale_y_continuous(limits=c(-2, 2))+
geom_abline(slope = 1, color="darkgreen", size=2)+
annotation_raster(pizza_guy, ymin=-2, ymax=-.5, xmin=1,xmax = 2)+
labs(title="Points of Contention", subtitle = "comparing ratings of pizza joints by NYOSP\nand the average Barstool rater, greatest difference of opinion labeled",
x="New York Open Statistical Programming Meetup", y="Barstool Raters",
caption="data: https://github.com/rfordatascience/tidytuesday\nJared Lander\nBarstool Sports
viz: @WireMonkey
#TidyTuesday Week 40, 2019")+
ggimage::geom_image(aes(image=image), size=.075)+
geom_text(data=labels, aes(x=scale_jared, y=scale_avg, label=place), nudge_y = .25)+
geom_smooth(method="lm", color="red")+
theme(panel.grid = element_blank(),
panel.background = element_blank())