In the analysis of these 3 datasets, we are addressing average rating, location, and pricing information of various pizza establishments around the US. The goal is to assist the average American consumer in making informed decisions about which pizza restaurants they would like to patronize.
This analysis is based on 3 datasets with different information about pizza restaurants in the US:
Barstool - Contains critic, public and the barstool staff’s rating as well as pricing, location and geolocation of the different pizza places.
Jared - Contains ratings for New York pizza restaurants on a 6-point likert scale.
Datafiniti - Contains price range, location and keywords for pizza restaurants around the US.
Region - Contains state by region from the US Census Bureau
To learn more about the datasets used, click here or view a more detailed explanation in the Data Preparation section below.
Our approach was to divide the analysis to focus on two groups of consumers: 1) those who are price-conscious and 2) those who are interested in finding the pizza place near them with the highest rating.
library(tidyverse) # data manipulation and cleaning
library(knitr) # data dictionary tables
library(kableExtra) # table formatting
library(readxl) # importing dataset descriptions
library(leaflet) # interactive map & geolocation
The pizza data was originally compiled on September 30th, 2019 as a part of a social data project called TidyTuesday. The original intent is for R users to practice their creativity and analytical techniques, specifically with the tidyverse package.
For those who are interested, there’s a 5-minute podcast reviewing the top entries and different visualization techniques for Week 7’s project called “Pizza Party”.
We created the regions dataset based on the definitions from the US Census Bureau.
| variable | class | description |
|---|---|---|
| polla_qid | integer | Quiz ID |
| answer | character | Answer (likert scale) |
| votes | integer | Number of votes for that question/answer combo |
| pollq_id | integer | Poll Question ID |
| question | character | Question |
| place | character | Pizza Place |
| time | integer | Time of quiz |
| total_votes | integer | Total number of votes for that pizza place |
| percent | double | Vote percent of total for that pizza place |
| variable | class | description |
|---|---|---|
| name | character | Pizza place name |
| address1 | character | Pizza place address |
| city | character | City |
| zip | double | Zip |
| country | character | Country |
| latitude | double | Latitude |
| longitude | double | Longitude |
| price_level | double | Price rating (smaller = cheaper) |
| provider_rating | double | Provider review score |
| provider_review_count | double | Provider review count |
| review_stats_all_average_score | double | Average Score |
| review_stats_all_count | double | Count of all reviews |
| review_stats_all_total_score | double | Review total score |
| review_stats_community_average_score | double | Community average score |
| review_stats_community_count | double | community review count |
| review_stats_community_total_score | double | community review total score |
| review_stats_critic_average_score | double | Critic average score |
| review_stats_critic_count | double | Critic review count |
| review_stats_critic_total_score | double | Critic total score |
| review_stats_dave_average_score | double | Dave (Barstool) average score |
| review_stats_dave_count | double | Dave review count |
| review_stats_dave_total_score | double | Dave total score |
| variable | class | description |
|---|---|---|
| polla_qid | integer | Quiz ID |
| answer | character | Answer (likert scale) |
| votes | integer | Number of votes for that question/answer combo |
| pollq_id | integer | Poll Question ID |
| question | character | Question |
| place | character | Pizza Place |
| time | integer | Time of quiz |
| total_votes | integer | Total number of votes for that pizza place |
| percent | double | Vote percent of total for that pizza place |
colSums(is.na(barstool))
## name address1
## 0 0
## city zip
## 0 0
## country latitude
## 0 2
## longitude price_level
## 2 0
## provider_rating provider_review_count
## 0 0
## review_stats_all_average_score review_stats_all_count
## 0 0
## review_stats_all_total_score review_stats_community_average_score
## 0 0
## review_stats_community_count review_stats_community_total_score
## 0 0
## review_stats_critic_average_score review_stats_critic_count
## 0 0
## review_stats_critic_total_score review_stats_dave_average_score
## 0 0
## review_stats_dave_count review_stats_dave_total_score
## 0 0
colSums(is.na(jared))
## polla_qid answer votes pollq_id question place
## 0 0 0 0 0 0
## time total_votes percent
## 0 0 5
colSums(is.na(datafiniti))
## name address city country
## 0 0 0 0
## province latitude longitude categories
## 0 0 0 0
## price_range_min price_range_max
## 0 0
colSums(is.na(region))
## State State Code Region Division
## 0 0 0 0
The jared dataset pizza ratings are on a 1 to 6 likert scale. However, barstool is on a 1-10 scale. We used a mutate function to convert the 1-6 scale to a 1-10 scale:
jared <- jared %>%
mutate(answer = case_when(
.$answer=="Never Again" ~ 0,
.$answer=="Poor" ~ 2,
.$answer=="Fair" ~ 4,
.$answer=="Average" ~ 6,
.$answer=="Good"~ 8,
.$answer=="Excellent" ~ 10))
jared$answer <- as.integer(jared$answer)
str(jared$answer)
## int [1:375] 10 8 6 2 0 10 8 6 2 0 ...
Once the answers are converted to integers, a calculation is done to find the average rating for each pizza place:
jared <- mutate(jared,Weighted_Rating = answer*votes)
(Jared_Average <- jared %>%
group_by(place) %>%
summarise(avg_score = sum(Weighted_Rating)/sum(votes)))
## # A tibble: 56 x 2
## place avg_score
## <chr> <dbl>
## 1 5 Boroughs Pizza 7.33
## 2 Artichoke Basille's Pizza 8
## 3 Arturo's 7.43
## 4 Bella Napoli 7.07
## 5 Ben's of SoHo 14th Street 4.8
## 6 Ben's of SoHo Spring Street 6.44
## 7 Big Slice Pizza 6.27
## 8 Bleecker Street Pizza 8.29
## 9 Bravo Pizza NaN
## 10 Cavallo's Pizza 7.27
## # ... with 46 more rows
(Jared_Mean <- mean(Jared_Average$avg_score, na.rm = T))
## [1] 6.802782
Two columns are dropped for being insignificant:
jared$polla_qid = NULL
jared$time = NULL
head(jared)
This calculation takes the mean of average scores omiting the 0 values
Critic_average <- mean(NA^(barstool$review_stats_critic_average_score == 0)*barstool$review_stats_critic_average_score, na.rm = TRUE)
Daves_average <- mean(NA^(barstool$review_stats_dave_average_score == 0)*barstool$review_stats_dave_average_score, na.rm = TRUE)
Community_average <- mean(NA^(barstool$review_stats_community_average_score == 0)*barstool$review_stats_community_average_score, na.rm = TRUE)
The three calculations below allow to compare to average scores found in the barstool dataset (Critic, Community and Barstool Staff (Dave)) with the scores in the jared dataset.
Pizza_Averages <- c("Dave (Barstool)","Community","Critic","Jared")
y <- c(6.622,7.0846,7.256,6.802)
barplot(y,names.arg = Pizza_Averages,xlab = "Average Review",ylab = "Rating",col = "blue", main = "Pizza Averages")
Represents the values that are in both the barstool and jared datasets
## [1] "Williamsburg Pizza" "Little Italy Pizza"
## [3] "Steve's Pizza" "Girello"
## [5] "5 Boroughs Pizza" "Saluggi's"
## [7] "Artichoke Basille's Pizza" "Bleecker Street Pizza"
## [9] "Joe's Pizza" "Champion Pizza"
## [11] "Prince Street Pizza" "Kiss My Slice"
## [13] "Arturo's" "Vinny Vincenz"
## [15] "Stella's Pizza" "Gotham Pizza"
## [17] "Highline Pizza" "Pizza Italia"
## [19] "Rivoli Pizza" "NY Pizza Suprema"
## [21] "Previti Pizza" "Rocco's Pizza Joint"
Since there weren’t any NA values in the datafiniti dataset, we simply removed duplicate values and changed the name of “province”" to “State Code” to align with the column name in the region dataset.
# Remove duplicates
datafiniti_unique <- unique(datafiniti)
datafiniti_unique # 2,285 unique observations
## # A tibble: 2,285 x 10
## name address city country province latitude longitude categories
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 Shot~ 4203 E~ Sher~ US AR 34.8 -92.2 Pizza,Res~
## 2 Sauc~ 25 E C~ Phoe~ US AZ 33.5 -112. Pizza,Piz~
## 3 Mios~ 3703 P~ Cinc~ US OH 39.1 -84.4 Restauran~
## 4 Hung~ 30495 ~ Madi~ US MI 42.5 -83.1 Pizza,Car~
## 5 Spar~ 3600 E~ Balt~ US MD 39.3 -76.6 Pizza,Ame~
## 6 La V~ 1834 E~ Berk~ US CA 37.9 -122. Pizza Pla~
## 7 Bric~ 4819 K~ Tall~ US FL 30.5 -84.2 Pizza Pla~
## 8 Carr~ 4061 2~ Gran~ US MI 42.9 -85.6 Restauran~
## 9 Dome~ 146 N ~ Glen~ US CA 34.1 -118. Pizza,Res~
## 10 Litt~ 965 N ~ El P~ US TX 31.9 -107. Restauran~
## # ... with 2,275 more rows, and 2 more variables: price_range_min <dbl>,
## # price_range_max <dbl>
# rename province
names(datafiniti_unique)[names(datafiniti_unique) == 'province'] <- 'State Code'
Our next task was to make sense of the categories column, which originally contained values in one column:
head(datafiniti_unique$categories)
## [1] "Pizza,Restaurant,American restaurants,Pizza Place,Restaurants"
## [2] "Pizza,Pizza Place,Restaurants"
## [3] "Restaurant,Pizza Place,Restaurants"
## [4] "Pizza,Carry-out food,Pizza Place,Restaurants"
## [5] "Pizza,American restaurants,Pizza Place,Pizza equipment and supplies,Restaurants"
## [6] "Pizza Place"
We accomplished this by:
## loc_id name
## 1 1 Shotgun Dans Pizza
## 2 2 Sauce Pizza Wine
## 3 3 Mios Pizzeria
## 4 4 Hungry Howies Pizza
## 5 5 Spartan Pizzeria
## 6 6 La Vals
# Separate categories
datafiniti_cat <- datafiniti_unique %>% separate(categories,
c("Cat1", "Cat2", "Cat3", "Cat4", "Cat5","Cat6",
"Cat7", "Cat8", "Cat9", "Cat10", "Cat11"), sep = ",")
## loc_id Cat1 Cat2 Cat3
## 1 1 Pizza Restaurant American restaurants
## 2 2 Pizza Pizza Place Restaurants
## 3 3 Restaurant Pizza Place Restaurants
## 4 4 Pizza Carry-out food Pizza Place
## 5 5 Pizza American restaurants Pizza Place
## 6 6 Pizza Place <NA> <NA>
## Cat4 Cat5 Cat6 Cat7 Cat8 Cat9 Cat10
## 1 Pizza Place Restaurants <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 4 Restaurants <NA> <NA> <NA> <NA> <NA> <NA>
## 5 Pizza equipment and supplies Restaurants <NA> <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# gather categories
datafiniti_cat <-
datafiniti_cat %>%
gather(Cat, Category, 2:11) %>%
arrange(loc_id)
# remove NA categories
datafiniti_cat <- datafiniti_cat[!is.na(datafiniti_cat$Category), ]
# remove Cat column
datafiniti_cat <- datafiniti_cat[ , c(1,3)]
head(datafiniti_cat)
## loc_id Category
## 1 1 Pizza
## 2 1 Restaurant
## 3 1 American restaurants
## 4 1 Pizza Place
## 5 1 Restaurants
## 11 2 Pizza
# unique categories w\ count
datafiniti_cat %>%
group_by(Category) %>%
summarize(count = n()) %>%
arrange(desc(count))
## # A tibble: 308 x 2
## Category count
## <chr> <int>
## 1 Pizza Place 2286
## 2 Restaurant 1104
## 3 Pizza 655
## 4 Restaurants 650
## 5 Italian Restaurant 225
## 6 Italian 71
## 7 Caterers 49
## 8 Italian Restaurants 47
## 9 American restaurants 45
## 10 Italian Restaurant and Pizza Place 43
## # ... with 298 more rows
# Clean \ combine categories
datafiniti_cat$Category[datafiniti_cat$Category == "Restaurants"] <- "Restaurant"
datafiniti_cat$Category[datafiniti_cat$Category == "Italian Restaurants"] <- "Italian Restaurant"
datafiniti_cat$Category[datafiniti_cat$Category == "Italian"] <- "Italian Restaurant"
datafiniti_cat$Category[datafiniti_cat$Category == "Italian restaurants"] <- "Italian Restaurant"
datafiniti_cat$Category[datafiniti_cat$Category == " and Italian restaurant"] <- "Italian Restaurant"
datafiniti_cat$Category[datafiniti_cat$Category == "Pizza"] <- "Pizza Place"
datafiniti_cat$Category[datafiniti_cat$Category == "Pizza Restaurants"] <- "Pizza Place"
datafiniti_cat$Category[datafiniti_cat$Category == "Catering"] <- "Caterers"
datafiniti_cat$Category[datafiniti_cat$Category == "American restaurants"] <- "American Restaurant"
datafiniti_cat$Category[datafiniti_cat$Category == "American Restaurants"] <- "American Restaurant"
datafiniti_cat$Category[datafiniti_cat$Category == "wich Place"] <- "Sandwich Place"
datafiniti_cat$Category[datafiniti_cat$Category == " Sandwich Place"] <- "Sandwich Place"
datafiniti_cat$Category[datafiniti_cat$Category == "American restaurants"] <- "American Restaurant"
datafiniti_cat$Category[!str_detect(datafiniti_cat$Category, "^Pizza Place and") &
str_detect(datafiniti_cat$Category, "^Pizza Place ")] <- "Pizza Place"
datafiniti_cat$Category[datafiniti_cat$Category == "Pizza Place and Rock Club Charles North"] <- "Pizza Place and Rock Club"
datafiniti_cat$Category[datafiniti_cat$Category == "Bars"] <- "Bar"
datafiniti_cat$Category[datafiniti_cat$Category == " Bar"] <- "Bar"
datafiniti_cat$Category[datafiniti_cat$Category == " and Bar"] <- "Bar"
datafiniti_cat$Category[datafiniti_cat$Category == " and Pizza Place"] <- "Pizza Place"
datafiniti_cat$Category[datafiniti_cat$Category == " and Diner"] <- "Diner"
datafiniti_cat$Category[datafiniti_cat$Category == "Ne\\\\pizza Place"] <- "Pizza Place"
datafiniti_cat$Category[datafiniti_cat$Category == "Fl\\\\pizza Place"] <- "Pizza Place"
datafiniti_cat$Category[datafiniti_cat$Category == "Fl\\\\pizza Place"] <- "Pizza Place"
#rename category
names(datafiniti_cat)[names(datafiniti_cat) == 'Category'] <- 'new_cat'
datafiniti_cat %>%
group_by(new_cat) %>%
summarize(count = n()) %>%
arrange(desc(count))
## # A tibble: 158 x 2
## new_cat count
## <chr> <int>
## 1 Pizza Place 3121
## 2 Restaurant 1754
## 3 Italian Restaurant 376
## 4 Caterers 75
## 5 American Restaurant 73
## 6 Italian Restaurant and Pizza Place 43
## 7 Bar 41
## 8 Pizza Place and Italian Restaurant 31
## 9 Take Out Restaurants 29
## 10 Sandwich Place 25
## # ... with 148 more rows
# merge w\ new categories & delete original categories
datafiniti_new <- merge(datafiniti_unique, datafiniti_cat, by = 'loc_id')
datafiniti_new <- datafiniti_new %>%
select(-categories)
head(datafiniti_new)
## loc_id name address city country State Code
## 1 1 Shotgun Dans Pizza 4203 E Kiehl Ave Sherwood US AR
## 2 1 Shotgun Dans Pizza 4203 E Kiehl Ave Sherwood US AR
## 3 1 Shotgun Dans Pizza 4203 E Kiehl Ave Sherwood US AR
## 4 1 Shotgun Dans Pizza 4203 E Kiehl Ave Sherwood US AR
## 5 1 Shotgun Dans Pizza 4203 E Kiehl Ave Sherwood US AR
## 6 2 Sauce Pizza Wine 25 E Camelback Rd Phoenix US AZ
## latitude longitude price_range_min price_range_max new_cat
## 1 34.83230 -92.1838 0 25 Pizza Place
## 2 34.83230 -92.1838 0 25 Restaurant
## 3 34.83230 -92.1838 0 25 American Restaurant
## 4 34.83230 -92.1838 0 25 Pizza Place
## 5 34.83230 -92.1838 0 25 Restaurant
## 6 33.50927 -112.0730 0 25 Pizza Place
For the datafiniti data set, we’d like to look at
Histograms / Boxplots of the prices
Look at prices by region and state
Display results on a map (potentially including barstool)
Average prices by category (potentially including barstool)
For the barstool and jared data sets we’d like to look at
Descriptive statistics for ratings
Rating by location / state
Rating by category
Compare ratings by reviewer