In the analysis of these 3 datasets, we are addressing average rating, location, and pricing information of various pizza establishments around the US. The goal is to assist the average American consumer in making informed decisions about which pizza restaurants they would like to patronize.
This analysis is based on 3 datasets with different information about pizza restaurants in the US:
Barstool - Contains critic, public and the barstool staff’s rating as well as pricing, location and geolocation of the different pizza places.
Jared - Contains ratings for New York pizza restaurants on a 6-point likert scale.
Datafiniti - Contains price range, location and keywords for pizza restaurants around the US.
Region - Contains state by region from the US Census Bureau
To learn more about the datasets used, click here or view a more detailed explanation in the Data Preparation section below.
Our approach was to divide the analysis to focus on two groups of consumers: 1) those who are price-conscious and 2) those who are interested in finding the pizza place near them with the highest rating.
library(tidyverse) # data manipulation and cleaning
library(knitr) # data dictionary tables
library(kableExtra) # table formatting
library(readxl) # importing dataset descriptions
library(leaflet) # interactive map & geolocation
The pizza data was originally compiled on September 30th, 2019 as a part of a social data project called TidyTuesday. The original intent is for R users to practice their creativity and analytical techniques, specifically with the tidyverse
package.
For those who are interested, there’s a 5-minute podcast reviewing the top entries and different visualization techniques for Week 7’s project called “Pizza Party”.
We created the regions dataset based on the definitions from the US Census Bureau.
variable | class | description |
---|---|---|
polla_qid | integer | Quiz ID |
answer | character | Answer (likert scale) |
votes | integer | Number of votes for that question/answer combo |
pollq_id | integer | Poll Question ID |
question | character | Question |
place | character | Pizza Place |
time | integer | Time of quiz |
total_votes | integer | Total number of votes for that pizza place |
percent | double | Vote percent of total for that pizza place |
variable | class | description |
---|---|---|
name | character | Pizza place name |
address1 | character | Pizza place address |
city | character | City |
zip | double | Zip |
country | character | Country |
latitude | double | Latitude |
longitude | double | Longitude |
price_level | double | Price rating (smaller = cheaper) |
provider_rating | double | Provider review score |
provider_review_count | double | Provider review count |
review_stats_all_average_score | double | Average Score |
review_stats_all_count | double | Count of all reviews |
review_stats_all_total_score | double | Review total score |
review_stats_community_average_score | double | Community average score |
review_stats_community_count | double | community review count |
review_stats_community_total_score | double | community review total score |
review_stats_critic_average_score | double | Critic average score |
review_stats_critic_count | double | Critic review count |
review_stats_critic_total_score | double | Critic total score |
review_stats_dave_average_score | double | Dave (Barstool) average score |
review_stats_dave_count | double | Dave review count |
review_stats_dave_total_score | double | Dave total score |
variable | class | description |
---|---|---|
polla_qid | integer | Quiz ID |
answer | character | Answer (likert scale) |
votes | integer | Number of votes for that question/answer combo |
pollq_id | integer | Poll Question ID |
question | character | Question |
place | character | Pizza Place |
time | integer | Time of quiz |
total_votes | integer | Total number of votes for that pizza place |
percent | double | Vote percent of total for that pizza place |
colSums(is.na(barstool))
## name address1
## 0 0
## city zip
## 0 0
## country latitude
## 0 2
## longitude price_level
## 2 0
## provider_rating provider_review_count
## 0 0
## review_stats_all_average_score review_stats_all_count
## 0 0
## review_stats_all_total_score review_stats_community_average_score
## 0 0
## review_stats_community_count review_stats_community_total_score
## 0 0
## review_stats_critic_average_score review_stats_critic_count
## 0 0
## review_stats_critic_total_score review_stats_dave_average_score
## 0 0
## review_stats_dave_count review_stats_dave_total_score
## 0 0
colSums(is.na(jared))
## polla_qid answer votes pollq_id question place
## 0 0 0 0 0 0
## time total_votes percent
## 0 0 5
colSums(is.na(datafiniti))
## name address city country
## 0 0 0 0
## province latitude longitude categories
## 0 0 0 0
## price_range_min price_range_max
## 0 0
colSums(is.na(region))
## State State Code Region Division
## 0 0 0 0
The jared
dataset pizza ratings are on a 1 to 6 likert scale. However, barstool is on a 1-10 scale. We used a mutate
function to convert the 1-6 scale to a 1-10 scale:
jared <- jared %>%
mutate(answer = case_when(
.$answer=="Never Again" ~ 0,
.$answer=="Poor" ~ 2,
.$answer=="Fair" ~ 4,
.$answer=="Average" ~ 6,
.$answer=="Good"~ 8,
.$answer=="Excellent" ~ 10))
jared$answer <- as.integer(jared$answer)
str(jared$answer)
## int [1:375] 10 8 6 2 0 10 8 6 2 0 ...
Once the answers are converted to integers, a calculation is done to find the average rating for each pizza place:
jared <- mutate(jared,Weighted_Rating = answer*votes)
(Jared_Average <- jared %>%
group_by(place) %>%
summarise(avg_score = sum(Weighted_Rating)/sum(votes)))
## # A tibble: 56 x 2
## place avg_score
## <chr> <dbl>
## 1 5 Boroughs Pizza 7.33
## 2 Artichoke Basille's Pizza 8
## 3 Arturo's 7.43
## 4 Bella Napoli 7.07
## 5 Ben's of SoHo 14th Street 4.8
## 6 Ben's of SoHo Spring Street 6.44
## 7 Big Slice Pizza 6.27
## 8 Bleecker Street Pizza 8.29
## 9 Bravo Pizza NaN
## 10 Cavallo's Pizza 7.27
## # ... with 46 more rows
(Jared_Mean <- mean(Jared_Average$avg_score, na.rm = T))
## [1] 6.802782
Two columns are dropped for being insignificant:
jared$polla_qid = NULL
jared$time = NULL
head(jared)
This calculation takes the mean of average scores omiting the 0 values
Critic_average <- mean(NA^(barstool$review_stats_critic_average_score == 0)*barstool$review_stats_critic_average_score, na.rm = TRUE)
Daves_average <- mean(NA^(barstool$review_stats_dave_average_score == 0)*barstool$review_stats_dave_average_score, na.rm = TRUE)
Community_average <- mean(NA^(barstool$review_stats_community_average_score == 0)*barstool$review_stats_community_average_score, na.rm = TRUE)
The three calculations below allow to compare to average scores found in the barstool
dataset (Critic, Community and Barstool Staff (Dave)) with the scores in the jared
dataset.
Represents the values that are in both the barstool
and jared
datasets
## [1] "Williamsburg Pizza" "Little Italy Pizza"
## [3] "Steve's Pizza" "Girello"
## [5] "5 Boroughs Pizza" "Saluggi's"
## [7] "Artichoke Basille's Pizza" "Bleecker Street Pizza"
## [9] "Joe's Pizza" "Champion Pizza"
## [11] "Prince Street Pizza" "Kiss My Slice"
## [13] "Arturo's" "Vinny Vincenz"
## [15] "Stella's Pizza" "Gotham Pizza"
## [17] "Highline Pizza" "Pizza Italia"
## [19] "Rivoli Pizza" "NY Pizza Suprema"
## [21] "Previti Pizza" "Rocco's Pizza Joint"
Since there weren’t any NA values in the datafiniti
dataset, we simply removed duplicate values and changed the name of “province”" to “State Code” to align with the column name in the region
dataset.
# Remove duplicates
datafiniti_unique <- unique(datafiniti)
datafiniti_unique # 2,285 unique observations
## # A tibble: 2,285 x 10
## name address city country province latitude longitude categories
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 Shot~ 4203 E~ Sher~ US AR 34.8 -92.2 Pizza,Res~
## 2 Sauc~ 25 E C~ Phoe~ US AZ 33.5 -112. Pizza,Piz~
## 3 Mios~ 3703 P~ Cinc~ US OH 39.1 -84.4 Restauran~
## 4 Hung~ 30495 ~ Madi~ US MI 42.5 -83.1 Pizza,Car~
## 5 Spar~ 3600 E~ Balt~ US MD 39.3 -76.6 Pizza,Ame~
## 6 La V~ 1834 E~ Berk~ US CA 37.9 -122. Pizza Pla~
## 7 Bric~ 4819 K~ Tall~ US FL 30.5 -84.2 Pizza Pla~
## 8 Carr~ 4061 2~ Gran~ US MI 42.9 -85.6 Restauran~
## 9 Dome~ 146 N ~ Glen~ US CA 34.1 -118. Pizza,Res~
## 10 Litt~ 965 N ~ El P~ US TX 31.9 -107. Restauran~
## # ... with 2,275 more rows, and 2 more variables: price_range_min <dbl>,
## # price_range_max <dbl>
# rename province
names(datafiniti_unique)[names(datafiniti_unique) == 'province'] <- 'State Code'
Our next task was to make sense of the categories column, which originally contained values in one column:
head(datafiniti_unique$categories)
## [1] "Pizza,Restaurant,American restaurants,Pizza Place,Restaurants"
## [2] "Pizza,Pizza Place,Restaurants"
## [3] "Restaurant,Pizza Place,Restaurants"
## [4] "Pizza,Carry-out food,Pizza Place,Restaurants"
## [5] "Pizza,American restaurants,Pizza Place,Pizza equipment and supplies,Restaurants"
## [6] "Pizza Place"
We accomplished this by:
## loc_id name
## 1 1 Shotgun Dans Pizza
## 2 2 Sauce Pizza Wine
## 3 3 Mios Pizzeria
## 4 4 Hungry Howies Pizza
## 5 5 Spartan Pizzeria
## 6 6 La Vals
# Separate categories
datafiniti_cat <- datafiniti_unique %>% separate(categories, c("Cat1", "Cat2", "Cat3", "Cat4","Cat5",
"Cat6", "Cat7", "Cat8", "Cat9", "Cat10",
"Cat11"), sep = ",")
## loc_id Cat1 Cat2 Cat3
## 1 1 Pizza Restaurant American restaurants
## 2 2 Pizza Pizza Place Restaurants
## 3 3 Restaurant Pizza Place Restaurants
## 4 4 Pizza Carry-out food Pizza Place
## 5 5 Pizza American restaurants Pizza Place
## 6 6 Pizza Place <NA> <NA>
## Cat4 Cat5 Cat6 Cat7 Cat8 Cat9 Cat10
## 1 Pizza Place Restaurants <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 4 Restaurants <NA> <NA> <NA> <NA> <NA> <NA>
## 5 Pizza equipment and supplies Restaurants <NA> <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## loc_id Category
## 1 1 Pizza
## 2 1 Restaurant
## 3 1 American restaurants
## 4 1 Pizza Place
## 5 1 Restaurants
## 11 2 Pizza
# unique categories w/ count
datafiniti_cat %>%
group_by(Category) %>%
summarize(count = n()) %>%
arrange(desc(count))
## # A tibble: 308 x 2
## Category count
## <chr> <int>
## 1 Pizza Place 2286
## 2 Restaurant 1104
## 3 Pizza 655
## 4 Restaurants 650
## 5 Italian Restaurant 225
## 6 Italian 71
## 7 Caterers 49
## 8 Italian Restaurants 47
## 9 American restaurants 45
## 10 Italian Restaurant and Pizza Place 43
## # ... with 298 more rows
## # A tibble: 312 x 2
## Category count
## <chr> <int>
## 1 Pizza Place 2286
## 2 Restaurant 1104
## 3 Pizza 655
## 4 Restaurants 650
## 5 Italian Restaurant 225
## 6 "Pizza Place " 81
## 7 Italian 71
## 8 " Pizza Place" 53
## 9 Caterers 49
## 10 Italian Restaurants 47
## # ... with 302 more rows
Clean_cat | count |
---|---|
Pizza Place | 3228 |
Restaurant | 1821 |
Italian | 462 |
Other | 153 |
American | 93 |
Bar / Pub | 83 |
Caterer | 75 |
Pizza Restaurant | 46 |
Sandwich Place | 37 |
Delivery / Carry-out | 23 |
Karaoke | 10 |
loc_id | Category | Clean_cat |
---|---|---|
1 | Pizza | Pizza Place |
1 | Restaurant | Restaurant |
1 | American restaurants | American |
1 | Pizza Place | Pizza Place |
1 | Restaurants | Restaurant |
2 | Pizza | Pizza Place |
2 | Pizza Place | Pizza Place |
2 | Restaurants | Restaurant |
3 | Restaurant | Restaurant |
3 | Pizza Place | Pizza Place |
# merge w\ new categories & delete original categories
datafiniti_new <- merge(datafiniti_unique, datafiniti_cat, by = 'loc_id')
datafiniti_new <- datafiniti_new %>%
select(-c(categories, Category))
datafiniti_new <- unique(datafiniti_new)
kable(head(select(datafiniti_new, c(loc_id, name, new_cat)), 10)) %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
loc_id | name | new_cat | |
---|---|---|---|
1 | 1 | Shotgun Dans Pizza | Pizza Place |
2 | 1 | Shotgun Dans Pizza | Restaurant |
3 | 1 | Shotgun Dans Pizza | American |
6 | 2 | Sauce Pizza Wine | Pizza Place |
8 | 2 | Sauce Pizza Wine | Restaurant |
9 | 3 | Mios Pizzeria | Restaurant |
10 | 3 | Mios Pizzeria | Pizza Place |
12 | 4 | Hungry Howies Pizza | Pizza Place |
13 | 4 | Hungry Howies Pizza | Delivery / Carry-out |
15 | 4 | Hungry Howies Pizza | Restaurant |
First, we looked at a brief overview of prices by category:
datafiniti_new %>%
group_by(new_cat) %>%
summarize(Avg_Min_Price = mean(price_range_min), Avg_Max_Price = mean(price_range_max)) %>%
mutate(Range = Avg_Max_Price - Avg_Min_Price) %>%
arrange(Avg_Min_Price)
## # A tibble: 11 x 4
## new_cat Avg_Min_Price Avg_Max_Price Range
## <chr> <dbl> <dbl> <dbl>
## 1 Sandwich Place 2 26.2 24.2
## 2 Pizza Restaurant 2.52 26.6 24.1
## 3 Caterer 4.3 27.6 23.3
## 4 Restaurant 4.41 27.6 23.2
## 5 Delivery / Carry-out 4.55 27.7 23.2
## 6 Pizza Place 4.67 27.8 23.1
## 7 Karaoke 5 28 23
## 8 Other 6.16 28.6 22.4
## 9 American 7.57 29.4 21.8
## 10 Italian 9.74 30.9 21.2
## 11 Bar / Pub 10.8 31.5 20.7
Next, we delved a little deeper into the pricing structure:
# Price Exploration -------------------------------------------------------
# min \ max summary
summary(datafiniti_unique$price_range_min) # 3rd quartile is still 0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 4.67 0.00 50.00
summary(datafiniti_unique$price_range_max) # ranges from 7 to 55
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.0 25.0 25.0 27.8 25.0 55.0
# create spread variable
datafiniti_unique <- datafiniti_unique %>%
mutate(Range = price_range_max - price_range_min)
datafiniti_new <- datafiniti_new %>%
mutate(Range = price_range_max - price_range_min)
summary(datafiniti_unique$Range)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 25.00 25.00 23.13 25.00 34.00
summary(datafiniti_new$Range)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 25.00 25.00 22.95 25.00 34.00
# histogram for each
datafiniti_unique %>%
ggplot(aes(x = price_range_min)) +
geom_histogram(binwidth = 2) +
coord_cartesian(xlim = c(0,40))
datafiniti_unique %>%
ggplot(aes(x = price_range_max)) +
geom_histogram(binwidth = 2) +
coord_cartesian(xlim = c(20,55))
datafiniti_unique %>%
ggplot(aes(x = Range)) +
geom_histogram(binwidth = 2) +
coord_cartesian(xlim = c(10,30))
# boxplot for each
datafiniti_new %>%
ggplot(aes(x = new_cat, y = price_range_min)) +
geom_boxplot() +
coord_flip()
datafiniti_new %>%
ggplot(aes(x = new_cat, y = price_range_max)) +
geom_boxplot() +
coord_flip()
datafiniti_new %>%
ggplot(aes(x = new_cat, y = Range)) +
geom_boxplot() +
coord_flip()
# Price by Region / State ------------------------------------------------------------------
# add Region to datafiniti_unique
datafiniti_unique <-
datafiniti_unique %>%
merge(region, by = "State Code") %>%
select(-(Division))
# Min / max by region
datafiniti_unique %>%
group_by(Region) %>%
summarize(Avg_min_price = mean(price_range_min),
Avg_max_price = mean(price_range_max))
## # A tibble: 4 x 3
## Region Avg_min_price Avg_max_price
## <chr> <dbl> <dbl>
## 1 Midwest 4.35 27.6
## 2 Northeast 3.53 27.0
## 3 South 5.14 28.1
## 4 West 5.75 28.5
# Min / max by state
datafiniti_unique %>%
group_by(State) %>%
summarize(Avg_max_price = mean(price_range_max)) %>%
arrange(desc(Avg_max_price))
## # A tibble: 44 x 2
## State Avg_max_price
## <chr> <dbl>
## 1 Connecticut 40
## 2 South Dakota 32.5
## 3 Iowa 31
## 4 Alabama 30
## 5 Nebraska 30
## 6 Arizona 29.9
## 7 West Virginia 29.3
## 8 Maryland 29.2
## 9 Louisiana 29.1
## 10 Oregon 29.1
## # ... with 34 more rows
datafiniti_unique %>%
group_by(State) %>%
summarize(Avg_min_price = mean(price_range_min)) %>%
arrange(Avg_min_price)
## # A tibble: 44 x 2
## State Avg_min_price
## <chr> <dbl>
## 1 Alaska 0
## 2 Delaware 0
## 3 Hawaii 0
## 4 Montana 0
## 5 New Jersey 0
## 6 North Dakota 0
## 7 Nevada 1.25
## 8 South Carolina 1.85
## 9 Arkansas 1.92
## 10 Kansas 2.08
## # ... with 34 more rows
Locations of the restaurants in the barstool
dataset
m <- leaflet(barstool) %>% addTiles()
m %>% addCircles(lng = ~ barstool$longitude, lat = ~ barstool$latitude, popup = barstool$nameL, weight = 8, color = "#fb3004", stroke = TRUE)
This map shows the barstool restaurants which have been reviewed. The circles make it easy to portray a visual of where most reviews are from. For example, zooming in on New York will show hundreds of restaurants.
ggplot(data = datafiniti, aes(x=price_range_max,y=price_range_min)) +
geom_point(color = "blue", size=2,shape=17)
The three calculations below allow to compare to average scores found in the barstool
dataset (Critic, Community and Barstool Staff (Dave)) with the scores in the jared
dataset.
Pizza_Averages <- c("Dave (Barstool)","Community","Critic","Jared")
y <- c(6.622,7.0846,7.256,6.802)
barplot(y,names.arg = Pizza_Averages,xlab = "Average Review",ylab = "Rating",col = "blue", main = "Pizza Averages")
Our goal was to educate consumers on how they want to spend their money in the pizza industry. We looked at ratings between the barstool and jared datasets, which concluded that the jared pizza average was higher than barstool.The barstool map was shown to see where most restaraunts are. In the datafiniti set we looked at price range and location, which concludes the West has the highest price value. In terms of states, Conneticut has the highest average max price. The median price range is $25.