Problem Statement:
The objective of this project is to recommend top Pizzerias in the New York City and provide analysis on Pizza Restaurants in NYC based on it’s location and price range.
Approach:
Thanks to the #tidytuesday initiative we were able to get our hands on the pizza datasets viz Pizza_Jared, Pizza_Barstool and Pizza_Datafiniti.
After an initial analysis of the above dataset, we found that the maximum number of Pizzerias are in New York City and its neighbourhood. Hence, we focused on New York City data and consolidated all the available ratings into a single metric to recommend top Pizza Places in New York. We leveraged the geo-location and price range information present in Barstool’s to categorize the Pizza Places according to its location and price range.
We used the following packages to arrive at our recommendations:
library(tidyverse) # Used in data processing and data transformation as well as for data visualization
library(readr) # Used for importing data CSV files
library(GGally) # Used for pairwise plots
library(DT) # Used for displaying table in HTML
library(leaflet) # Used to add Interactive Map
library(glue) # Used for Concatenating Name and Count in Graphs
These data sets were part of #tidytuesday initiative. More information about it can be found here.
There are three data sets:
Pizza_Jared: Jared’s data is from top NY pizza restaurants, with a 6-point likert scale survey on ratings.
Pizza_Barstool: The Barstool sports dataset has critic, public, and the Barstool Staff’s rating as well as pricing, location, and geo-location.
Pizza_Datafiniti: The Dafiniti includes 10000 pizza places, their price ranges and location.
First, we import all the datasets in R. Please find below the code for importing all the datasets:
pizza_jared <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-01/pizza_jared.csv")
pizza_barstool <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-01/pizza_barstool.csv")
pizza_datafiniti <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-01/pizza_datafiniti.csv")
Importing US ZIPs to pull State for observations where state information is not available.
The ZIPCODE-STATE mapping was downloaded from simplemaps.com.
uszips <- readr::read_csv('uszips.csv')
Importing ZIP corresponding to boroughs in New York City. We will be using these to present borough wise analysis. Information of ZIP-borough was scraped from the following website: nycbynatives.com
nycboroughs <- readr::read_csv('nycboroughs.csv')
In Pizza Barstool dataset we found that:
zip column has some entries which have 4-digit zip codes, hence it is needed to front pad these zip codes with ‘0’ in order to pull the state information from uszips datasetlatitude and longitude columns. It can be pulled from uszips dataset based on the zip code# Checking the top 10 rows of the dataset
head(pizza_barstool)
#structure of the dataset
str(pizza_barstool)
#High level Summary of data
summary(pizza_barstool)
#Checking missing values in each column
colSums(is.na(pizza_barstool))
# Checking Record having missing value
pizza_barstool[rowSums(is.na(pizza_barstool)) > 0,]
# Checking Duplicates in data
count(unique(pizza_barstool))
# Checking count of unique observation in each variable
pizza_barstool %>%
summarise_each(funs(n_distinct))
# Checking Count at Name and address level combined
count(unique(pizza_barstool[c('name','address1')]))
# Checking unique pizza places
length(unique(tolower(pizza_barstool$name)))
## [1] 451
# Checking cities with maximum pizza places
city_table <- as.data.frame(table(pizza_barstool$city))
city_table_top <- (city_table %>% top_n(10))
ggplot(city_table_top,
aes(x = (reorder(city_table_top$Var1,
city_table_top$Freq)),
y = city_table_top$Freq)) +
geom_bar(stat = "identity") +
labs(title = "Which City have maximum number of Pizza Places?", subtitle = "New York City has maximum number of Pizza Places", x = "City", y = "Number of Pizza Places") +
coord_flip()
Pizza_Datafiniti is a repository of Pizza Places and contains information regarding address and price range of Pizza Places
# Checking the top 10 rows of the dataset
head(pizza_datafiniti)
#structure of the dataset
str(pizza_datafiniti)
#High level Summary of data
summary(pizza_datafiniti)
#Checking missing values in each column
colSums(is.na(pizza_datafiniti))
# Checking Duplicates in data
count(unique(pizza_datafiniti))
# Removing Duplicates
pizza_datafiniti_2 <- unique(pizza_datafiniti)
#qc
count(pizza_datafiniti_2)
# Checking count of unique observations in each variable
pizza_datafiniti_2 %>%
summarise_each(funs(n_distinct))
# Checking unique pizza places
length(unique(tolower(pizza_datafiniti_2$name)))
## [1] 1817
# Checking cities with maximum pizza places
city_table <- as.data.frame(table(pizza_datafiniti_2$city))
city_table_top <- (city_table %>% top_n(10))
ggplot(city_table_top, aes(x = reorder(city_table_top$Var1, city_table_top$Freq), y = city_table_top$Freq)) + geom_bar(stat = "identity") + labs(title = "Which City have maximum number of Pizza Places?", subtitle = "New York City has maximum number of Pizza Places", x = "City", y = "Number of Pizza Places") + expand_limits(y = 100) +
coord_flip()
After initial data exploration process with Pizza_Jared dataset, we found that:
percent because no survey rating was available for a place called Bravo PizzaPizza_Datafiniti dataset# Checking the top 10 rows of the dataset
head(pizza_jared)
#structure of the dataset
str(pizza_jared)
#High level Summary of data
summary(pizza_jared)
#Checking missing values in each column
colSums(is.na(pizza_jared))
# Checking Record having missing value
pizza_jared[rowSums(is.na(pizza_jared)) > 0,]
# Checking Duplicates in data
count(unique(pizza_jared))
polla_qid{Excellent,Good,Average,Fair,Poor,Never Again} having 75 records each, out of which Fair have only one record and Never Again has 74 recordspolla_qid as keypolla_qid, the number of unique places is 56. We cannot arrive at the conclusion that duplicate places are the same or different as we don’t have any information about their addresses.# Checking count of unique observations in each variable
pizza_jared %>%
summarise_each(funs(n_distinct))
# Checking total count for each of the answers
table(pizza_jared$answer)
##
## Average Excellent Fair Good Never Again Poor
## 75 75 1 75 74 75
pizza_jared dataset and pulling address and pizza range from pizza_datafiniti dataset:# Dropping Percent column and spreading the data
pizza_jared_2 <- (pizza_jared %>%
subset(select = -percent) %>%
spread(answer,votes))
# Pulling the values form Datafiniti
pizza_jared_3 <- merge(x = pizza_jared_2, y = pizza_datafiniti_2, by.x = "place", by.y = "name", all.x = TRUE)
pizza_jared dataset, hence we will be dropping pizza_jared dataset from our analysis# Checking Missing Values in each column
colSums(is.na(pizza_jared_3))
## place polla_qid pollq_id question time
## 0 0 0 0 0
## total_votes Average Excellent Fair Good
## 0 0 0 74 0
## Never Again Poor address city country
## 1 0 65 65 65
## province latitude longitude categories price_range_min
## 65 65 65 65 65
## price_range_max
## 65
# Displaying data which does not missing value in address
pizza_jared_3[!is.na(pizza_jared_3$address),]
After checking unique values in uszips, we mapped State Name from uszips dataset to pizza_barstool dataset. We were unable to map the state names for 3 records as we did not find a zip-state mapping for them.
#checking the unique values in UsZips data
unique(uszips$zip)
# Keeping only unique values in nycboroughs
nycboroughs <- unique(nycboroughs)
#Storing only zip, State Name and County Name in new dataframe
uszips_1 <- uszips[ , c("zip" , "state_name" , "county_name")]
# Converting zip to numeric format as it is in Pizza Barstool data
uszips_1$zip <- as.numeric(uszips_1$zip)
# Pulling state name from USZIP data to Barstool dataset
pizza_barstool_1 <- merge(x = pizza_barstool, y = uszips_1, by.x = "zip", by.y = "zip", all.x = TRUE)
pizza_barstool_1 <- merge(x = pizza_barstool_1, y = nycboroughs, by.x = "zip", by.y = "zip", all.x = TRUE)
# Checking count of missing values in the merged data
colSums(is.na(pizza_barstool_1))
# Checking missing state name records
pizza_barstool_1[is.na(pizza_barstool_1$state_name),]
As maximum records are from New York City, we have created a dataset for 286 pizza places which are near the New York City from Pizza_Barstool applying filters shown in code below:
# Sorting the Barstool data by New York city , Latitudes & Longitudes
pizza_barstool_2 <- pizza_barstool_1 %>% filter(state_name == "New York" & between(latitude,39.5,41.5) & between(longitude,-74.5,-72.5))
Apart from having Name, Address and Price Range of Pizza Places Barstool’s data had following different types of ratings:
There is also a rating which is a weighted average of Community, Critic and Dave rating. In our EDA phase, we will analyze the correlation between different types of ratings and arrive at a consolidated rating which will be a good reflection of all the other ratings.
# As all records are from New York we will only keep drop city, state_name, country, latitude and longitude. We will keep county name as we will use it afterwards when dividing whole city into different counties
pizza_barstool_3 <- subset(pizza_barstool_2,select = -c(city,country,state_name))
# Rounding off the Ratings to two decimal places
pizza_barstool_3[, c(7,9,12,15,18)] <- round(pizza_barstool_3[, c(7,9,12,15,18)], digits = 2)
# Checking Summary Stats of Numerical columns
summary(pizza_barstool_3)
# Multiplying Provider Rating by 2
pizza_barstool_3$provider_rating <- 2*pizza_barstool_3$provider_rating
# Removing "Review Stats" from column name in order to increase readability
pizza_barstool_3 <- rename_all(pizza_barstool_3, ~str_remove(.,"review_stats_"))
Here is the preview of final Pizza Barstool dataset which we will use for our analysis
DT::datatable(pizza_barstool_3) # requires DT package
# Scatter Plot Matrix for all the ratings
ggpairs(pizza_barstool_3, columns = c(7,9,12,15,18)) + ggtitle("How are the ratings related to each other?")
The number of reviews in Provider Rating are higher than that of all average score, hence we will only compare reviews for Pizza Places which have count of all reviews greater than 30. We found that:
Moving forward, we would like to use All Average Score for our analysis:
pizza_barstool_3 %>%
filter(all_count >= 30) %>%
ggplot(aes(provider_rating, all_average_score)) +
geom_point() +
geom_abline(color = "red") +
geom_smooth(method = "lm") +
labs(#size = "# of community reviews",
x = "Provider Rating",
y = "All average score",
title = "Does Provider Rating agree with all average score?")
Top 20 Pizza places in the New York City which have more than 30 review counts:
pizza_barstool_3 %>%
filter(all_count >= 30) %>%
top_n(20, all_average_score) %>%
mutate(name = fct_reorder(name, all_average_score)) %>%
ggplot(aes(all_average_score, name, size = all_count)) +
geom_point() +
labs(x = "Average rating",
y = "",
size = "# of reviews",
title = "Top 20 Pizza Places according to Barstool Sports ratings",
subtitle = "Only places with at least 30 reviews")
You can locate the top 20 Pizza Places in New York City in the following map:
top_20_pizza <- pizza_barstool_3 %>%
filter(all_count >= 30) %>%
top_n(20, all_average_score)
m <- leaflet(top_20_pizza) %>% addTiles()
m %>% addMarkers(lng = ~ top_20_pizza$longitude, lat = ~ top_20_pizza$latitude, popup = paste("Name: " ,top_20_pizza$name , "<br>","Price_Level: ", top_20_pizza$price_level, "<br>", "Rating: ", top_20_pizza$all_average_score))
Manhattan has maximum number of Pizza Places, whereas Brooklyn have highly rated Pizza Places
pizza_barstool_3 %>%
filter(Boroughs != 'NA') %>%
add_count(Boroughs) %>%
mutate(Boroughs = glue::glue("{ Boroughs } ({ n })")) %>%
ggplot(aes(Boroughs, all_average_score)) +
geom_boxplot() +
geom_jitter() +
labs(title = "Do pizza ratings differ across Boroughs?")
datatable(pizza_barstool_3 %>%
filter(all_count >= 30, Boroughs == "Brooklyn") %>%
top_n(5, all_average_score) %>%
arrange(order(-all_average_score)) %>%
select(c("name","address1","price_level","all_average_score")))
datatable(pizza_barstool_3 %>%
filter(all_count >= 30, Boroughs == "Manhattan") %>%
top_n(5, all_average_score) %>%
arrange(order(-all_average_score)) %>%
select(c("name","address1","price_level","all_average_score")))
pizza_barstool_3 %>%
filter(price_level != 'NA') %>%
add_count(price_level) %>%
mutate(price_level = glue::glue("{ price_level } ({ n })")) %>%
ggplot(aes(price_level, all_average_score)) +
geom_boxplot() +
geom_jitter() +
labs(title = "Do pizza ratings differ across different Price Levels?")
You can find below the Map of all the Pizza Places in New York City along with their All Average Score and Price Level
m <- leaflet(pizza_barstool_3) %>% addTiles()
m %>% addMarkers(lng = ~ pizza_barstool_3$longitude, lat = ~ pizza_barstool_3$latitude, popup = paste("Name: " ,pizza_barstool_3$name , "<br>","Price_Level: ", pizza_barstool_3$price_level, "<br>", "Rating: ", pizza_barstool_3$all_average_score))
Pizza Datafiniti dataset, giving information on Location and Price Level across the US