Problem Statement:
The objective of this project is to recommend top Pizzerias in different price ranges in the vicinity of New York City. User can search the locality and select the Price range to view the top recommendations.
Approach:
Thanks to the #tidytuesday initiative we were able to get our hands on the pizza datasets viz Pizza_Jared, Pizza_Barstool and Pizza_Datafiniti.
After an initial analysis of the above dataset, we found that the maximum number of Pizzerias are in New York City and its neighbourhood. Hence, we focused on New York City data and consolidated all the available ratings into a single metric to recommend top Pizza Places in New York. We leveraged the geo-location and price range information present in Barstool’s and Datafiniti’s data to categorize the Pizza Places according to its location and price range.
We used the following packages to arrive at our recommendations:
library(tidyverse) # Used in data processing and data transformation as well as for data visualization
library(readr) # Used for importing data CSV files
library(GGally) # Used for pairwise plots
These data sets were part of #tidytuesday initiative. More information about it can be found here.
There are three data sets:
Pizza_Jared: Jared’s data is from top NY pizza restaurants, with a 6 point likert scale survey on ratings.
Pizza_Barstool: The Barstool sports dataset has critic, public, and the Barstool Staff’s rating as well as pricing, location, and geo-location.
Pizza_Datafiniti: The Dafiniti includes 10000 pizza places, thier price ranges and location.
First, we import all the datasets in R. Please find below the code for importing all the datasets:
pizza_jared <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-01/pizza_jared.csv")
pizza_barstool <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-01/pizza_barstool.csv")
pizza_datafiniti <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-01/pizza_datafiniti.csv")
Importing US ZIPs to pull State for observations where state information is not available.
The ZIPCODE-STATE mapping was downloaded from simplemaps.com.
uszips <- readr::read_csv('uszips.csv')
In Pizza Barstool dataset we found that:
zip column has some entries which have 4 digit zipcodes, hence it is needed to front pad these zip codes with ‘0’ in order to pull the state information from uszips datasetlatitude and longitude columns. It can be pulled from uszips dataset based on the zip code# Checking the top 10 rows of the dataset
head(pizza_barstool)
#structure of the datset
str(pizza_barstool)
#High level Summary of data
summary(pizza_barstool)
#Checking missing values in each column
colSums(is.na(pizza_barstool))
# Checking Record having missing value
pizza_barstool[rowSums(is.na(pizza_barstool)) > 0,]
# Checking Duplicates in data
count(unique(pizza_barstool))
# Checking count of unique observation in each variable
pizza_barstool %>%
summarise_each(funs(n_distinct))
## # A tibble: 1 x 22
## name address1 city zip country latitude longitude price_level
## <int> <int> <int> <int> <int> <int> <int> <int>
## 1 452 460 99 176 1 458 456 4
## # ... with 14 more variables: provider_rating <int>,
## # provider_review_count <int>, review_stats_all_average_score <int>,
## # review_stats_all_count <int>, review_stats_all_total_score <int>,
## # review_stats_community_average_score <int>,
## # review_stats_community_count <int>,
## # review_stats_community_total_score <int>,
## # review_stats_critic_average_score <int>,
## # review_stats_critic_count <int>,
## # review_stats_critic_total_score <int>,
## # review_stats_dave_average_score <int>, review_stats_dave_count <int>,
## # review_stats_dave_total_score <int>
# Checking Count at Name and address level combined
count(unique(pizza_barstool[c('name','address1')]))
## # A tibble: 1 x 1
## n
## <int>
## 1 463
# Checking unique pizza places
length(unique(tolower(pizza_barstool$name)))
## [1] 451
# Checking cities with maximum pizza places
city_table <- as.data.frame(table(pizza_barstool$city))
city_table_top <- (city_table %>% top_n(10))
ggplot(city_table_top, aes(x = reorder(city_table_top$Var1, -city_table_top$Freq), y = city_table_top$Freq)) + geom_bar(stat = "identity") + labs(title = "Count of Pizza Places per city", x = "City", y = "Number of Pizza Places")
Pizza_Datafiniti is a repository of Pizza Places and contains information regarding address and price range of Pizza Places
# Checking the top 10 rows of the dataset
head(pizza_datafiniti)
#structure of the datset
str(pizza_datafiniti)
#High level Summary of data
summary(pizza_datafiniti)
#Checking missing values in each column
colSums(is.na(pizza_datafiniti))
# Checking Duplicates in data
count(unique(pizza_datafiniti))
# Removing Duplicates
pizza_datafiniti_2 <- unique(pizza_datafiniti)
#qc
count(pizza_datafiniti_2)
# Checking count of unique observations in each variable
pizza_datafiniti_2 %>%
summarise_each(funs(n_distinct))
## # A tibble: 1 x 10
## name address city country province latitude longitude categories
## <int> <int> <int> <int> <int> <int> <int> <int>
## 1 1827 2278 1028 1 44 2284 2284 456
## # ... with 2 more variables: price_range_min <int>, price_range_max <int>
# Checking unique pizza places
length(unique(tolower(pizza_datafiniti_2$name)))
## [1] 1817
# Checking cities with maximum pizza places
city_table <- as.data.frame(table(pizza_datafiniti_2$city))
city_table_top <- (city_table %>% top_n(10))
ggplot(city_table_top, aes(x = reorder(city_table_top$Var1, -city_table_top$Freq), y = city_table_top$Freq)) + geom_bar(stat = "identity") + labs(title = "Count of Pizza Places per city", x = "City", y = "Number of Pizza Places") + expand_limits(y = 100)
After initial data exploration process with Pizza_Jared dataset, we found that:
percent because no survey rating was available for a place called Bravo PizzaPizza_Datafiniti dataset# Checking the top 10 rows of the dataset
head(pizza_jared)
#structure of the datset
str(pizza_jared)
#High level Summary of data
summary(pizza_jared)
#Checking missing values in each column
colSums(is.na(pizza_jared))
# Checking Record having missing value
pizza_jared[rowSums(is.na(pizza_jared)) > 0,]
# Checking Duplicates in data
count(unique(pizza_jared))
polla_qid{Excellent,Good,Average,Fair,Poor,Never Again} having 75 records each, out of which Fair have only one record and Never Again has 74 recordspolla_qid as keypolla_qid, the number of unique places are 56. We cannot arrive at the conclusion that duplicate places are the same or different as we don’t have any information about their addresses.# Checking count of unique observations in each variable
pizza_jared %>%
summarise_each(funs(n_distinct))
## # A tibble: 1 x 9
## polla_qid answer votes pollq_id question place time total_votes percent
## <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 75 6 20 75 56 56 75 30 94
# Checking total count for each of the answers
table(pizza_jared$answer)
##
## Average Excellent Fair Good Never Again Poor
## 75 75 1 75 74 75
pizza_jared dataset and pulling address and pizza range from pizza_datafiniti dataset:# Dropping Percent column and spreading the data
pizza_jared_2 <- (pizza_jared %>%
subset(select = -percent) %>%
spread(answer,votes))
# Pulling the values form Datafiniti
pizza_jared_3 <- merge(x = pizza_jared_2, y = pizza_datafiniti_2, by.x = "place", by.y = "name", all.x = TRUE)
pizza_jared dataset, hence we will be dropping pizza_jared dataset from our analysis# Checking Missing Values in each column
colSums(is.na(pizza_jared_3))
## place polla_qid pollq_id question
## 0 0 0 0
## time total_votes Average Excellent
## 0 0 0 0
## Fair Good Never Again Poor
## 74 0 1 0
## address city country province
## 65 65 65 65
## latitude longitude categories price_range_min
## 65 65 65 65
## price_range_max
## 65
# Displaying data which does not missing value in address
pizza_jared_3[!is.na(pizza_jared_3$address),]
## place polla_qid pollq_id
## 7 Big Slice Pizza 58 58
## 9 Bravo Pizza 29 29
## 13 Dough Boys 61 61
## 22 Gotham Pizza 63 63
## 28 Joe's Pizza 10 10
## 33 Little Italy Pizza 36 36
## 34 Little Italy Pizza 60 60
## 39 New York Pizza Suprema 73 73
## 46 Pizza 33 21 21
## 70 Steve's Pizza 76 76
## question time total_votes Average
## 7 How was Big Slice Pizza? 1516811929 15 5
## 9 How was Bravo Pizza? 1422457179 0 0
## 13 How was Dough Boys? 1523891285 13 7
## 22 How was Gotham Pizza? 1529601987 17 3
## 28 How was Joe's Pizza? 1366908424 17 8
## 33 How was Little Italy Pizza? 1453292409 13 5
## 34 How was Little Italy Pizza? 1520532907 21 4
## 39 How was New York Pizza Suprema? 1556550856 23 6
## 46 How was Pizza 33? 1395248377 6 1
## 70 How was Steve's Pizza? 1566499944 6 2
## Excellent Fair Good Never Again Poor address
## 7 1 NA 6 0 3 4310 Vance Jackson Rd
## 9 0 NA 0 0 0 21 Kugler Rd
## 13 2 NA 4 0 0 1B New Orleans Rd
## 22 6 NA 5 0 3 852 8th Ave
## 28 1 NA 4 1 3 3009a Middletown Rd
## 33 2 NA 4 2 0 2 E 33rd St
## 34 12 NA 4 0 1 2 E 33rd St
## 39 4 NA 9 0 4 413 8th Ave
## 46 1 NA 3 0 1 489 3rd Ave
## 70 1 NA 1 0 2 12010 Biscayne Blvd
## city country province latitude longitude
## 7 San Antonio US TX 29.51304 -98.53492
## 9 Royersford US PA 40.23758 -75.53564
## 13 Hilton Head Island US SC 32.15921 -80.75271
## 22 New York US NY 40.76304 -73.98539
## 28 Bronx US NY 40.84398 -73.83002
## 33 New York US NY 40.74769 -73.98488
## 34 New York US NY 40.74769 -73.98488
## 39 New York US NY 40.75005 -73.99501
## 46 New York US NY 40.74515 -73.97852
## 70 Miami US FL 25.88696 -80.16447
## categories price_range_min price_range_max
## 7 Restaurant,Pizza Place 0 25
## 9 Restaurant,Pizza Place 0 25
## 13 Restaurant,Pizza Place 0 25
## 22 Pizza,Restaurant,Italian,Pizza Place 0 25
## 28 Restaurant,Pizza Place 0 25
## 33 Pizza,Restaurant,Italian,Pizza Place 0 25
## 34 Pizza,Restaurant,Italian,Pizza Place 0 25
## 39 Pizza,Restaurant,Pizza Place 0 25
## 46 Pizza,Restaurant,Pizza Place 0 25
## 70 Restaurant,Pizza Place 0 25
After checking unique values in uszips, we mapped State Name from uszips dataset to pizza_barstool dataset. We were unable to map the state names for 3 records as we did not find a zip-state mapping for them.
#checking the unique values in UsZips data
unique(uszips$zip)
#Storing only zip, State Name and County Name in new dataframe
uszips_1 <- uszips[ , c("zip" , "state_name" , "county_name")]
# Converting zip to numeric format as it is in Pizza Barstool data
uszips_1$zip <- as.numeric(uszips_1$zip)
# Pulling state name from USZIP data to Barstool dataset
pizza_barstool_1 <- merge(x = pizza_barstool, y = uszips_1, by.x = "zip", by.y = "zip", all.x = TRUE)
# Checking count of missing values in the merged data
colSums(is.na(pizza_barstool_1))
# Checking missing state name records
pizza_barstool_1[is.na(pizza_barstool_1$state_name),]
As maximum records are from New York City, we have created a dataset for 286 pizza places which are near the New York City from Pizza_Barstool applying filters shown in code below:
# Sorting the Barstool data by New York city , Latitudes & Longitudes
pizza_barstool_2 <- pizza_barstool_1 %>% filter(state_name == "New York" & between(latitude,39.5,41.5) & between(longitude,-74.5,-72.5))
# Displaying first few records for filtered dataset
head(pizza_barstool_2)
## zip name address1 city country latitude
## 1 10001 10th Avenue Pizza & Cafe 256 10th Ave New York US 40.74875
## 2 10001 Amtrak 234 W 31st St New York US 40.74965
## 3 10001 Company 230 9th Ave New York US 40.74715
## 4 10001 Highline Pizza 503 W 28th St New York US 40.75109
## 5 10001 B & W Deli & Pizzeria 373 W 34th St New York US 40.75346
## 6 10001 Famous Amadeus Pizza 408 8th Ave New York US 40.74981
## longitude price_level provider_rating provider_review_count
## 1 -74.00313 1 3.5 24
## 2 -73.99340 0 3.0 345
## 3 -74.00052 2 3.5 549
## 4 -74.00223 1 4.0 106
## 5 -73.99602 1 3.0 30
## 6 -73.99478 1 3.0 206
## review_stats_all_average_score review_stats_all_count
## 1 5.340000 5
## 2 0.540000 2
## 3 7.200000 1
## 4 4.300000 3
## 5 5.600000 4
## 6 6.042857 7
## review_stats_all_total_score review_stats_community_average_score
## 1 26.70 4.95
## 2 1.08 1.00
## 3 7.20 0.00
## 4 12.90 5.60
## 5 22.40 5.40
## 6 42.30 6.10
## review_stats_community_count review_stats_community_total_score
## 1 4 19.8
## 2 1 1.0
## 3 0 0.0
## 4 2 11.2
## 5 3 16.2
## 6 6 36.6
## review_stats_critic_average_score review_stats_critic_count
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## review_stats_critic_total_score review_stats_dave_average_score
## 1 0 6.90
## 2 0 0.08
## 3 0 7.20
## 4 0 1.70
## 5 0 6.20
## 6 0 5.70
## review_stats_dave_count review_stats_dave_total_score state_name
## 1 1 6.90 New York
## 2 1 0.08 New York
## 3 1 7.20 New York
## 4 1 1.70 New York
## 5 1 6.20 New York
## 6 1 5.70 New York
## county_name
## 1 New York
## 2 New York
## 3 New York
## 4 New York
## 5 New York
## 6 New York
Apart from having Name, Address and Price Range of Pizza Places Barstool’s data had following different types of ratings:
There is also a rating which is a weighted average of Community, Critic and Dave rating. In our EDA phase, we will analyse the correlation between different types of ratings and arrive at a consolidated rating which will be a good reflection of all the other ratings.
# As all records are from New York we will only keep drop city, state_name, country, latitude and longitude. We will keep county name as we will use it afterwords when dividing whole city into different counties
pizza_barstool_3 <- subset(pizza_barstool_2,select = -c(city,country,latitude,longitude,state_name))
# Checking Summary Stats of Numerical columns
summary(pizza_barstool_3)
# Multiplying Provider Rating by 2
pizza_barstool_3$provider_rating <- 2*pizza_barstool_3$provider_rating
# Scatter Plot Matrix for all the ratings
ggpairs(pizza_barstool_3, columns = c(5,7,10,13,16)) + ggtitle("Scatterplots of all the available Ratings")