If you’re heading to New York and in search of good pizza? Check out our exploration of the city’s best Pizza and satisfy your cravings.

Introduction

Problem Statement:
The objective of this project is to recommend top Pizzerias in different price ranges in the vicinity of New York City. User can search the locality and select the Price range to view the top recommendations.

Approach:

Thanks to the #tidytuesday initiative we were able to get our hands on the pizza datasets viz Pizza_Jared, Pizza_Barstool and Pizza_Datafiniti.

After an initial analysis of the above dataset, we found that the maximum number of Pizzerias are in New York City and its neighbourhood. Hence, we focused on New York City data and consolidated all the available ratings into a single metric to recommend top Pizza Places in New York. We leveraged the geo-location and price range information present in Barstool’s and Datafiniti’s data to categorize the Pizza Places according to its location and price range.

Packages Required

We used the following packages to arrive at our recommendations:

  • tidyverse : Used in data processing and data transformation as well as for data visualization
  • readr : Used for importing data CSV files
  • GGally : Used for pairwise plots
library(tidyverse) # Used in data processing and data transformation as well as for data visualization 
library(readr)    # Used for importing data CSV files
library(GGally)   # Used for pairwise plots

Data Preparation

Data Import

These data sets were part of #tidytuesday initiative. More information about it can be found here.

There are three data sets:

  • Pizza_Jared: Jared’s data is from top NY pizza restaurants, with a 6 point likert scale survey on ratings.

  • Pizza_Barstool: The Barstool sports dataset has critic, public, and the Barstool Staff’s rating as well as pricing, location, and geo-location.

  • Pizza_Datafiniti: The Dafiniti includes 10000 pizza places, thier price ranges and location.

First, we import all the datasets in R. Please find below the code for importing all the datasets:

pizza_jared <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-01/pizza_jared.csv")
pizza_barstool <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-01/pizza_barstool.csv")
pizza_datafiniti <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-01/pizza_datafiniti.csv")

Importing US ZIPs to pull State for observations where state information is not available.

The ZIPCODE-STATE mapping was downloaded from simplemaps.com.

uszips <- readr::read_csv('uszips.csv')

Data Exploration and Cleaning

Pizza Barstool Dataset

In Pizza Barstool dataset we found that:

  • There are 463 observations and 22 variables
  • zip column has some entries which have 4 digit zipcodes, hence it is needed to front pad these zip codes with ‘0’ in order to pull the state information from uszips dataset
  • 2 missing values are present in latitude and longitude columns. It can be pulled from uszips dataset based on the zip code
# Checking the top 10 rows of the dataset
head(pizza_barstool)

#structure of the datset
str(pizza_barstool)

#High level Summary of data
summary(pizza_barstool)

#Checking missing values in each column
colSums(is.na(pizza_barstool))
# Checking Record having missing value
pizza_barstool[rowSums(is.na(pizza_barstool)) > 0,]

# Checking Duplicates in data
count(unique(pizza_barstool))
  • There are 451 unique pizza places at 463 different locations
  • 11 Pizza places are present at multiple locations
# Checking count of unique observation in each variable
pizza_barstool %>% 
  summarise_each(funs(n_distinct))
## # A tibble: 1 x 22
##    name address1  city   zip country latitude longitude price_level
##   <int>    <int> <int> <int>   <int>    <int>     <int>       <int>
## 1   452      460    99   176       1      458       456           4
## # ... with 14 more variables: provider_rating <int>,
## #   provider_review_count <int>, review_stats_all_average_score <int>,
## #   review_stats_all_count <int>, review_stats_all_total_score <int>,
## #   review_stats_community_average_score <int>,
## #   review_stats_community_count <int>,
## #   review_stats_community_total_score <int>,
## #   review_stats_critic_average_score <int>,
## #   review_stats_critic_count <int>,
## #   review_stats_critic_total_score <int>,
## #   review_stats_dave_average_score <int>, review_stats_dave_count <int>,
## #   review_stats_dave_total_score <int>
# Checking Count at Name and address level combined
count(unique(pizza_barstool[c('name','address1')]))
## # A tibble: 1 x 1
##       n
##   <int>
## 1   463
# Checking unique pizza places
length(unique(tolower(pizza_barstool$name)))
## [1] 451
  • Maximum Number of Pizza Places (~54%) are from New York City followed by Brooklyn (4%)
# Checking cities with maximum pizza places
city_table <- as.data.frame(table(pizza_barstool$city))
city_table_top <- (city_table %>% top_n(10))
ggplot(city_table_top, aes(x = reorder(city_table_top$Var1, -city_table_top$Freq), y = city_table_top$Freq)) + geom_bar(stat = "identity") + labs(title = "Count of Pizza Places per city", x = "City", y = "Number of Pizza Places")

Pizza Datafiniti Dataset

Pizza_Datafiniti is a repository of Pizza Places and contains information regarding address and price range of Pizza Places

  • There are 10,000 observations 10 variables
  • Out of 10,000 observations only 2,285 observations are unique
# Checking the top 10 rows of the dataset
head(pizza_datafiniti)

#structure of the datset
str(pizza_datafiniti)

#High level Summary of data
summary(pizza_datafiniti)

#Checking missing values in each column
colSums(is.na(pizza_datafiniti))

# Checking Duplicates in data
count(unique(pizza_datafiniti))

# Removing Duplicates
pizza_datafiniti_2 <- unique(pizza_datafiniti)

#qc
count(pizza_datafiniti_2)
  • There are 1,817 pizza places located at 2,278 different locations
# Checking count of unique observations in each variable
pizza_datafiniti_2 %>% 
  summarise_each(funs(n_distinct))
## # A tibble: 1 x 10
##    name address  city country province latitude longitude categories
##   <int>   <int> <int>   <int>    <int>    <int>     <int>      <int>
## 1  1827    2278  1028       1       44     2284      2284        456
## # ... with 2 more variables: price_range_min <int>, price_range_max <int>
# Checking unique pizza places
length(unique(tolower(pizza_datafiniti_2$name)))
## [1] 1817
  • The maximum number of Pizza Places are from New York, followed by Brooklyn
# Checking cities with maximum pizza places
city_table <- as.data.frame(table(pizza_datafiniti_2$city))
city_table_top <- (city_table %>% top_n(10))
ggplot(city_table_top, aes(x = reorder(city_table_top$Var1, -city_table_top$Freq), y = city_table_top$Freq)) + geom_bar(stat = "identity") + labs(title = "Count of Pizza Places per city", x = "City", y = "Number of Pizza Places") + expand_limits(y = 100)

Pizza Jared Dataset

After initial data exploration process with Pizza_Jared dataset, we found that:

  • There are 375 observation and 9 variables
  • There are 5 missing values in percent because no survey rating was available for a place called Bravo Pizza
  • The dataset does not have any information regarding the address and price range, hence it is needed to pull this information from Pizza_Datafiniti dataset
# Checking the top 10 rows of the dataset
head(pizza_jared)

#structure of the datset
str(pizza_jared)

#High level Summary of data
summary(pizza_jared)

#Checking missing values in each column
colSums(is.na(pizza_jared))
# Checking Record having missing value
pizza_jared[rowSums(is.na(pizza_jared)) > 0,]

# Checking Duplicates in data
count(unique(pizza_jared))
  • There are 5 types of Ratings for 75 different polla_qid
  • There are 6 different types of rating {Excellent,Good,Average,Fair,Poor,Never Again} having 75 records each, out of which Fair have only one record and Never Again has 74 records
  • The data is in long format and we need to spread the using polla_qid as key
  • Despite having 75 polla_qid, the number of unique places are 56. We cannot arrive at the conclusion that duplicate places are the same or different as we don’t have any information about their addresses.
# Checking count of unique observations in each variable
pizza_jared %>% 
  summarise_each(funs(n_distinct))
## # A tibble: 1 x 9
##   polla_qid answer votes pollq_id question place  time total_votes percent
##       <int>  <int> <int>    <int>    <int> <int> <int>       <int>   <int>
## 1        75      6    20       75       56    56    75          30      94
# Checking total count for each of the answers
table(pizza_jared$answer)
## 
##     Average   Excellent        Fair        Good Never Again        Poor 
##          75          75           1          75          74          75
  • Spreading the pizza_jared dataset and pulling address and pizza range from pizza_datafiniti dataset:
# Dropping Percent column and spreading the data
pizza_jared_2 <- (pizza_jared %>% 
                    subset(select = -percent) %>% 
                      spread(answer,votes))

# Pulling the values form Datafiniti
pizza_jared_3 <- merge(x = pizza_jared_2, y = pizza_datafiniti_2, by.x = "place", by.y = "name", all.x = TRUE)
  • Only 10 records have address and pricing information from the pizza_jared dataset, hence we will be dropping pizza_jared dataset from our analysis
# Checking Missing Values in each column
colSums(is.na(pizza_jared_3))
##           place       polla_qid        pollq_id        question 
##               0               0               0               0 
##            time     total_votes         Average       Excellent 
##               0               0               0               0 
##            Fair            Good     Never Again            Poor 
##              74               0               1               0 
##         address            city         country        province 
##              65              65              65              65 
##        latitude       longitude      categories price_range_min 
##              65              65              65              65 
## price_range_max 
##              65
# Displaying data which does not missing value in address
pizza_jared_3[!is.na(pizza_jared_3$address),]
##                     place polla_qid pollq_id
## 7         Big Slice Pizza        58       58
## 9             Bravo Pizza        29       29
## 13             Dough Boys        61       61
## 22           Gotham Pizza        63       63
## 28            Joe's Pizza        10       10
## 33     Little Italy Pizza        36       36
## 34     Little Italy Pizza        60       60
## 39 New York Pizza Suprema        73       73
## 46               Pizza 33        21       21
## 70          Steve's Pizza        76       76
##                           question       time total_votes Average
## 7         How was Big Slice Pizza? 1516811929          15       5
## 9             How was Bravo Pizza? 1422457179           0       0
## 13             How was Dough Boys? 1523891285          13       7
## 22           How was Gotham Pizza? 1529601987          17       3
## 28            How was Joe's Pizza? 1366908424          17       8
## 33     How was Little Italy Pizza? 1453292409          13       5
## 34     How was Little Italy Pizza? 1520532907          21       4
## 39 How was New York Pizza Suprema? 1556550856          23       6
## 46               How was Pizza 33? 1395248377           6       1
## 70          How was Steve's Pizza? 1566499944           6       2
##    Excellent Fair Good Never Again Poor               address
## 7          1   NA    6           0    3 4310 Vance Jackson Rd
## 9          0   NA    0           0    0          21 Kugler Rd
## 13         2   NA    4           0    0     1B New Orleans Rd
## 22         6   NA    5           0    3           852 8th Ave
## 28         1   NA    4           1    3   3009a Middletown Rd
## 33         2   NA    4           2    0           2 E 33rd St
## 34        12   NA    4           0    1           2 E 33rd St
## 39         4   NA    9           0    4           413 8th Ave
## 46         1   NA    3           0    1           489 3rd Ave
## 70         1   NA    1           0    2   12010 Biscayne Blvd
##                  city country province latitude longitude
## 7         San Antonio      US       TX 29.51304 -98.53492
## 9          Royersford      US       PA 40.23758 -75.53564
## 13 Hilton Head Island      US       SC 32.15921 -80.75271
## 22           New York      US       NY 40.76304 -73.98539
## 28              Bronx      US       NY 40.84398 -73.83002
## 33           New York      US       NY 40.74769 -73.98488
## 34           New York      US       NY 40.74769 -73.98488
## 39           New York      US       NY 40.75005 -73.99501
## 46           New York      US       NY 40.74515 -73.97852
## 70              Miami      US       FL 25.88696 -80.16447
##                              categories price_range_min price_range_max
## 7                Restaurant,Pizza Place               0              25
## 9                Restaurant,Pizza Place               0              25
## 13               Restaurant,Pizza Place               0              25
## 22 Pizza,Restaurant,Italian,Pizza Place               0              25
## 28               Restaurant,Pizza Place               0              25
## 33 Pizza,Restaurant,Italian,Pizza Place               0              25
## 34 Pizza,Restaurant,Italian,Pizza Place               0              25
## 39         Pizza,Restaurant,Pizza Place               0              25
## 46         Pizza,Restaurant,Pizza Place               0              25
## 70               Restaurant,Pizza Place               0              25

Data Manupulation

After checking unique values in uszips, we mapped State Name from uszips dataset to pizza_barstool dataset. We were unable to map the state names for 3 records as we did not find a zip-state mapping for them.

#checking the unique values in UsZips data
unique(uszips$zip)

#Storing only zip, State Name and County Name in new dataframe
uszips_1 <- uszips[ , c("zip" , "state_name" , "county_name")]

# Converting zip to numeric format as it is in Pizza Barstool data
uszips_1$zip <- as.numeric(uszips_1$zip)

# Pulling state name from USZIP data to Barstool dataset

pizza_barstool_1 <- merge(x = pizza_barstool, y = uszips_1, by.x = "zip", by.y = "zip", all.x = TRUE)

# Checking count of missing values in the merged data 
colSums(is.na(pizza_barstool_1))

# Checking missing state name records
pizza_barstool_1[is.na(pizza_barstool_1$state_name),]

As maximum records are from New York City, we have created a dataset for 286 pizza places which are near the New York City from Pizza_Barstool applying filters shown in code below:

# Sorting the Barstool data by New York city , Latitudes & Longitudes
pizza_barstool_2 <- pizza_barstool_1 %>% filter(state_name == "New York" & between(latitude,39.5,41.5) & between(longitude,-74.5,-72.5))
# Displaying first few records for filtered dataset
head(pizza_barstool_2)                       
##     zip                     name      address1     city country latitude
## 1 10001 10th Avenue Pizza & Cafe  256 10th Ave New York      US 40.74875
## 2 10001                   Amtrak 234 W 31st St New York      US 40.74965
## 3 10001                  Company   230 9th Ave New York      US 40.74715
## 4 10001           Highline Pizza 503 W 28th St New York      US 40.75109
## 5 10001    B & W Deli & Pizzeria 373 W 34th St New York      US 40.75346
## 6 10001     Famous Amadeus Pizza   408 8th Ave New York      US 40.74981
##   longitude price_level provider_rating provider_review_count
## 1 -74.00313           1             3.5                    24
## 2 -73.99340           0             3.0                   345
## 3 -74.00052           2             3.5                   549
## 4 -74.00223           1             4.0                   106
## 5 -73.99602           1             3.0                    30
## 6 -73.99478           1             3.0                   206
##   review_stats_all_average_score review_stats_all_count
## 1                       5.340000                      5
## 2                       0.540000                      2
## 3                       7.200000                      1
## 4                       4.300000                      3
## 5                       5.600000                      4
## 6                       6.042857                      7
##   review_stats_all_total_score review_stats_community_average_score
## 1                        26.70                                 4.95
## 2                         1.08                                 1.00
## 3                         7.20                                 0.00
## 4                        12.90                                 5.60
## 5                        22.40                                 5.40
## 6                        42.30                                 6.10
##   review_stats_community_count review_stats_community_total_score
## 1                            4                               19.8
## 2                            1                                1.0
## 3                            0                                0.0
## 4                            2                               11.2
## 5                            3                               16.2
## 6                            6                               36.6
##   review_stats_critic_average_score review_stats_critic_count
## 1                                 0                         0
## 2                                 0                         0
## 3                                 0                         0
## 4                                 0                         0
## 5                                 0                         0
## 6                                 0                         0
##   review_stats_critic_total_score review_stats_dave_average_score
## 1                               0                            6.90
## 2                               0                            0.08
## 3                               0                            7.20
## 4                               0                            1.70
## 5                               0                            6.20
## 6                               0                            5.70
##   review_stats_dave_count review_stats_dave_total_score state_name
## 1                       1                          6.90   New York
## 2                       1                          0.08   New York
## 3                       1                          7.20   New York
## 4                       1                          1.70   New York
## 5                       1                          6.20   New York
## 6                       1                          5.70   New York
##   county_name
## 1    New York
## 2    New York
## 3    New York
## 4    New York
## 5    New York
## 6    New York

Exploratory Data Analysis

Apart from having Name, Address and Price Range of Pizza Places Barstool’s data had following different types of ratings:

  • Provider Rating: Pizza Rating available on yelp. All the other ratings are in the scale of 10 except this one, hence we will convert this rating to the scale of 10 by multiplying it by 2
  • Community Rating: Pizza rating of One Bite’s Application User
  • Critic Rating: Pizza Rating is given by Critics
  • Dave Rating: Review Score that Dave Portnoy, Barstool’s reviewer, gave to the location

There is also a rating which is a weighted average of Community, Critic and Dave rating. In our EDA phase, we will analyse the correlation between different types of ratings and arrive at a consolidated rating which will be a good reflection of all the other ratings.

# As all records are from New York we will only keep drop city, state_name, country, latitude and longitude. We will keep county name as we will use it afterwords when dividing whole city into different counties

pizza_barstool_3 <- subset(pizza_barstool_2,select = -c(city,country,latitude,longitude,state_name))

# Checking Summary Stats of Numerical columns
summary(pizza_barstool_3)

# Multiplying Provider Rating by 2
pizza_barstool_3$provider_rating <- 2*pizza_barstool_3$provider_rating

Checking Correlation Between Different Types of Rating

  • Most of the ratings are left-skewed except Critic Average. This is because Critics have rated only 47 pizza places out of 286 places
  • Community Average, too, have many observations where no rating is available hence we have a small peak in its distribution near 0
  • Stats all average is a weighted average of Critic’s score, community’s score and Dave’s score and it also have good correlation(Corr: 0.847) with Dave’s score. Hence Stats all average can be a good metric for the representation of the combination of Critic’s score, community’s score and Dave’s score
  • The correlation between Provider Rating, which is a rating of Pizza Places taken from yelp’s website, and Stats all average is not significant (Corr: 0.407)
# Scatter Plot Matrix for all the ratings
ggpairs(pizza_barstool_3, columns = c(5,7,10,13,16)) + ggtitle("Scatterplots of all the available Ratings")

Next Steps

  • Comparing ranks for restaurants as per Provider Rating & Stats all average, specifically looking for top 20 places as per both ratings and then plotting slope chart to compare the two ratings
  • Categorize county wise pizza places according to their price range
  • Preparing interactive dashboard showing top recommendations for each county as per different price range