This is for Pizza Lovers

Our Analysis

About me and the Data we will be working on

About me

Hi I am Sahil and an aspiring Data Analyst. This is the very begining of my career where I am planning to be an efficient R coder. To know more about me you can find it out at -Sahil Kala

Our Data

As a part of the project, my professon Justin Jordey provided us with a list of available data to choose from. And since I love Pizza, I chose this data set which will enable you and me to find the best Pizza place in NY! The data has indepth information about Restaurants. Like - Ratings, reviews, location and price ranges. In our analysis - We will be using R and its various packages to drill deep into the data and find out information which will enable foodies like you and me to get the most out of our hard earned money(or just save you parents money :p) Isn’t is Great?

Brief Overview

The data we are using n the project was part of TidyTuesday which is a weekly social data project in R. Here people post dataset and the community use it to make great projects and bring out best analysis and insights.

Here we have got three CSV files -

Pizza_Barstool.csv

The Barstool sports dataset has critic, public, and the Barstool Staff’s rating as well as pricing, location, and geo-location.
This csv contains 464 rows and 22 columns
Unknown values are shown by NA
If a review is not available or not done, it gets a score of 0. For example if a Restaurant is not reviewed by a critic then it gets a score of 0 and count as 0. So, a score of 0 doesnt mean it is a bad restaurant

Pizza_Jared.csv

Jared’s data is from top NY pizza restaurants, with a 6 point likert scale survey on ratings.
This csv contains 76 rows and 9 columns

Pizza_Datafinity.csv

This includes 10000 pizza places, their price ranges and geo-locations.
It has 10000 rows and 10 columns
Looking at the data we found out the there are many duplicates
We also found out that restaurant’s categories are combined into a single column

To know more about the dataset and the authors - Click Here

The approach

Every dataset can provide a ton of ionformation. Hence, so does our dataset. But looking at the time constraint and amount of R knowledge I have(I will keep on learning more :) ), we will still to a basic data exploration. But, just because I am saying basic doesnt mean it wont have any impact on consumer knowledge.

Problem and the solution!

So, you love pizza and want to eat the best when you are in NYC? The bad news is your tummy has limited capacity and once you end up at a bad pizza place you will regret and so does your stomach.

Now the question is how to find those amazing pizza places? The simplest answer is follow the reviews and ratings the restaurants have received by our fellow pizza lovers, critics and companies like Barstool.

Also, people like Jack, who is a vegetarian (seriously that sound so bland ). Anyway he loves pizza too but doesnt get the veg options at all the places. And since he is going to NYC, we will find a place for him to satisfy his pizza craving.

Rating

Our data consist of a set of ratings for most of the pizza places around US, but we will limit our knowledge to NYC. You will learn more about our data cleaning and other trasformation later. But for now as part of our solution, we will try to find out the top 5 restaurants based on diffrent ratings and also try to find out if diffrent rating system has same results.

The choice of visualization would be simple bar charts which convey information very well to the readers.

Price

We will also try to find out whether top restaurants are expensive or not. Because students like me who love pizza might not be able to afford those good pizzas. Here, we will check the price level for the top 5 restaurants based on average ratings. The choice of plot would be scatter plot.

Libraries needed

To go ahead with project analysis we would be using a few libraries which will help us through tasks like 1. Data Cleaning 2. Make great Visualsation 3. Data manipulation

Let’s first load the libraries needed

#  "dplyr",        for data wrangling
#  "ggplot2",      for drawing nicer graphics
#  "Tidyverse"     for clean, process, model, and visualize data
#  "stringr"       for Simple, Consistent Wrappers for Common String Operations
#  "DT"            for the JavaScript Library 'DataTables'
#  "gridExtra"     for arranging charts'

# Loading required (CRAN) packages
library(dplyr)
library(ggplot2)
library(tidyverse)
library(DT)
library(stringr)
library(gridExtra)

Data Import

In today’s world there are myraid ways of capturing the data and is done at various levels. And this results in data which is untidy or you can say ‘dirty’. Any data science/ analytics project cannot be a success if the individual(s) do not pay attention to the data being worked on. The first and foremost step of any project is to know in and out of the data throughly. Individuals needs to know what that data signifies and what level of importance each variable has. Applying data cleaning techniques can make your project so much easier, it will make your predictions and analysis so much easier and thus in turn will be useful for the user.

So, now we know the secret of a successful project! Let’s apply the data cleaning tips and tricks learnt in class on our data set and be on a journey to find the best Pizza in the world, I mean in NYC :)

Data Import Before we start cleaning, we first need to import our data set(s) in this project. (Be sure to have the source csv files avaliable in the project folder)

barstool <- read.csv(file= 'pizza_barstool.csv')
jared_data <- read.csv(file='pizza_jared.csv')
datafiniti <- read.csv(file='pizza_datafiniti.csv')

Also, let’s take a glimpse of the data

Barstool data

name	address1	city	zip	country	latitude	longitude	price_level	provider_rating	provider_review_count	review_stats_all_average_score	review_stats_all_count	review_stats_all_total_score	review_stats_community_average_score	review_stats_community_count	review_stats_community_total_score	review_stats_critic_average_score	review_stats_critic_count	review_stats_critic_total_score	review_stats_dave_average_score	review_stats_dave_count	review_stats_dave_total_score
Pugsley’s Pizza	590 E 191st St	Bronx	10458	US	40.85877	-73.88484	1	4.5	121	8.011111	27	216.3	7.992000	25	199.8	8.8	1	8.8	7.7	1	7.7
Williamsburg Pizza	265 Union Ave	Brooklyn	11211	US	40.70808	-73.95090	1	3.0	281	7.774074	27	209.9	7.742308	26	201.3	0.0	0	0.0	8.6	1	8.6

Jared data

polla_qid	answer	votes	pollq_id	question	place	time	total_votes	percent
2	Excellent	0	2	How was Pizza Mercato?	Pizza Mercato	1344361527	13	0.0000
2	Good	6	2	How was Pizza Mercato?	Pizza Mercato	1344361527	13	0.4615

Datainfiniti data

name	address	city	country	province	latitude	longitude	categories	price_range_min	price_range_max
Shotgun Dans Pizza	4203 E Kiehl Ave	Sherwood	US	AR	34.83230	-92.1838	Pizza,Restaurant,American restaurants,Pizza Place,Restaurants	0	25
Sauce Pizza Wine	25 E Camelback Rd	Phoenix	US	AZ	33.50927	-112.0730	Pizza,Pizza Place,Restaurants	0	25

Data Cleaning

In this section, we will go through each dataset of this project and find out what all issues are associated to them with respect to data cleaning.

Barstooldata

First lets find out if we have any missing values in Barstool data.
For simplicity,we will remove the rows with missing values.
We will remove the duplicate rows if any.
We will also remove the unnecessary columns not needed for our project.
Remove unrelevent data for example - We are only focusing on NY restaurants and Barstool data has Restaurants from other locations too.

# Lets find out if any column has missing values and also get the count
missing_barstool <- sapply(barstool, function(x) sum(is.na(x)))
knitr::kable(missing_barstool, format = "html")

	x
name	0
address1	0
city	0
zip	0
country	0
latitude	2
longitude	2
price_level	0
provider_rating	0
provider_review_count	0
review_stats_all_average_score	0
review_stats_all_count	0
review_stats_all_total_score	0
review_stats_community_average_score	0
review_stats_community_count	0
review_stats_community_total_score	0
review_stats_critic_average_score	0
review_stats_critic_count	0
review_stats_critic_total_score	0
review_stats_dave_average_score	0
review_stats_dave_count	0
review_stats_dave_total_score	0

Here we can see the there are 2 columns longitude and latitude which have missing values and thus we will remove them.

#removing the missing data rows
barstoolcleaned <- drop_na(barstool)
#Keep only Distinct rows
barstoolDistinct <- distinct(barstoolcleaned)
#Filtering out NY data. Can use the below command if you want to see distinct values in the city column
#unique(barstoolDistinct$city)
barstoolNY <- filter(barstoolDistinct, city == 'New York')
#After running the previous command rows have been reduced to 249
nrow(barstoolNY)

## [1] 249

#lets see the new transformed data
datatable(barstoolNY)

Now lets see what we can clean for Jared’s data

By looking at the data we can see that there was supposed to be 6 types of rating but most of them have only 5 and are missing the rating Fair. Since there is only 1 restaurant with rating Fair - We will drop that record and go ahead with remaining 5 ratings.

jared_data_cleaned <- filter(jared_data,answer != 'Fair')
datatable(jared_data_cleaned)

Pizza_ DataInfiniti has the following isssues

Lots of duplicates.
Consist data for many cities and we only need data for New York

So, let’s clean it

datafiniti_clean <- distinct(datafiniti) %>% filter(city == 'New York')
datatable(datafiniti_clean)

Also, one last thing we need to do before we are done with the initial cleaning of Datainfinity file. Look at the variable categories. It s a multivalue column. We can convert it into separate columns.

First find out how many columns do we need. Below command will tell us the row which has highest categories and hence we will create those many columns.

max(str_count(datafiniti_clean$categories,','))

## [1] 6

datafiniti_clean_unmerged <- datafiniti_clean %>% 
  separate(categories,into=c('cat1','cat2','cat3','cat4','cat5','cat6','cat7'),',')
# Lets take a look if that worked
datatable(head(datafiniti_clean_unmerged,5))

Now we can use this transformed table to find a restaurant with a particular category. For example you may want to find a place which is Kosher

Now since we have done the initial cleaning of the data we will look at the summary statistics of the dataset to find if anything needs to looked into or cleaned

#Glimpse will provide a summary of columns and datatype
glimpse(barstoolNY)

## Observations: 249
## Variables: 22
## $ name                                 <fct> 99 Cent Fresh Pizza, Nino...
## $ address1                             <fct> 473 Lexington Ave, 39 W 4...
## $ city                                 <fct> New York, New York, New Y...
## $ zip                                  <int> 10017, 10036, 10003, 1002...
## $ country                              <fct> US, US, US, US, US, US, U...
## $ latitude                             <dbl> 40.75370, 40.75690, 40.73...
## $ longitude                            <dbl> -73.97411, -73.98052, -73...
## $ price_level                          <int> 1, 2, 2, 1, 1, 2, 2, 2, 1...
## $ provider_rating                      <dbl> 4.0, 4.0, 3.0, 3.0, 4.0, ...
## $ provider_review_count                <int> 118, 1055, 143, 95, 324, ...
## $ review_stats_all_average_score       <dbl> 5.666667, 5.600000, 7.100...
## $ review_stats_all_count               <int> 9, 2, 1, 5, 17, 14, 6, 37...
## $ review_stats_all_total_score         <dbl> 51.0, 11.2, 7.1, 30.6, 12...
## $ review_stats_community_average_score <dbl> 5.762500, 0.000000, 0.000...
## $ review_stats_community_count         <int> 8, 0, 0, 4, 16, 13, 4, 36...
## $ review_stats_community_total_score   <dbl> 46.1, 0.0, 0.0, 24.5, 119...
## $ review_stats_critic_average_score    <dbl> 0.0, 4.3, 0.0, 0.0, 0.0, ...
## $ review_stats_critic_count            <int> 0, 1, 0, 0, 0, 0, 1, 0, 1...
## $ review_stats_critic_total_score      <dbl> 0.0, 4.3, 0.0, 0.0, 0.0, ...
## $ review_stats_dave_average_score      <dbl> 4.9, 6.9, 7.1, 6.1, 7.2, ...
## $ review_stats_dave_count              <int> 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ review_stats_dave_total_score        <dbl> 4.9, 6.9, 7.1, 6.1, 7.2, ...

glimpse(jared_data_cleaned)

## Observations: 374
## Variables: 9
## $ polla_qid   <int> 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5,...
## $ answer      <fct> Excellent, Good, Average, Poor, Never Again, Excel...
## $ votes       <int> 0, 6, 4, 1, 2, 1, 1, 3, 1, 1, 4, 2, 1, 1, 0, 1, 1,...
## $ pollq_id    <int> 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5,...
## $ question    <fct> How was Pizza Mercato?, How was Pizza Mercato?, Ho...
## $ place       <fct> Pizza Mercato, Pizza Mercato, Pizza Mercato, Pizza...
## $ time        <int> 1344361527, 1344361527, 1344361527, 1344361527, 13...
## $ total_votes <int> 13, 13, 13, 13, 13, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, ...
## $ percent     <dbl> 0.0000, 0.4615, 0.3077, 0.0769, 0.1538, 0.1429, 0....

glimpse(datafiniti_clean)

## Observations: 95
## Variables: 10
## $ name            <fct> Salvos Pizzabar, East Broadway Pizza, Dona Bel...
## $ address         <fct> 1477 York Ave, 181 E Broadway, 154 Church Stee...
## $ city            <fct> New York, New York, New York, New York, New Yo...
## $ country         <fct> US, US, US, US, US, US, US, US, US, US, US, US...
## $ province        <fct> NY, NY, NY, NY, NY, NY, NY, NY, NY, NY, NY, NY...
## $ latitude        <dbl> 40.77083, 40.71420, 40.71499, 40.79013, 40.710...
## $ longitude       <dbl> -73.95102, -73.98939, -74.00785, -73.97320, -7...
## $ categories      <fct> "Restaurant,Pizza Place Upper East Side,Pizza ...
## $ price_range_min <int> 25, 0, 0, 0, 25, 25, 25, 0, 25, 0, 0, 0, 25, 0...
## $ price_range_max <int> 40, 25, 25, 25, 40, 40, 40, 25, 40, 25, 25, 25...

Looking at the output of the glimpse will help up later if we need to change the datatype of a particular variable for exploratory analysis

Now lets dive into our data and look closely at variables(columns), its outlier and numerical summary of them.

Lets look at Barstool Data.

summary(barstoolNY)

##                  name              address1           city    
##  Joe's Pizza       :  2   268 W 23rd St:  2   New York  :249  
##  Justino's Pizzeria:  2   936 8th Ave  :  2   Alpharetta:  0  
##  Kiss My Slice     :  2   1 Perry St   :  1   Ann Arbor :  0  
##  Little Italy Pizza:  2   1005 2nd Ave :  1   Atlanta   :  0  
##  Mariella Pizza    :  2   101 W 68th St:  1   Augusta   :  0  
##  Patsy's Pizzeria  :  2   1055 1st Ave :  1   Austin    :  0  
##  (Other)           :237   (Other)      :241   (Other)   :  0  
##       zip        country     latitude       longitude       price_level   
##  Min.   :10001   US:249   Min.   :40.64   Min.   :-74.13   Min.   :0.000  
##  1st Qu.:10011            1st Qu.:40.73   1st Qu.:-74.00   1st Qu.:1.000  
##  Median :10014            Median :40.75   Median :-73.99   Median :1.000  
##  Mean   :10021            Mean   :40.74   Mean   :-73.99   Mean   :1.398  
##  3rd Qu.:10022            3rd Qu.:40.76   3rd Qu.:-73.98   3rd Qu.:2.000  
##  Max.   :10280            Max.   :40.80   Max.   :-73.93   Max.   :3.000  
##                                                                           
##  provider_rating provider_review_count review_stats_all_average_score
##  Min.   :2.000   Min.   :   4.0        Min.   :0.100                 
##  1st Qu.:3.500   1st Qu.:  76.0        1st Qu.:6.019                 
##  Median :3.500   Median : 162.0        Median :6.900                 
##  Mean   :3.606   Mean   : 323.7        Mean   :6.657                 
##  3rd Qu.:4.000   3rd Qu.: 345.0        3rd Qu.:7.658                 
##  Max.   :5.000   Max.   :5797.0        Max.   :8.967                 
##                                                                      
##  review_stats_all_count review_stats_all_total_score
##  Min.   :  1.0          Min.   :   0.1              
##  1st Qu.:  3.0          1st Qu.:  20.3              
##  Median :  6.0          Median :  39.7              
##  Mean   : 16.3          Mean   : 127.2              
##  3rd Qu.: 13.0          3rd Qu.:  89.4              
##  Max.   :568.0          Max.   :5045.6              
##                                                     
##  review_stats_community_average_score review_stats_community_count
##  Min.   :0.000                        Min.   :  0.0               
##  1st Qu.:5.725                        1st Qu.:  2.0               
##  Median :6.900                        Median :  5.0               
##  Mean   :5.973                        Mean   : 15.1               
##  3rd Qu.:7.650                        3rd Qu.: 12.0               
##  Max.   :8.942                        Max.   :567.0               
##                                                                   
##  review_stats_community_total_score review_stats_critic_average_score
##  Min.   :   0.0                     Min.   : 0.000                   
##  1st Qu.:  12.5                     1st Qu.: 0.000                   
##  Median :  30.6                     Median : 0.000                   
##  Mean   : 119.3                     Mean   : 1.348                   
##  3rd Qu.:  81.6                     3rd Qu.: 0.000                   
##  Max.   :5036.3                     Max.   :11.000                   
##                                                                      
##  review_stats_critic_count review_stats_critic_total_score
##  Min.   :0.0000            Min.   : 0.000                 
##  1st Qu.:0.0000            1st Qu.: 0.000                 
##  Median :0.0000            Median : 0.000                 
##  Mean   :0.2008            Mean   : 1.444                 
##  3rd Qu.:0.0000            3rd Qu.: 0.000                 
##  Max.   :5.0000            Max.   :29.800                 
##                                                           
##  review_stats_dave_average_score review_stats_dave_count
##  Min.   :0.080                   Min.   :1              
##  1st Qu.:5.900                   1st Qu.:1              
##  Median :6.900                   Median :1              
##  Mean   :6.487                   Mean   :1              
##  3rd Qu.:7.800                   3rd Qu.:1              
##  Max.   :9.300                   Max.   :1              
##                                                         
##  review_stats_dave_total_score
##  Min.   :0.080                
##  1st Qu.:5.900                
##  Median :6.900                
##  Mean   :6.487                
##  3rd Qu.:7.800                
##  Max.   :9.300                
##

The summary output tells us a lot about the data. It also produce output based on datatype of the variable. For example if something is categorical/string variable we will see a set of values and it counts and its example is name variable.

Also, if a variable is numeric(real, integer, double) we will see some statistical output for it. example here is provider_review_count

A lot can found out by the previous output-

looking at the name variable, we see many restaurants have count 2 but didnt we remove duplicates? Let’s not come to the conclusion that our data is still dirty. Lets first see these rows. We will take an example of Joe’s Pizza

datatable(filter(barstoolNY,name=="Joe's Pizza"))

YOU SEE! These rows are not duplicate and infact are 2 diffrent locations of a Pizza chain. So we are good here. We will not merge these rows as you can see a stark diffrence in the rating between these 2 locations. Shouldn’t they taste the same?

Now, look at the numeric variable output and we clearly see there are outliers in the data. Let’s take a look at a particular example - provider_review_count. Look at the min and max value. There is such a vast diffrence and these outliers which are far from median can produce some inaccurate results if considered in the calculations.

Now the question comes- Should be keep them or not? The answer is IT DEPENDS! Ask your stakeholders if they really need the info about them or if is it ok we can ignore them in our calculations? Sometimes we can ignore them if we are looking at the overall picture or trying to fit a model. In our case we will keep only the outliers which are on the higher side because more the ratings more accurate the information will be. Ignoring the lower bound outliers will make sure that we do not include untrustworthy data. For example if some restaurant has only 4 reviews and have all 5 stars then you might think it is better than a restaurant with 4.2 rating but 1000 reviews. All those 4 reviews can be the paid reviews or from its owner itself.

so lets remove the lowerbound outliers -

# We will keep only those restaurants which have at least 75 reviewsby the provider. You can choose any other variable/review column as the filter 
barstoolNY_Cleaned <- filter(barstoolNY,provider_review_count >= 75)

As we progress through our project, we may find out some more issues and can come back to this task of data cleaning. But as of now, I think we are good.

Merging/Joining

Also, since data is scattered in 3 different files we would be joining them to bring the data together.

Note- We will show the merging as part of final exam.

Exploratory analysis

Ratings

By this time you would be knowing for sure that we are on a mission to find the best Pizza place in New York. But- As we all are different so do our preferences. Meaning of ‘Best Pizza’ is subjective and can include or exclude many factors.

So, we will try to find out the top 5 best pizza place based on -

Overall ratings
Value for money/price level
Community ratings
Barstool ratings
Critic ratings

All of these can be easily showed by using beautiful graphs made on ggplot2 Choice of graph - Barchart

1. Overall ratings

Just for the sake of simplicity, for this viz we wil use the average score of all reviews, which is ‘review_stats_all_average_score’ column in barstool dataset. Here we will make 2 charts. 1. Best pizza places among the restaurants present in Barstool. 2. Only those restaurants present in Jared’s dataset.

Now let’s see who has the highest excellent rantings based on Jared’s data and rating and check if those restaurants are rated highly by Barstool too.

Comparing this viz with previous viz tells us that Barstool Ratings are not in line with Jared’s ratings as none of the top restaurants from both graphs overlap.

2. Comparing ratings for top 5 restaurants.

Our first viz was based on the overall average score. Now for these top 5 restaurants we will see how they have been rated by - provider, community, critics and Barstool(Dave)

Finding - we can see that although Serafina Broadway restaurant is not the highest rated restaurant by provider, community and Barstool but it still tops the overall average score. The reason behind it that it has been highly rated by critic. So, this forces us to think whether critic was paid by the restaurant for good rating???

3. It is all about money. One might think highest rated restaurants might be expensive and so do I. So lets find out whether our top 5 restaurants are expensive or not.

Good news, not all top rated restaurants are expensive. Sauce Pizzeria has a lower price level. Thus students like me can enjoy best pizza for cheap.

4. Best Veg Restaurant

In united States, sometimes it is very hard to find Vegetarian food. So, for our vegetarian friends lets find a Pizza place with highest rating.

Aha! We have found a place for our friend Jack. (Remember he is the vegetarian guy )

Conclusion

Although, I am completing this project at this point, It doesnt mean that the journey of finding more information about pizza restaturants of NYC ends. There are a lot of things we could have included in this project but unfortunately I need more time and R coding knowledge to go about with it.

Before I bid adieu, lets summarize what we did. We started with a set of data files which had information about pizza places across US, their locations, ratings and price range. We set on a journey to find the best rated pizza places, compared them across different rating systems and also looked at the price range.

We also found out the best vegetarian pizza place for our friends like Jack.

I hope this information would be useful for the reader of this report and I wish them finding the tastiest pizza in NY.