About me
Hi I am Sahil and an aspiring Data Analyst. This is the very begining of my career where I am planning to be an efficient R coder. To know more about me you can find it out at -Sahil Kala
Our Data
As a part of the project, my professon Justin Jordey provided us with a list of available data to choose from. And since I love Pizza, I chose this data set which will enable you and me to find the best Pizza place in NY! The data has indepth information about Restaurants. Like - Ratings, reviews, location and price ranges. In our analysis - We will be using R and its various packages to drill deep into the data and find out information which will enable foodies like you and me to get the most out of our hard earned money(or just save you parents money :p) Isn’t is Great?
Brief Overview
The data we are using n the project was part of TidyTuesday which is a weekly social data project in R. Here people post dataset and the community use it to make great projects and bring out best analysis and insights.
Here we have got three CSV files -
To know more about the dataset and the authors - Click Here
Every dataset can provide a ton of ionformation. Hence, so does our dataset. But looking at the time constraint and amount of R knowledge I have(I will keep on learning more :) ), we will still to a basic data exploration. But, just because I am saying basic doesnt mean it wont have any impact on consumer knowledge.
Problem and the solution!
So, you love pizza and want to eat the best when you are in NYC? The bad news is your tummy has limited capacity and once you end up at a bad pizza place you will regret and so does your stomach.
Now the question is how to find those amazing pizza places? The simplest answer is follow the reviews and ratings the restaurants have received by our fellow pizza lovers, critics and companies like Barstool.
Also, people like Jack, who is a vegetarian (seriously that sound so bland ). Anyway he loves pizza too but doesnt get the veg options at all the places. And since he is going to NYC, we will find a place for him to satisfy his pizza craving.
Rating
Our data consist of a set of ratings for most of the pizza places around US, but we will limit our knowledge to NYC. You will learn more about our data cleaning and other trasformation later. But for now as part of our solution, we will try to find out the top 5 restaurants based on diffrent ratings and also try to find out if diffrent rating system has same results.
The choice of visualization would be simple bar charts which convey information very well to the readers.
Price
We will also try to find out whether top restaurants are expensive or not. Because students like me who love pizza might not be able to afford those good pizzas. Here, we will check the price level for the top 5 restaurants based on average ratings. The choice of plot would be scatter plot.
To go ahead with project analysis we would be using a few libraries which will help us through tasks like 1. Data Cleaning 2. Make great Visualsation 3. Data manipulation
Let’s first load the libraries needed
# "dplyr", for data wrangling
# "ggplot2", for drawing nicer graphics
# "Tidyverse" for clean, process, model, and visualize data
# "stringr" for Simple, Consistent Wrappers for Common String Operations
# "DT" for the JavaScript Library 'DataTables'
# "gridExtra" for arranging charts'
# Loading required (CRAN) packages
library(dplyr)
library(ggplot2)
library(tidyverse)
library(DT)
library(stringr)
library(gridExtra)
In today’s world there are myraid ways of capturing the data and is done at various levels. And this results in data which is untidy or you can say ‘dirty’. Any data science/ analytics project cannot be a success if the individual(s) do not pay attention to the data being worked on. The first and foremost step of any project is to know in and out of the data throughly. Individuals needs to know what that data signifies and what level of importance each variable has. Applying data cleaning techniques can make your project so much easier, it will make your predictions and analysis so much easier and thus in turn will be useful for the user.
So, now we know the secret of a successful project! Let’s apply the data cleaning tips and tricks learnt in class on our data set and be on a journey to find the best Pizza in the world, I mean in NYC :)
Data Import Before we start cleaning, we first need to import our data set(s) in this project. (Be sure to have the source csv files avaliable in the project folder)
barstool <- read.csv(file= 'pizza_barstool.csv')
jared_data <- read.csv(file='pizza_jared.csv')
datafiniti <- read.csv(file='pizza_datafiniti.csv')
Also, let’s take a glimpse of the data
Barstool data
| name | address1 | city | zip | country | latitude | longitude | price_level | provider_rating | provider_review_count | review_stats_all_average_score | review_stats_all_count | review_stats_all_total_score | review_stats_community_average_score | review_stats_community_count | review_stats_community_total_score | review_stats_critic_average_score | review_stats_critic_count | review_stats_critic_total_score | review_stats_dave_average_score | review_stats_dave_count | review_stats_dave_total_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pugsley’s Pizza | 590 E 191st St | Bronx | 10458 | US | 40.85877 | -73.88484 | 1 | 4.5 | 121 | 8.011111 | 27 | 216.3 | 7.992000 | 25 | 199.8 | 8.8 | 1 | 8.8 | 7.7 | 1 | 7.7 |
| Williamsburg Pizza | 265 Union Ave | Brooklyn | 11211 | US | 40.70808 | -73.95090 | 1 | 3.0 | 281 | 7.774074 | 27 | 209.9 | 7.742308 | 26 | 201.3 | 0.0 | 0 | 0.0 | 8.6 | 1 | 8.6 |
Jared data
| polla_qid | answer | votes | pollq_id | question | place | time | total_votes | percent |
|---|---|---|---|---|---|---|---|---|
| 2 | Excellent | 0 | 2 | How was Pizza Mercato? | Pizza Mercato | 1344361527 | 13 | 0.0000 |
| 2 | Good | 6 | 2 | How was Pizza Mercato? | Pizza Mercato | 1344361527 | 13 | 0.4615 |
Datainfiniti data
| name | address | city | country | province | latitude | longitude | categories | price_range_min | price_range_max |
|---|---|---|---|---|---|---|---|---|---|
| Shotgun Dans Pizza | 4203 E Kiehl Ave | Sherwood | US | AR | 34.83230 | -92.1838 | Pizza,Restaurant,American restaurants,Pizza Place,Restaurants | 0 | 25 |
| Sauce Pizza Wine | 25 E Camelback Rd | Phoenix | US | AZ | 33.50927 | -112.0730 | Pizza,Pizza Place,Restaurants | 0 | 25 |
In this section, we will go through each dataset of this project and find out what all issues are associated to them with respect to data cleaning.
Barstooldata
# Lets find out if any column has missing values and also get the count
missing_barstool <- sapply(barstool, function(x) sum(is.na(x)))
knitr::kable(missing_barstool, format = "html")
| x | |
|---|---|
| name | 0 |
| address1 | 0 |
| city | 0 |
| zip | 0 |
| country | 0 |
| latitude | 2 |
| longitude | 2 |
| price_level | 0 |
| provider_rating | 0 |
| provider_review_count | 0 |
| review_stats_all_average_score | 0 |
| review_stats_all_count | 0 |
| review_stats_all_total_score | 0 |
| review_stats_community_average_score | 0 |
| review_stats_community_count | 0 |
| review_stats_community_total_score | 0 |
| review_stats_critic_average_score | 0 |
| review_stats_critic_count | 0 |
| review_stats_critic_total_score | 0 |
| review_stats_dave_average_score | 0 |
| review_stats_dave_count | 0 |
| review_stats_dave_total_score | 0 |
Here we can see the there are 2 columns longitude and latitude which have missing values and thus we will remove them.
#removing the missing data rows
barstoolcleaned <- drop_na(barstool)
#Keep only Distinct rows
barstoolDistinct <- distinct(barstoolcleaned)
#Filtering out NY data. Can use the below command if you want to see distinct values in the city column
#unique(barstoolDistinct$city)
barstoolNY <- filter(barstoolDistinct, city == 'New York')
#After running the previous command rows have been reduced to 249
nrow(barstoolNY)
## [1] 249
#lets see the new transformed data
datatable(barstoolNY)
Now lets see what we can clean for Jared’s data
jared_data_cleaned <- filter(jared_data,answer != 'Fair')
datatable(jared_data_cleaned)
Pizza_ DataInfiniti has the following isssues
So, let’s clean it
datafiniti_clean <- distinct(datafiniti) %>% filter(city == 'New York')
datatable(datafiniti_clean)
Also, one last thing we need to do before we are done with the initial cleaning of Datainfinity file. Look at the variable categories. It s a multivalue column. We can convert it into separate columns.
First find out how many columns do we need. Below command will tell us the row which has highest categories and hence we will create those many columns.
max(str_count(datafiniti_clean$categories,','))
## [1] 6
datafiniti_clean_unmerged <- datafiniti_clean %>%
separate(categories,into=c('cat1','cat2','cat3','cat4','cat5','cat6','cat7'),',')
# Lets take a look if that worked
datatable(head(datafiniti_clean_unmerged,5))
Now we can use this transformed table to find a restaurant with a particular category. For example you may want to find a place which is Kosher
Now since we have done the initial cleaning of the data we will look at the summary statistics of the dataset to find if anything needs to looked into or cleaned
#Glimpse will provide a summary of columns and datatype
glimpse(barstoolNY)
## Observations: 249
## Variables: 22
## $ name <fct> 99 Cent Fresh Pizza, Nino...
## $ address1 <fct> 473 Lexington Ave, 39 W 4...
## $ city <fct> New York, New York, New Y...
## $ zip <int> 10017, 10036, 10003, 1002...
## $ country <fct> US, US, US, US, US, US, U...
## $ latitude <dbl> 40.75370, 40.75690, 40.73...
## $ longitude <dbl> -73.97411, -73.98052, -73...
## $ price_level <int> 1, 2, 2, 1, 1, 2, 2, 2, 1...
## $ provider_rating <dbl> 4.0, 4.0, 3.0, 3.0, 4.0, ...
## $ provider_review_count <int> 118, 1055, 143, 95, 324, ...
## $ review_stats_all_average_score <dbl> 5.666667, 5.600000, 7.100...
## $ review_stats_all_count <int> 9, 2, 1, 5, 17, 14, 6, 37...
## $ review_stats_all_total_score <dbl> 51.0, 11.2, 7.1, 30.6, 12...
## $ review_stats_community_average_score <dbl> 5.762500, 0.000000, 0.000...
## $ review_stats_community_count <int> 8, 0, 0, 4, 16, 13, 4, 36...
## $ review_stats_community_total_score <dbl> 46.1, 0.0, 0.0, 24.5, 119...
## $ review_stats_critic_average_score <dbl> 0.0, 4.3, 0.0, 0.0, 0.0, ...
## $ review_stats_critic_count <int> 0, 1, 0, 0, 0, 0, 1, 0, 1...
## $ review_stats_critic_total_score <dbl> 0.0, 4.3, 0.0, 0.0, 0.0, ...
## $ review_stats_dave_average_score <dbl> 4.9, 6.9, 7.1, 6.1, 7.2, ...
## $ review_stats_dave_count <int> 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ review_stats_dave_total_score <dbl> 4.9, 6.9, 7.1, 6.1, 7.2, ...
glimpse(jared_data_cleaned)
## Observations: 374
## Variables: 9
## $ polla_qid <int> 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5,...
## $ answer <fct> Excellent, Good, Average, Poor, Never Again, Excel...
## $ votes <int> 0, 6, 4, 1, 2, 1, 1, 3, 1, 1, 4, 2, 1, 1, 0, 1, 1,...
## $ pollq_id <int> 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5,...
## $ question <fct> How was Pizza Mercato?, How was Pizza Mercato?, Ho...
## $ place <fct> Pizza Mercato, Pizza Mercato, Pizza Mercato, Pizza...
## $ time <int> 1344361527, 1344361527, 1344361527, 1344361527, 13...
## $ total_votes <int> 13, 13, 13, 13, 13, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, ...
## $ percent <dbl> 0.0000, 0.4615, 0.3077, 0.0769, 0.1538, 0.1429, 0....
glimpse(datafiniti_clean)
## Observations: 95
## Variables: 10
## $ name <fct> Salvos Pizzabar, East Broadway Pizza, Dona Bel...
## $ address <fct> 1477 York Ave, 181 E Broadway, 154 Church Stee...
## $ city <fct> New York, New York, New York, New York, New Yo...
## $ country <fct> US, US, US, US, US, US, US, US, US, US, US, US...
## $ province <fct> NY, NY, NY, NY, NY, NY, NY, NY, NY, NY, NY, NY...
## $ latitude <dbl> 40.77083, 40.71420, 40.71499, 40.79013, 40.710...
## $ longitude <dbl> -73.95102, -73.98939, -74.00785, -73.97320, -7...
## $ categories <fct> "Restaurant,Pizza Place Upper East Side,Pizza ...
## $ price_range_min <int> 25, 0, 0, 0, 25, 25, 25, 0, 25, 0, 0, 0, 25, 0...
## $ price_range_max <int> 40, 25, 25, 25, 40, 40, 40, 25, 40, 25, 25, 25...
Looking at the output of the glimpse will help up later if we need to change the datatype of a particular variable for exploratory analysis
Now lets dive into our data and look closely at variables(columns), its outlier and numerical summary of them.
Lets look at Barstool Data.
summary(barstoolNY)
## name address1 city
## Joe's Pizza : 2 268 W 23rd St: 2 New York :249
## Justino's Pizzeria: 2 936 8th Ave : 2 Alpharetta: 0
## Kiss My Slice : 2 1 Perry St : 1 Ann Arbor : 0
## Little Italy Pizza: 2 1005 2nd Ave : 1 Atlanta : 0
## Mariella Pizza : 2 101 W 68th St: 1 Augusta : 0
## Patsy's Pizzeria : 2 1055 1st Ave : 1 Austin : 0
## (Other) :237 (Other) :241 (Other) : 0
## zip country latitude longitude price_level
## Min. :10001 US:249 Min. :40.64 Min. :-74.13 Min. :0.000
## 1st Qu.:10011 1st Qu.:40.73 1st Qu.:-74.00 1st Qu.:1.000
## Median :10014 Median :40.75 Median :-73.99 Median :1.000
## Mean :10021 Mean :40.74 Mean :-73.99 Mean :1.398
## 3rd Qu.:10022 3rd Qu.:40.76 3rd Qu.:-73.98 3rd Qu.:2.000
## Max. :10280 Max. :40.80 Max. :-73.93 Max. :3.000
##
## provider_rating provider_review_count review_stats_all_average_score
## Min. :2.000 Min. : 4.0 Min. :0.100
## 1st Qu.:3.500 1st Qu.: 76.0 1st Qu.:6.019
## Median :3.500 Median : 162.0 Median :6.900
## Mean :3.606 Mean : 323.7 Mean :6.657
## 3rd Qu.:4.000 3rd Qu.: 345.0 3rd Qu.:7.658
## Max. :5.000 Max. :5797.0 Max. :8.967
##
## review_stats_all_count review_stats_all_total_score
## Min. : 1.0 Min. : 0.1
## 1st Qu.: 3.0 1st Qu.: 20.3
## Median : 6.0 Median : 39.7
## Mean : 16.3 Mean : 127.2
## 3rd Qu.: 13.0 3rd Qu.: 89.4
## Max. :568.0 Max. :5045.6
##
## review_stats_community_average_score review_stats_community_count
## Min. :0.000 Min. : 0.0
## 1st Qu.:5.725 1st Qu.: 2.0
## Median :6.900 Median : 5.0
## Mean :5.973 Mean : 15.1
## 3rd Qu.:7.650 3rd Qu.: 12.0
## Max. :8.942 Max. :567.0
##
## review_stats_community_total_score review_stats_critic_average_score
## Min. : 0.0 Min. : 0.000
## 1st Qu.: 12.5 1st Qu.: 0.000
## Median : 30.6 Median : 0.000
## Mean : 119.3 Mean : 1.348
## 3rd Qu.: 81.6 3rd Qu.: 0.000
## Max. :5036.3 Max. :11.000
##
## review_stats_critic_count review_stats_critic_total_score
## Min. :0.0000 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.: 0.000
## Median :0.0000 Median : 0.000
## Mean :0.2008 Mean : 1.444
## 3rd Qu.:0.0000 3rd Qu.: 0.000
## Max. :5.0000 Max. :29.800
##
## review_stats_dave_average_score review_stats_dave_count
## Min. :0.080 Min. :1
## 1st Qu.:5.900 1st Qu.:1
## Median :6.900 Median :1
## Mean :6.487 Mean :1
## 3rd Qu.:7.800 3rd Qu.:1
## Max. :9.300 Max. :1
##
## review_stats_dave_total_score
## Min. :0.080
## 1st Qu.:5.900
## Median :6.900
## Mean :6.487
## 3rd Qu.:7.800
## Max. :9.300
##
The summary output tells us a lot about the data. It also produce output based on datatype of the variable. For example if something is categorical/string variable we will see a set of values and it counts and its example is name variable.
Also, if a variable is numeric(real, integer, double) we will see some statistical output for it. example here is provider_review_count
A lot can found out by the previous output-
datatable(filter(barstoolNY,name=="Joe's Pizza"))
YOU SEE! These rows are not duplicate and infact are 2 diffrent locations of a Pizza chain. So we are good here. We will not merge these rows as you can see a stark diffrence in the rating between these 2 locations. Shouldn’t they taste the same?
Now, look at the numeric variable output and we clearly see there are outliers in the data. Let’s take a look at a particular example - provider_review_count. Look at the min and max value. There is such a vast diffrence and these outliers which are far from median can produce some inaccurate results if considered in the calculations.
Now the question comes- Should be keep them or not? The answer is IT DEPENDS! Ask your stakeholders if they really need the info about them or if is it ok we can ignore them in our calculations? Sometimes we can ignore them if we are looking at the overall picture or trying to fit a model. In our case we will keep only the outliers which are on the higher side because more the ratings more accurate the information will be. Ignoring the lower bound outliers will make sure that we do not include untrustworthy data. For example if some restaurant has only 4 reviews and have all 5 stars then you might think it is better than a restaurant with 4.2 rating but 1000 reviews. All those 4 reviews can be the paid reviews or from its owner itself.
so lets remove the lowerbound outliers -
# We will keep only those restaurants which have at least 75 reviewsby the provider. You can choose any other variable/review column as the filter
barstoolNY_Cleaned <- filter(barstoolNY,provider_review_count >= 75)
As we progress through our project, we may find out some more issues and can come back to this task of data cleaning. But as of now, I think we are good.
Merging/Joining
Also, since data is scattered in 3 different files we would be joining them to bring the data together.
Note- We will show the merging as part of final exam.
Ratings
By this time you would be knowing for sure that we are on a mission to find the best Pizza place in New York. But- As we all are different so do our preferences. Meaning of ‘Best Pizza’ is subjective and can include or exclude many factors.
So, we will try to find out the top 5 best pizza place based on -
All of these can be easily showed by using beautiful graphs made on ggplot2 Choice of graph - Barchart
1. Overall ratings
Just for the sake of simplicity, for this viz we wil use the average score of all reviews, which is ‘review_stats_all_average_score’ column in barstool dataset. Here we will make 2 charts. 1. Best pizza places among the restaurants present in Barstool. 2. Only those restaurants present in Jared’s dataset.
Now let’s see who has the highest excellent rantings based on Jared’s data and rating and check if those restaurants are rated highly by Barstool too.
Comparing this viz with previous viz tells us that Barstool Ratings are not in line with Jared’s ratings as none of the top restaurants from both graphs overlap.
2. Comparing ratings for top 5 restaurants.
Our first viz was based on the overall average score. Now for these top 5 restaurants we will see how they have been rated by - provider, community, critics and Barstool(Dave)
Finding - we can see that although Serafina Broadway restaurant is not the highest rated restaurant by provider, community and Barstool but it still tops the overall average score. The reason behind it that it has been highly rated by critic. So, this forces us to think whether critic was paid by the restaurant for good rating???
3. It is all about money. One might think highest rated restaurants might be expensive and so do I. So lets find out whether our top 5 restaurants are expensive or not.
Good news, not all top rated restaurants are expensive. Sauce Pizzeria has a lower price level. Thus students like me can enjoy best pizza for cheap.
4. Best Veg Restaurant
In united States, sometimes it is very hard to find Vegetarian food. So, for our vegetarian friends lets find a Pizza place with highest rating.
Aha! We have found a place for our friend Jack. (Remember he is the vegetarian guy )
Although, I am completing this project at this point, It doesnt mean that the journey of finding more information about pizza restaturants of NYC ends. There are a lot of things we could have included in this project but unfortunately I need more time and R coding knowledge to go about with it.
Before I bid adieu, lets summarize what we did. We started with a set of data files which had information about pizza places across US, their locations, ratings and price range. We set on a journey to find the best rated pizza places, compared them across different rating systems and also looked at the price range.
We also found out the best vegetarian pizza place for our friends like Jack.
I hope this information would be useful for the reader of this report and I wish them finding the tastiest pizza in NY.