Pizza Breakdown

Introduction

Our topic is pizza ratings data. The objective of this project is to find excellent pizza based on location, reviews, and price, and ultimately to build a pizza map focused on New York, which is known as the “Pizza Capital of the U.S.”.

We used data from three sources: Jared, Barstool, and Datafiniti. Jared’s data covers top NY pizza restaurants, with ratings collected through a 6-point Likert scale survey. The Barstool Sports dataset has critic, public, and Barstool staff ratings as well as pricing, location, and geo-location. There are 22 pizza places that overlap between these two datasets. Datafiniti includes 10,000 pizza places with their price ranges and geo-locations. (More Description on GitHub) The problem at hand is how to pinpoint the best pizza across these complex datasets. Let’s begin!

In this midterm report and the final report that follows, the approaches and analytic techniques we used or plan to use include:

Data cleaning
EDA to detect distribution ranges, percentage of missing values and outliers
Join the three datasets into one that has ratings and price ranges to complement the rest of the data analysis
Plot a pizza map that lets consumers filter for their potential favorite pizza restaurant based on their own criteria (i.e. price, review, and location)
Apply machine learning techniques (i.e. regression models and cluster analysis) in order to find the best pockets of pizza in the city

Packages to Download

For this project, most of the packages we used are from class. They cover reading, tidying, and manipulating the data in our pizza set. The main additional packages are leaflet, maps, and DataExplorer. The first two work hand in hand to give us the map visualization, and DataExplorer is used for EDA.

Please install any of these packages that you do not already have.

library(readr) ## Reading in data
library(tidyverse) ## Tidying the data
library(dplyr) ## Manipulating data
library(DT) ## Outputting data in a clean format
library(magrittr) ## Pipe operators
library(leaflet) ## Map visualizations
library(maps) ## Placing maps under leaflet
library(DataExplorer) ## Exploratory data analysis

Data Preparation

Data Introduction

Click here to download the original data source

jared <- read_csv("C:/Users/Zhuo Chen/Downloads/Data Wrangling with R/Project/pizza_jared.csv")
barstool <- read_csv("C:/Users/Zhuo Chen/Downloads/Data Wrangling with R/Project/pizza_barstool.csv")
datafiniti <- read_csv("C:/Users/Zhuo Chen/Downloads/Data Wrangling with R/Project/pizza_datafiniti.csv")

Jared Data Set

This dataset is from Jared Lander (Twitter), who is the Chief Data Scientist of Lander Analytics, the author of R for Everyone, as well as a famous Pizza Tour guide. He built this database based on customer votes of pizza restaurants. This dataset was updated on 2019/10/01.

dim(jared)

## [1] 375   9

colnames(jared)

## [1] "polla_qid"   "answer"      "votes"       "pollq_id"    "question"   
## [6] "place"       "time"        "total_votes" "percent"

class(jared)

## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

view(jared)
colSums(is.na(jared))

##   polla_qid      answer       votes    pollq_id    question       place 
##           0           0           0           0           0           0 
##        time total_votes     percent 
##           0           0           5

The output above tells us what we are dealing with: the original dataset has 375 rows and 9 variables, and only the percent column contains missing values.

variable	class	description
polla_qid	integer	Quiz ID
answer	character	Answer (likert scale)
votes	integer	Number of votes for that question/answer combo
pollq_id	integer	Poll Question ID
question	character	Question
place	character	Pizza Place
time	integer	Time of quiz
total_votes	integer	Total number of votes for that pizza place
percent	double	Vote percent of total for that pizza place

The final command, colSums(is.na(jared)), shows how many missing values each column has. There are only five, all in percent. On closer inspection, they belong to a single pizza place that did not receive any votes. To clean the data, we simply deleted these rows, keeping the emphasis of this dataset on the ratings.
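The deletion step itself is not shown in this report. A minimal sketch of one way to do it, assuming the rows to drop are exactly those with a missing percent. The toy data frame and the name jared_clean are made up for illustration and are not part of the original analysis:

```r
library(dplyr)

# Toy stand-in for the jared data: place "C" received no votes,
# so its percent is missing.
jared_toy <- data.frame(
  place       = c("A", "A", "B", "B", "C", "C"),
  answer      = c("Good", "Poor", "Good", "Poor", "Good", "Poor"),
  votes       = c(3L, 1L, 2L, 2L, 0L, 0L),
  total_votes = c(4L, 4L, 4L, 4L, 0L, 0L),
  percent     = c(0.75, 0.25, 0.5, 0.5, NA, NA),
  stringsAsFactors = FALSE
)

# Keep only the rows where percent is present (hypothetical name jared_clean)
jared_clean <- jared_toy %>% filter(!is.na(percent))

# Confirm that no missing values remain
colSums(is.na(jared_clean))
```

Filtering on the missing column rather than on row positions keeps the step valid if the source file is updated and the rows shift.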

Following these changes, we wanted each place to have only one row instead of five. The lines below make that change quickly.

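#spreading the data set so there is one row per pizza place
jared_2 <- jared[, -c(4, 5, 7:9)]             # keep polla_qid, answer, votes, place
jared_3 <- jared_2 %>% group_by(place)
jared_4 <- jared_3[, c(4, 1, 2, 3)]           # move place to the first column
jared_5 <- jared_4 %>% spread(answer, votes)  # one column of votes per answer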
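In current versions of tidyr, spread() is superseded by pivot_wider(). As a sketch only, the same long-to-wide reshape could be written as follows. The toy data frame stands in for the jared columns used above, and jared_long and jared_wide are illustrative names:

```r
library(tidyr)

# Toy long-format data: one row per place/answer combination
jared_long <- data.frame(
  place  = c("A", "A", "B", "B"),
  answer = c("Good", "Poor", "Good", "Poor"),
  votes  = c(3L, 1L, 2L, 2L),
  stringsAsFactors = FALSE
)

# One row per place, one column of votes per answer
jared_wide <- pivot_wider(jared_long, names_from = answer, values_from = votes)
jared_wide
```

Any place/answer combination that is absent from the long data becomes NA in the wide table, which is worth keeping in mind when checking for missing values later.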

Finally, we have a clean data set that has the voting breakdown for each place.

DT::datatable(jared_5)

Barstool Data Set

The Barstool Sports dataset has critic, public, and Barstool staff ratings as well as pricing, location, and geo-location. It was collected about seven months ago to pull, map, and analyze the data behind Barstool’s One Bite pizza application. As with the previous dataset, we started with a quick look at its structure.

dim(barstool)

## [1] 463  22

colnames(barstool)

##  [1] "name"                                
##  [2] "address1"                            
##  [3] "city"                                
##  [4] "zip"                                 
##  [5] "country"                             
##  [6] "latitude"                            
##  [7] "longitude"                           
##  [8] "price_level"                         
##  [9] "provider_rating"                     
## [10] "provider_review_count"               
## [11] "review_stats_all_average_score"      
## [12] "review_stats_all_count"              
## [13] "review_stats_all_total_score"        
## [14] "review_stats_community_average_score"
## [15] "review_stats_community_count"        
## [16] "review_stats_community_total_score"  
## [17] "review_stats_critic_average_score"   
## [18] "review_stats_critic_count"           
## [19] "review_stats_critic_total_score"     
## [20] "review_stats_dave_average_score"     
## [21] "review_stats_dave_count"             
## [22] "review_stats_dave_total_score"

class(barstool)

## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

view(barstool)
colSums(is.na(barstool))

##                                 name                             address1 
##                                    0                                    0 
##                                 city                                  zip 
##                                    0                                    0 
##                              country                             latitude 
##                                    0                                    2 
##                            longitude                          price_level 
##                                    2                                    0 
##                      provider_rating                provider_review_count 
##                                    0                                    0 
##       review_stats_all_average_score               review_stats_all_count 
##                                    0                                    0 
##         review_stats_all_total_score review_stats_community_average_score 
##                                    0                                    0 
##         review_stats_community_count   review_stats_community_total_score 
##                                    0                                    0 
##    review_stats_critic_average_score            review_stats_critic_count 
##                                    0                                    0 
##      review_stats_critic_total_score      review_stats_dave_average_score 
##                                    0                                    0 
##              review_stats_dave_count        review_stats_dave_total_score 
##                                    0                                    0

This dataset came through much like the last one. It has 463 rows and 22 variables, and the only missing values are two latitudes and two longitudes.

variable	class	description
name	character	Pizza place name
address1	character	Pizza place address
city	character	City
zip	double	Zip
country	character	Country
latitude	double	Latitude
longitude	double	Longitude
price_level	double	Price rating (fewer `$` = cheaper, more `$$$` = expensive)
provider_rating	double	Provider review score
provider_review_count	double	Provider review count
review_stats_all_average_score	double	Average Score
review_stats_all_count	double	Count of all reviews
review_stats_all_total_score	double	Review total score
review_stats_community_average_score	double	Community average score
review_stats_community_count	double	Community review count
review_stats_community_total_score	double	Community review total score
review_stats_critic_average_score	double	Critic average score
review_stats_critic_count	double	Critic review count
review_stats_critic_total_score	double	Critic total score
review_stats_dave_average_score	double	Dave (Barstool) average score
review_stats_dave_count	double	Dave review count
review_stats_dave_total_score	double	Dave total score

The data is quite clean, with the exception of a few missing values. While R has a package that can look up a longitude/latitude, we manually entered the coordinates for the only two places that were missing them.

#filling in the missing latitude (column 6) and longitude (column 7) for barstool
barstool[6, 6] <- 40.749272
barstool[6, 7] <- -73.995482
barstool[266, 6] <- 40.723062
barstool[266, 7] <- -73.996233
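The row numbers above were found by inspection. A sketch of how the affected rows could be located programmatically and then patched by name rather than by position is below. The toy data frame, the shop names, and the filled-in coordinates are all made up for illustration:

```r
# Toy stand-in for barstool with one place missing its coordinates
barstool_toy <- data.frame(
  name      = c("Shop 1", "Shop 2", "Shop 3"),
  latitude  = c(40.75, NA, 40.72),
  longitude = c(-73.99, NA, -73.99),
  stringsAsFactors = FALSE
)

# Row positions where either coordinate is missing
missing_rows <- which(is.na(barstool_toy$latitude) | is.na(barstool_toy$longitude))
barstool_toy$name[missing_rows]

# Fill in the looked-up coordinates by matching on name (values are placeholders)
fix <- barstool_toy$name == "Shop 2"
barstool_toy$latitude[fix]  <- 40.70
barstool_toy$longitude[fix] <- -74.00
```

Matching on name instead of a fixed row index means the fix still targets the right place if the rows are reordered or the file is refreshed.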

After running this, the dataset is clean and complete. It is shown below.

DT::datatable(barstool)

Datafiniti Data Set

The Datafiniti dataset is a list of over 3,500 pizzas from multiple restaurants, provided by Datafiniti’s Business Database. It includes the category, name, address, city, state, menu information, price range, and more for each pizza restaurant. It was last updated about six months ago.

This final dataset was already clean, so for our purposes we did not have much to add to it. We ran the same initial checks.

dim(datafiniti)

## [1] 10000    10

colnames(datafiniti)

##  [1] "name"            "address"         "city"           
##  [4] "country"         "province"        "latitude"       
##  [7] "longitude"       "categories"      "price_range_min"
## [10] "price_range_max"

class(datafiniti)

## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

view(datafiniti)
colSums(is.na(datafiniti))

##            name         address            city         country 
##               0               0               0               0 
##        province        latitude       longitude      categories 
##               0               0               0               0 
## price_range_min price_range_max 
##               0               0

There are 10 variables in this dataset.

variable	class	description
name	character	Pizza place
address	character	Address
city	character	City
country	character	Country
province	character	State
latitude	double	Latitude
longitude	double	Longitude
categories	character	Restaurant category
price_range_min	double	Price range min
price_range_max	double	Price range max

Here is the resulting data table.

DT::datatable(datafiniti)

Exploratory Data Analysis

We used the DataExplorer package as a fast and efficient way to do basic EDA. The bar chart of the Datafiniti data shows that New York is the dominant state (province) in the dataset.

plot_bar(datafiniti$province)
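The same comparison can be read off as numbers rather than from the chart. A minimal base R sketch is below. It uses a small made-up data frame, so the counts are illustrative only and are not results from the real Datafiniti data:

```r
# Toy stand-in for the province column of datafiniti
datafiniti_toy <- data.frame(
  province = c("NY", "NY", "CA", "NY", "TX", "CA"),
  stringsAsFactors = FALSE
)

# Number of rows per province, largest first
province_counts <- sort(table(datafiniti_toy$province), decreasing = TRUE)
head(province_counts, 10)

# Province with the most rows
top_province <- names(province_counts)[1]
top_province
```

Running the same two lines on datafiniti$province would give the exact counts behind the bar chart.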

To check for missing values before diving deeper into the analysis, we used the plot_missing function. The plot for the Jared dataset shows that most of the “Fair” values are missing, which suggests that this column should be removed.

plot_missing(jared_5)

Next, we used histograms to see the distribution range of the continuous variables. The histograms below show no obvious outliers. For example, all of the ratings and reviews lie within 0-5 or 0-10, and there are no unreasonable values of latitude, longitude, or price range. Thus, there is no need to cap outliers at this step. We will reconsider capping outliers after the three datasets are joined into one.

plot_histogram(datafiniti)

plot_histogram(barstool)

For the bivariate/multivariate analysis, we turn to correlation analysis.

plot_correlation(jared_5)

plot_correlation(datafiniti)

In order to achieve our goal of truly understanding where the best pizza in New York is, we are looking to make some changes to our datasets. Our initial data analysis gave us good evidence that the pizza restaurants in our data are concentrated in New York.

Next, we are going to join our three datasets into a single New York pizza dataset. This combined dataset will have the ratings and price ranges needed to complement all of the data analysis we are looking to do. We will learn how to join the datasets and how to create a map in R using latitude and longitude data.
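As a rough sketch of what the join and the map could look like, the example below joins two toy tables on the restaurant name and plots the result with leaflet. The data, the column excellent_votes, and the names pizza_ny and pizza_map are made up for illustration. It also assumes that names match exactly across sources, which may not hold for the real data:

```r
library(dplyr)

# Toy ratings table (stand-in for the reshaped jared data)
ratings_toy <- data.frame(
  place           = c("Joe's", "Luigi's", "Sal's"),
  excellent_votes = c(10L, 4L, 7L),
  stringsAsFactors = FALSE
)

# Toy location/price table (stand-in for barstool)
barstool_toy <- data.frame(
  name        = c("Joe's", "Sal's", "Nino's"),
  latitude    = c(40.7306, 40.7180, 40.7500),
  longitude   = c(-74.0022, -73.9970, -73.9900),
  price_level = c(1, 2, 1),
  stringsAsFactors = FALSE
)

# Keep only the places that appear in both tables
pizza_ny <- inner_join(ratings_toy, barstool_toy, by = c("place" = "name"))

# Draw the map only if leaflet is installed
if (requireNamespace("leaflet", quietly = TRUE)) {
  pizza_map <- leaflet::leaflet(pizza_ny) %>%
    leaflet::addTiles() %>%
    leaflet::addCircleMarkers(lng = ~longitude, lat = ~latitude, popup = ~place)
}
```

An inner join drops places that appear in only one source. A left join would keep every rated place and leave the coordinates missing where there is no match, which may be preferable if coverage matters more than completeness.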

With this dataset, we plan to use machine learning techniques to make our analysis more rigorous. Using regression models and cluster analysis, we will look into how ratings, pricing, and location interact in order to find the best pockets of pizza in the city.
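To make the plan concrete, here is a minimal sketch of both techniques on made-up data, using only base R. The columns, values, and the choice of two clusters are assumptions for illustration. Clustering on raw latitude and longitude also treats degrees as plain Euclidean distances, which is only a rough approximation:

```r
set.seed(42)  # kmeans uses random starting centers

# Toy joined dataset: two geographic groups of three places each
pizza_toy <- data.frame(
  latitude    = c(40.71, 40.72, 40.73, 40.80, 40.81, 40.82),
  longitude   = c(-74.00, -74.01, -73.99, -73.95, -73.96, -73.94),
  price_level = c(1, 2, 1, 2, 3, 2),
  score       = c(7.5, 8.1, 6.9, 7.2, 8.4, 7.8)
)

# Cluster analysis: group places by location
clusters <- kmeans(pizza_toy[, c("latitude", "longitude")], centers = 2, nstart = 10)
pizza_toy$cluster <- clusters$cluster

# Average score per cluster, to compare "pockets" of pizza
aggregate(score ~ cluster, data = pizza_toy, FUN = mean)

# Regression: how does the score vary with price level?
fit <- lm(score ~ price_level, data = pizza_toy)
coef(fit)
```

With real data, the number of clusters would need to be chosen deliberately, for example by comparing the within-cluster sum of squares across several values of centers.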

Final Project Forecast

Here are some topics we hope to explore further in the final project, particularly with our New York Pizza Map:

Which city is the real “Pizza Capital of the U.S.”, considering the number of restaurants, average price, and overall reviews?
Are there any relationships between location, price and review/ratings?
What will be the best pizza region in New York for
- A first-time foreign visitor
- Students from Columbia University holding a party after finals
- A Wall Street stock analyst looking for a quick lunch
- Jay-Z and Beyonce celebrating their anniversary