Introduction

The topic we chose is pizza ratings data. The objective of this project is to find excellent pizza based on location, reviews, and price, and ultimately to build a pizza map focused on New York, which is known as the “Pizza Capital of the U.S.”

We used data from three sources: Jared, Barstool, and Datafiniti. Jared’s data covers top NY pizza restaurants, with ratings collected through a 6-point Likert scale survey. The Barstool Sports dataset has critic, public, and Barstool staff ratings as well as pricing, location, and geolocation. There are 22 pizza places that overlap between these two datasets. Datafiniti includes 10,000 pizza place records with their price ranges and geolocations. (More description on GitHub.) The problem at hand: how do we home in on the best pizza across these complex datasets? Let’s begin!

In this midterm report and the final report that follows, the approaches and analytic techniques we used or plan to use include:

Packages to Download

For this project, most of the packages we used come from class. They cover reading, tidying, and manipulating the pizza data. The main additional packages are leaflet, maps, and DataExplorer. The first two work hand in hand to produce the map visualization, and DataExplorer is used for exploratory data analysis (EDA).

Please install any of these packages that you do not already have.

library(readr) ## Reading in data
library(tidyverse) ## Tidying the data
library(dplyr) ## Manipulating data
library(DT) ## Outputting data in a clean format
library(magrittr) ## Pipe operators
library(leaflet) ## Map visualizations
library(maps) ## Placing maps under leaflet
library(DataExplorer) ## Exploratory data analysis
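
For readers who are unsure which of these packages are already installed, the following is a minimal sketch of a check. The helper name missing_packages is our own illustration and is not part of any package; the install.packages call is left commented out so that nothing is installed without a deliberate choice.

```r
# Packages loaded in this report
required_pkgs <- c("readr", "tidyverse", "dplyr", "DT",
                   "magrittr", "leaflet", "maps", "DataExplorer")

# Hypothetical helper: return the packages that are not yet installed
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(required_pkgs)
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install the missing packages
}
```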

Data Preparation

Data Introduction

Click here to download the original data source

jared <- read_csv("C:/Users/Zhuo Chen/Downloads/Data Wrangling with R/Project/pizza_jared.csv")
barstool <- read_csv("C:/Users/Zhuo Chen/Downloads/Data Wrangling with R/Project/pizza_barstool.csv")
datafiniti <- read_csv("C:/Users/Zhuo Chen/Downloads/Data Wrangling with R/Project/pizza_datafiniti.csv")

Jared Data Set

This dataset is from Jared Lander (Twitter), Chief Data Scientist of Lander Analytics, author of R for Everyone, and a famous pizza tour guide. He built this database from customer votes on pizza restaurants. The dataset was last updated on 2019/10/01.

dim(jared) 
## [1] 375   9
colnames(jared)
## [1] "polla_qid"   "answer"      "votes"       "pollq_id"    "question"   
## [6] "place"       "time"        "total_votes" "percent"
class(jared)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
view(jared)
colSums(is.na(jared))
##   polla_qid      answer       votes    pollq_id    question       place 
##           0           0           0           0           0           0 
##        time total_votes     percent 
##           0           0           5

The summary above tells us what we are dealing with: the original data set has 375 rows and 9 variables.

variable     class      description
polla_qid    integer    Quiz ID
answer       character  Answer (Likert scale)
votes        integer    Number of votes for that question/answer combo
pollq_id     integer    Poll question ID
question     character  Question
place        character  Pizza place
time         integer    Time of quiz
total_votes  integer    Total number of votes for that pizza place
percent      double     Vote percent of total for that pizza place

The final command, colSums(is.na(jared)), shows how many missing values each column has. There are only five missing values, all in percent. On closer inspection, they belong to one pizza place that did not receive any votes. To clean the data, we simply deleted these rows to keep the emphasis of this data set on the ratings.

Following these changes, we wanted each place to have a single row instead of five. The lines below make that change.
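
The deletion step itself is not shown in this report. A minimal sketch of one way to do it, run here on a small made-up table (jared_toy and its values are hypothetical stand-ins for jared), is to keep only the rows where percent is present:

```r
library(dplyr)

# Toy stand-in for the Jared data: "Place B" received no votes,
# so its percent is missing
jared_toy <- data.frame(
  place       = c("Place A", "Place A", "Place B", "Place B"),
  answer      = c("Excellent", "Good", "Excellent", "Good"),
  votes       = c(3L, 1L, 0L, 0L),
  total_votes = c(4L, 4L, 0L, 0L),
  percent     = c(0.75, 0.25, NA, NA),
  stringsAsFactors = FALSE
)

# Keep only the rows that have a vote percentage
jared_clean <- jared_toy %>% filter(!is.na(percent))

# Confirm that no missing values remain
colSums(is.na(jared_clean))
```

Filtering on is.na(percent) rather than on row numbers keeps the step valid if the source file gains or loses rows.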

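# Spread the data set so that each pizza place has one row
jared_2 <- jared[, -c(4, 5, 7:9)]            # keep polla_qid, answer, votes, place
jared_3 <- jared_2 %>% group_by(place)
jared_4 <- jared_3[, c(4, 1, 2, 3)]          # move place to the first column
jared_5 <- jared_4 %>% spread(answer, votes) # one column of votes per answer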

Finally, we have a clean data set that has the voting breakdown for each place.

DT::datatable(jared_5)
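
As an aside, the same reshaping can be written with column names instead of column positions. The sketch below uses select and tidyr's pivot_wider, the newer counterpart of spread, on a made-up table (jared_toy is hypothetical); it is an alternative formulation, not the code used to build jared_5.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the four columns kept from the Jared data
jared_toy <- tibble(
  polla_qid = c(1, 1, 1, 2, 2),
  answer    = c("Excellent", "Good", "Poor", "Excellent", "Good"),
  votes     = c(5, 3, 1, 2, 4),
  place     = c("Place A", "Place A", "Place A", "Place B", "Place B")
)

# One row per place, one column of vote counts per answer
jared_wide <- jared_toy %>%
  select(place, polla_qid, answer, votes) %>%
  pivot_wider(names_from = answer, values_from = votes)

jared_wide
```

Answers that a place never received (here "Poor" for Place B) come out as NA, which matches the missing values that show up later in the EDA.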

Barstool Data Set

The Barstool Sports dataset has critic, public, and Barstool staff ratings as well as pricing, location, and geolocation. It was collected about seven months ago to pull, map, and analyze the data behind Barstool’s One Bite pizza application. As with the previous data set, we start with a quick look at the data to understand its structure.

dim(barstool)
## [1] 463  22
colnames(barstool)
##  [1] "name"                                
##  [2] "address1"                            
##  [3] "city"                                
##  [4] "zip"                                 
##  [5] "country"                             
##  [6] "latitude"                            
##  [7] "longitude"                           
##  [8] "price_level"                         
##  [9] "provider_rating"                     
## [10] "provider_review_count"               
## [11] "review_stats_all_average_score"      
## [12] "review_stats_all_count"              
## [13] "review_stats_all_total_score"        
## [14] "review_stats_community_average_score"
## [15] "review_stats_community_count"        
## [16] "review_stats_community_total_score"  
## [17] "review_stats_critic_average_score"   
## [18] "review_stats_critic_count"           
## [19] "review_stats_critic_total_score"     
## [20] "review_stats_dave_average_score"     
## [21] "review_stats_dave_count"             
## [22] "review_stats_dave_total_score"
class(barstool)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
view(barstool)
colSums(is.na(barstool))
##                                 name                             address1 
##                                    0                                    0 
##                                 city                                  zip 
##                                    0                                    0 
##                              country                             latitude 
##                                    0                                    2 
##                            longitude                          price_level 
##                                    2                                    0 
##                      provider_rating                provider_review_count 
##                                    0                                    0 
##       review_stats_all_average_score               review_stats_all_count 
##                                    0                                    0 
##         review_stats_all_total_score review_stats_community_average_score 
##                                    0                                    0 
##         review_stats_community_count   review_stats_community_total_score 
##                                    0                                    0 
##    review_stats_critic_average_score            review_stats_critic_count 
##                                    0                                    0 
##      review_stats_critic_total_score      review_stats_dave_average_score 
##                                    0                                    0 
##              review_stats_dave_count        review_stats_dave_total_score 
##                                    0                                    0

This data set is structured much like the last one. It has 463 rows and 22 variables.

variable                              class      description
name                                  character  Pizza place name
address1                              character  Pizza place address
city                                  character  City
zip                                   double     Zip
country                               character  Country
latitude                              double     Latitude
longitude                             double     Longitude
price_level                           double     Price rating (fewer $ = cheaper, more $$$ = expensive)
provider_rating                       double     Provider review score
provider_review_count                 double     Provider review count
review_stats_all_average_score        double     Average score
review_stats_all_count                double     Count of all reviews
review_stats_all_total_score          double     Review total score
review_stats_community_average_score  double     Community average score
review_stats_community_count          double     Community review count
review_stats_community_total_score    double     Community review total score
review_stats_critic_average_score     double     Critic average score
review_stats_critic_count             double     Critic review count
review_stats_critic_total_score       double     Critic total score
review_stats_dave_average_score       double     Dave (Barstool) average score
review_stats_dave_count               double     Dave review count
review_stats_dave_total_score         double     Dave total score

The data is fairly clean, with the exception of a few missing values: latitude and longitude are each missing in two rows. While R has a package that can generate latitude/longitude values, we manually entered the coordinates for the two places that lacked them.

# Fill in the missing latitude (column 6) and longitude (column 7) for Barstool
barstool[6, 6] <- 40.749272
barstool[6, 7] <- -73.995482
barstool[266, 6] <- 40.723062
barstool[266, 7] <- -73.996233
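
Hard-coded row numbers stop working if the file is re-sorted or updated. A sketch of a more defensive variant, assuming the missing places can be identified by name, is shown below on made-up data; barstool_toy, manual_coords, and the place names are hypothetical.

```r
library(dplyr)

# Toy stand-in for the Barstool data with one place missing coordinates
barstool_toy <- tibble(
  name      = c("Pizza One", "Pizza Two", "Pizza Three"),
  latitude  = c(40.70, NA, 40.72),
  longitude = c(-74.00, NA, -73.99)
)

# Hypothetical lookup table of manually researched coordinates, keyed by name
manual_coords <- tibble(
  name      = "Pizza Two",
  latitude  = 40.749272,
  longitude = -73.995482
)

# Join the lookup table and use its values only where the original is missing
barstool_filled <- barstool_toy %>%
  left_join(manual_coords, by = "name", suffix = c("", "_manual")) %>%
  mutate(
    latitude  = coalesce(latitude, latitude_manual),
    longitude = coalesce(longitude, longitude_manual)
  ) %>%
  select(-latitude_manual, -longitude_manual)

colSums(is.na(barstool_filled))
```

Because coalesce only falls back to the manual value when the original is NA, coordinates that were already present are left untouched.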

After running this, we have a clean and complete dataset, shown below.

DT::datatable(barstool)

Datafiniti Data Set

The Datafiniti dataset is a list of over 3,500 pizzas from multiple restaurants, provided by Datafiniti’s Business Database. It includes the category, name, address, city, state, menu information, price range, and more for each pizza restaurant. The dataset was last updated about six months ago.

This final dataset was already clean, so for our purposes we did not have much to add to it. We ran the same initial checks to understand the data.

dim(datafiniti)
## [1] 10000    10
colnames(datafiniti)
##  [1] "name"            "address"         "city"           
##  [4] "country"         "province"        "latitude"       
##  [7] "longitude"       "categories"      "price_range_min"
## [10] "price_range_max"
class(datafiniti)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
view(datafiniti)
colSums(is.na(datafiniti))
##            name         address            city         country 
##               0               0               0               0 
##        province        latitude       longitude      categories 
##               0               0               0               0 
## price_range_min price_range_max 
##               0               0

There are 10 variables in this dataset.

variable         class      description
name             character  Pizza place
address          character  Address
city             character  City
country          character  Country
province         character  State
latitude         double     Latitude
longitude        double     Longitude
categories       character  Restaurant category
price_range_min  double     Price range min
price_range_max  double     Price range max

With the review complete, here is the data table.

DT::datatable(datafiniti)

Exploratory Data Analysis

We used the DataExplorer package as a fast and efficient way to do basic EDA. The bar chart of the Datafiniti data shows that New York is the dominant state (province) in the dataset.

plot_bar(datafiniti$province)
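
The bar chart can be cross-checked with a frequency table. The sketch below counts rows per province with dplyr on a made-up vector of state codes (datafiniti_toy is hypothetical and does not reflect the real counts).

```r
library(dplyr)

# Toy stand-in for the province column of the Datafiniti data
datafiniti_toy <- tibble(
  province = c("NY", "NY", "CA", "NY", "TX", "CA")
)

# Number of rows per province, most frequent first
province_counts <- datafiniti_toy %>%
  count(province, sort = TRUE)

province_counts
```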

To check for missing values before diving deeper into the analysis, we use the plot_missing function. The plot for the Jared dataset shows that most values in the “Fair” column are missing, so this column is a candidate for removal.

plot_missing(jared_5)
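
The same information can be computed directly as the share of missing values per column. The sketch below does this in base R on a made-up table and drops columns above a cutoff; jared_toy and the 50% threshold are assumptions for illustration, not values taken from our analysis.

```r
# Toy stand-in for the wide Jared data, with "Fair" mostly missing
jared_toy <- data.frame(
  place     = c("A", "B", "C", "D"),
  Excellent = c(5, 2, 7, 1),
  Fair      = c(NA, NA, NA, 3),
  Good      = c(3, 4, NA, 2)
)

# Share of missing values in each column
missing_share <- colMeans(is.na(jared_toy))

# Keep the columns with at most 50% missing values (assumed cutoff)
keep <- names(missing_share)[missing_share <= 0.5]
jared_trimmed <- jared_toy[, keep, drop = FALSE]

names(jared_trimmed)
```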

Next, we use histograms to see the distribution of the continuous variables. The histograms below show no obvious outliers. For example, all of the ratings and reviews lie within 0-5 or 0-10, and there are no unreasonable values of latitude, longitude, or price range. Thus, there is no need to cap outliers at this step. Capping outliers will be reconsidered after the three datasets are joined into one.

plot_histogram(datafiniti)

plot_histogram(barstool)
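
The visual check can be complemented with explicit range checks. The sketch below is an illustration on made-up values: the helper in_range is our own, and the assignment of the 0-5 scale to provider_rating and the 0-10 scale to the average score is an assumption.

```r
# Hypothetical helper: TRUE if every non-missing value lies in [lower, upper]
in_range <- function(x, lower, upper) {
  all(x >= lower & x <= upper, na.rm = TRUE)
}

# Toy stand-in for a few Barstool columns
barstool_toy <- data.frame(
  provider_rating                = c(3.5, 4.0, 4.5),
  review_stats_all_average_score = c(6.1, 7.4, 8.2),
  latitude                       = c(40.7493, 40.7231, 40.7300),
  longitude                      = c(-73.9955, -73.9962, -74.0000)
)

# One logical result per variable; any FALSE points to values worth inspecting
checks <- c(
  provider_rating = in_range(barstool_toy$provider_rating, 0, 5),
  average_score   = in_range(barstool_toy$review_stats_all_average_score, 0, 10),
  latitude        = in_range(barstool_toy$latitude, -90, 90),
  longitude       = in_range(barstool_toy$longitude, -180, 180)
)

checks
```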

For the bivariate/multivariate analysis, we turn to correlation analysis.

plot_correlation(jared_5)

plot_correlation(datafiniti)
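
To read exact coefficients rather than colors, a correlation matrix can be computed with base R's cor on the numeric columns. The sketch below uses made-up values (barstool_toy is hypothetical), so the resulting numbers say nothing about the real data.

```r
# Toy stand-in with one character column and three numeric columns
barstool_toy <- data.frame(
  name                           = c("A", "B", "C", "D", "E"),
  price_level                    = c(1, 1, 2, 2, 3),
  provider_rating                = c(3.5, 4.0, 4.0, 4.5, 4.5),
  review_stats_all_average_score = c(6.1, 7.0, 7.4, 8.2, 8.0)
)

# Keep only the numeric columns, then compute pairwise Pearson correlations
numeric_cols <- barstool_toy[sapply(barstool_toy, is.numeric)]
cor_matrix <- cor(numeric_cols, use = "pairwise.complete.obs")

round(cor_matrix, 2)
```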

To achieve our goal of truly understanding where the best pizza in New York is, we plan to make some changes to our data sets. Our initial data analysis confirmed that the pizza restaurants in our data are concentrated in New York.

Next, we are going to join our three data sets into one New York pizza data set. The joined data set will combine ratings and price ranges to complement all of the data analysis we are looking to do. Along the way, we will learn how to join datasets and how to create a map in R using latitude and longitude data.
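
As a preview of the planned join and map, here is a minimal sketch on made-up data. It assumes that the place column of the Jared data matches the name column of the Barstool data, which still has to be verified on the real files; jared_toy, barstool_toy, and all values are hypothetical.

```r
library(dplyr)
library(leaflet)

# Toy stand-ins for the cleaned Jared and Barstool data
jared_toy <- tibble(
  place     = c("Pizza One", "Pizza Two", "Pizza Four"),
  Excellent = c(5, 2, 7)
)
barstool_toy <- tibble(
  name        = c("Pizza One", "Pizza Two", "Pizza Three"),
  latitude    = c(40.7493, 40.7231, 40.7300),
  longitude   = c(-73.9955, -73.9962, -74.0000),
  price_level = c(1, 2, 1)
)

# Keep only the places that appear in both data sets
ny_pizza <- inner_join(jared_toy, barstool_toy, by = c("place" = "name"))

# Draw each joined place as a marker, with its name as the popup
pizza_map <- leaflet(ny_pizza) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~longitude, lat = ~latitude, popup = ~place)

pizza_map
```

An inner join keeps only the overlapping places, so every marker on the map has both a vote breakdown and coordinates. A left join would be the choice if places without ratings should stay on the map.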

With this data set, we plan to use some machine learning techniques to make our analysis more rigorous. Using regression models and cluster analysis, we will look into how ratings, pricing, and location interact in order to find the best pockets of pizza in the city.
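
To make the plan concrete, the sketch below fits a linear regression and a k-means clustering with base R on simulated data. Every value is generated with a fixed seed purely for illustration, and the choice of predictor and of two clusters is an assumption, not a result from our data.

```r
set.seed(42)

# Simulated stand-in for the joined data: two geographic pockets of 20 places each
pizza_toy <- data.frame(
  latitude    = c(rnorm(20, 40.72, 0.01), rnorm(20, 40.80, 0.01)),
  longitude   = c(rnorm(20, -74.00, 0.01), rnorm(20, -73.95, 0.01)),
  price_level = sample(1:3, 40, replace = TRUE)
)
pizza_toy$rating <- 6 + 0.5 * pizza_toy$price_level + rnorm(40, sd = 0.5)

# Regression: how does the rating vary with the price level?
fit <- lm(rating ~ price_level, data = pizza_toy)
coef(fit)

# Cluster analysis: group places by location into two pockets
clusters <- kmeans(pizza_toy[, c("latitude", "longitude")],
                   centers = 2, nstart = 10)
pizza_toy$cluster <- clusters$cluster

# Average rating within each geographic cluster
aggregate(rating ~ cluster, data = pizza_toy, FUN = mean)
```

On the real data, the number of clusters would need to be chosen with care, and latitude and longitude may need rescaling if other variables are added to the clustering.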

Final Project Forecast

Here are some topics we hope to explore further in the final project, especially based on our New York pizza map: