Our topic is pizza ratings data. The objective of this project is to find excellent pizza based on location, reviews, and price, and finally to build a pizza map focused on New York, which is known as the “Pizza Capital of the U.S.”
We used data from three sources: Jared, Barstool, and Datafiniti. Jared’s data comes from top NY pizza restaurants, with ratings collected through a 6-point Likert scale survey. The Barstool Sports dataset has critic, public, and Barstool staff ratings as well as pricing, location, and geolocation. There are 22 pizza places that overlap between these two datasets. Datafiniti includes 10000 pizza places with their price ranges and geolocations. (More description on GitHub) The problem is: how do we find the favorite pizza across these complex datasets? Let’s begin!
In this midterm report and the following final report, the approaches and analytic techniques we used or plan to use include:
For this project, most of the packages we used are from class. They cover reading, tidying, and manipulating the data from our pizza set. The main additional packages are leaflet, maps, and DataExplorer. The first two work hand in hand to give us the map visualization, and DataExplorer is used for EDA.
Please install any of these packages that you do not already have.
library(readr) ## Reading in data
library(tidyverse) ## Tidying the data
library(dplyr) ## Manipulating data
library(DT) ## Outputting data in a clean format
library(magrittr) ## Pipe operators
library(leaflet) ## Map visualizations
library(maps) ## Placing maps under leaflet
library(DataExplorer) ## Exploratory data analysis
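Before loading the packages, it can help to see which ones are missing. The following is a minimal sketch; `missing_packages` is a hypothetical helper of ours, not part of any package, and the install call is left commented out so nothing is installed without your say.

```r
# Packages loaded in this report.
required_pkgs <- c("readr", "tidyverse", "dplyr", "DT", "magrittr",
                   "leaflet", "maps", "DataExplorer")

# Hypothetical helper: returns the packages that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required_pkgs)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```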
Click here to download the original data source
jared <- read_csv("C:/Users/Zhuo Chen/Downloads/Data Wrangling with R/Project/pizza_jared.csv")
barstool <- read_csv("C:/Users/Zhuo Chen/Downloads/Data Wrangling with R/Project/pizza_barstool.csv")
datafiniti <- read_csv("C:/Users/Zhuo Chen/Downloads/Data Wrangling with R/Project/pizza_datafiniti.csv")
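The paths above point to one machine. A possible way to make the import portable is sketched below, assuming all three CSV files sit in one folder; `data_dir` and `read_pizza` are placeholder names we made up for illustration.

```r
library(readr)

# Assumption: the three CSV files live in this folder (placeholder value).
data_dir <- "data"

# Hypothetical helper: build the full path and read one file.
read_pizza <- function(file_name, dir = data_dir) {
  read_csv(file.path(dir, file_name))
}

# Usage, once data_dir points at the real folder:
# jared      <- read_pizza("pizza_jared.csv")
# barstool   <- read_pizza("pizza_barstool.csv")
# datafiniti <- read_pizza("pizza_datafiniti.csv")
```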
This dataset is from Jared Lander (Twitter), who is the Chief Data Scientist of Lander Analytics, the author of R for Everyone, as well as a famous Pizza Tour guide. He built this database based on customer votes of pizza restaurants. This dataset was updated on 2019/10/01.
dim(jared)
## [1] 375 9
colnames(jared)
## [1] "polla_qid" "answer" "votes" "pollq_id" "question"
## [6] "place" "time" "total_votes" "percent"
class(jared)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
view(jared)
colSums(is.na(jared))
## polla_qid answer votes pollq_id question place
## 0 0 0 0 0 0
## time total_votes percent
## 0 0 5
The summary above shows what we are dealing with. The original dataset has 9 variables.
variable | class | description |
---|---|---|
polla_qid | integer | Quiz ID |
answer | character | Answer (likert scale) |
votes | integer | Number of votes for that question/answer combo |
pollq_id | integer | Poll Question ID |
question | character | Question |
place | character | Pizza Place |
time | integer | Time of quiz |
total_votes | integer | Total number of votes for that pizza place |
percent | double | Vote percent of total for that pizza place |
The final command, colSums(is.na(jared)), shows how many missing values each column has. Only five values are missing, all in percent. On closer inspection, they belong to a single pizza place that did not receive any votes. To clean the data, we simply deleted these rows to keep the emphasis of this dataset on the ratings.
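One way to drop the rows with a missing percent is sketched here on a toy table. The data frame `jared_toy` and its values are made up for illustration; only the column names follow the Jared dataset.

```r
library(dplyr)

# Toy stand-in for `jared`; the values are invented.
jared_toy <- data.frame(
  place       = c("Place A", "Place A", "Place B", "Place B"),
  answer      = c("Good", "Excellent", "Good", "Excellent"),
  votes       = c(3L, 1L, 0L, 0L),
  total_votes = c(4L, 4L, 0L, 0L)
) %>%
  # A place with zero total votes has no defined percent.
  mutate(percent = ifelse(total_votes > 0, votes / total_votes, NA_real_))

# Keep only the rows where `percent` is present.
jared_voted <- jared_toy %>% filter(!is.na(percent))
print(jared_voted)
```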
Following these changes, we wanted each place to appear in one row instead of five. The lines below make that change.
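# Spread the data set so there is one row per pizza place
jared_2 <- jared[, -c(4, 5, 7:9)]             # drop pollq_id, question, time, total_votes, percent
jared_3 <- jared_2 %>% group_by(place)
jared_4 <- jared_3[, c(4, 1, 2, 3)]           # reorder: place, polla_qid, answer, votes
jared_5 <- jared_4 %>% spread(answer, votes)  # one column per answer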
Finally, we have a clean data set that has the voting breakdown for each place.
DT::datatable(jared_5)
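The same long-to-wide reshaping can also be written with pivot_wider from tidyr, which is the newer counterpart of spread. This is only a sketch on made-up data, not the code we ran for the report; `votes_long` and its values are invented.

```r
library(dplyr)
library(tidyr)

# Toy long table: one row per place/answer pair (values invented).
votes_long <- data.frame(
  place  = c("Place A", "Place A", "Place B", "Place B"),
  answer = c("Good", "Excellent", "Good", "Excellent"),
  votes  = c(3L, 1L, 2L, 5L)
)

# One row per place, one column per answer.
votes_wide <- votes_long %>%
  pivot_wider(names_from = answer, values_from = votes)
print(votes_wide)
```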
The Barstool Sports dataset has critic, public, and Barstool staff ratings as well as pricing, location, and geolocation. It was collected about 7 months ago to pull, map, and analyze the data behind Barstool’s One Bite pizza application. As with the previous dataset, we start with a quick look at its structure.
dim(barstool)
## [1] 463 22
colnames(barstool)
## [1] "name"
## [2] "address1"
## [3] "city"
## [4] "zip"
## [5] "country"
## [6] "latitude"
## [7] "longitude"
## [8] "price_level"
## [9] "provider_rating"
## [10] "provider_review_count"
## [11] "review_stats_all_average_score"
## [12] "review_stats_all_count"
## [13] "review_stats_all_total_score"
## [14] "review_stats_community_average_score"
## [15] "review_stats_community_count"
## [16] "review_stats_community_total_score"
## [17] "review_stats_critic_average_score"
## [18] "review_stats_critic_count"
## [19] "review_stats_critic_total_score"
## [20] "review_stats_dave_average_score"
## [21] "review_stats_dave_count"
## [22] "review_stats_dave_total_score"
class(barstool)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
view(barstool)
colSums(is.na(barstool))
## name address1
## 0 0
## city zip
## 0 0
## country latitude
## 0 2
## longitude price_level
## 2 0
## provider_rating provider_review_count
## 0 0
## review_stats_all_average_score review_stats_all_count
## 0 0
## review_stats_all_total_score review_stats_community_average_score
## 0 0
## review_stats_community_count review_stats_community_total_score
## 0 0
## review_stats_critic_average_score review_stats_critic_count
## 0 0
## review_stats_critic_total_score review_stats_dave_average_score
## 0 0
## review_stats_dave_count review_stats_dave_total_score
## 0 0
This dataset came through much like the last one. There are 22 variables.
variable | class | description |
---|---|---|
name | character | Pizza place name |
address1 | character | Pizza place address |
city | character | City |
zip | double | Zip |
country | character | Country |
latitude | double | Latitude |
longitude | double | Longitude |
price_level | double | Price rating (fewer $ = cheaper, more $$$ = expensive) |
provider_rating | double | Provider review score |
provider_review_count | double | Provider review count |
review_stats_all_average_score | double | Average Score |
review_stats_all_count | double | Count of all reviews |
review_stats_all_total_score | double | Review total score |
review_stats_community_average_score | double | Community average score |
review_stats_community_count | double | Community review count |
review_stats_community_total_score | double | Community review total score |
review_stats_critic_average_score | double | Critic average score |
review_stats_critic_count | double | Critic review count |
review_stats_critic_total_score | double | Critic total score |
review_stats_dave_average_score | double | Dave (Barstool) average score |
review_stats_dave_count | double | Dave review count |
review_stats_dave_total_score | double | Dave total score |
The data is fairly clean apart from a few missing values. While R has a package that can generate a long/lat, we entered the only two missing coordinate pairs by hand.
# Fill in the missing latitude/longitude for barstool (columns 6 and 7)
barstool[6, 6] <- 40.749272
barstool[6, 7] <- -73.995482
barstool[266, 6] <- 40.723062
barstool[266, 7] <- -73.996233
After running this, we have a clean and complete dataset, shown below.
DT::datatable(barstool)
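Hard-coded row numbers break if the row order ever changes. A possible alternative is sketched below: find the incomplete rows programmatically and fill them from a small lookup keyed by name. The `shops` and `fixes` tables and all of their values are invented, and the sketch assumes names are unique, which may not hold for the real data.

```r
library(dplyr)

# Toy stand-in for `barstool` (values invented).
shops <- data.frame(
  name      = c("Shop A", "Shop B", "Shop C"),
  latitude  = c(40.70, NA, 40.80),
  longitude = c(-74.00, NA, -73.90)
)

# Row positions with a missing coordinate.
missing_rows <- which(is.na(shops$latitude) | is.na(shops$longitude))
print(missing_rows)

# Hypothetical lookup of hand-collected coordinates, keyed by name.
fixes <- data.frame(name = "Shop B", lat_fix = 40.75, lng_fix = -73.95)

# Use the fix only where the original coordinate is missing.
shops_filled <- shops %>%
  left_join(fixes, by = "name") %>%
  mutate(latitude  = coalesce(latitude, lat_fix),
         longitude = coalesce(longitude, lng_fix)) %>%
  select(-lat_fix, -lng_fix)
```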
The Datafiniti dataset is a list of over 3,500 pizzas from multiple restaurants, provided by Datafiniti’s Business Database. The dataset includes the category, name, address, city, state, menu information, price range, and more for each pizza restaurant. It was updated about 6 months ago.
This final dataset was already clean, so we did not have much to add to it. We ran the same initial checks to understand the data.
dim(datafiniti)
## [1] 10000 10
colnames(datafiniti)
## [1] "name" "address" "city"
## [4] "country" "province" "latitude"
## [7] "longitude" "categories" "price_range_min"
## [10] "price_range_max"
class(datafiniti)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
view(datafiniti)
colSums(is.na(datafiniti))
## name address city country
## 0 0 0 0
## province latitude longitude categories
## 0 0 0 0
## price_range_min price_range_max
## 0 0
There are 10 variables in this dataset.
variable | class | description |
---|---|---|
name | character | Pizza place |
address | character | Address |
city | character | City |
country | character | Country |
province | character | State |
latitude | double | Latitude |
longitude | double | Longitude |
categories | character | Restaurant category |
price_range_min | double | Price range min |
price_range_max | double | Price range max |
After this review, here is the data table.
DT::datatable(datafiniti)
We used the DataExplorer package as a fast and efficient way to do basic EDA. The bar chart of the Datafiniti data shows that New York is the dominant state (province) in the dataset.
plot_bar(datafiniti$province)
To check for missing values before diving deeper into the analysis, we use the plot_missing function. The plot for the Jared dataset shows that most of the “Fair” values are missing, so the plot suggests removing that column.
plot_missing(jared_5)
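The share of missing values per column can also be checked numerically in base R. This is a small sketch on invented data; the 0.5 cut-off is our own assumption for illustration and not a rule taken from DataExplorer.

```r
# Toy wide table: one row per place, one column per answer (values invented).
wide_toy <- data.frame(
  place     = c("A", "B", "C", "D"),
  Excellent = c(1, 2, 0, 3),
  Fair      = c(NA, NA, NA, 1)
)

# Fraction of missing values in each column.
missing_share <- colMeans(is.na(wide_toy))
print(missing_share)

# Assumed threshold: drop columns that are more than half missing.
mostly_missing <- names(missing_share)[missing_share > 0.5]
wide_clean <- wide_toy[, !(names(wide_toy) %in% mostly_missing), drop = FALSE]
```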
Then we use histograms to see the distribution of the continuous variables. The histograms below show no obvious outliers. For example, all of the ratings and reviews lie in 0-5 or 0-10, and there are no unreasonable values of latitude, longitude, or price range. Thus, there is no need to cap outliers at this step. We will consider capping outliers again after the three datasets are joined into one.
plot_histogram(datafiniti)
plot_histogram(barstool)
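The visual check can be backed up with a numeric one. The sketch below, on invented scores, tests that values stay inside an expected range and flags points with the common 1.5 × IQR rule. The helper `iqr_outliers` is hypothetical, and the IQR rule is our choice for illustration rather than something the histograms apply.

```r
# Toy scores and price levels (values invented).
scores <- data.frame(
  score       = c(6.5, 7.1, 8.0, 7.4, 6.9, 7.8),
  price_level = c(1, 2, 1, 2, 1, 3)
)

# Minimum and maximum of every column.
print(sapply(scores, range))

# Are all scores inside the expected 0-10 scale?
in_range <- all(scores$score >= 0 & scores$score <= 10)

# Hypothetical helper: TRUE for values outside 1.5 * IQR from the quartiles.
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

n_flagged <- sum(iqr_outliers(scores$score))
```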
For the bivariate/multivariate analysis, we use correlation plots.
plot_correlation(jared_5)
plot_correlation(datafiniti)
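The numbers behind a correlation plot can be inspected directly with cor from base R. This is a sketch on invented ratings, restricted to numeric columns; it is not the exact computation that plot_correlation performs.

```r
# Toy ratings table (values invented).
ratings <- data.frame(
  name        = c("A", "B", "C", "D", "E"),
  critic      = c(6, 7, 8, 9, 10),
  community   = c(6.5, 7, 8.2, 8.8, 9.9),
  price_level = c(1, 1, 2, 2, 3)
)

# Keep the numeric columns only, then compute pairwise correlations.
numeric_cols <- ratings[sapply(ratings, is.numeric)]
cor_matrix <- cor(numeric_cols, use = "pairwise.complete.obs")
print(round(cor_matrix, 2))
```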
To achieve our goal of truly understanding where the best pizza in New York is, we are looking to make some changes to our datasets. Our initial data analysis gave us a lot of good information and showed that the pizza restaurants in our data are concentrated in New York.
Next, we are going to join our three datasets into one New York pizza dataset. It will combine ratings and price ranges to complement the rest of the analysis we are looking to do. We will learn how to join the datasets and how to create a map in R using latitude and longitude data.
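The planned join and map could look roughly like the sketch below. It assumes the tables can be matched on the restaurant name, which we have not verified for the real data; the `ratings` and `prices` tables and their values are invented. The map step only runs if leaflet is installed.

```r
library(dplyr)

# Toy ratings with coordinates (values invented).
ratings <- data.frame(
  name      = c("Shop A", "Shop B", "Shop C"),
  avg_score = c(8.1, 7.4, 6.9),
  latitude  = c(40.71, 40.73, 40.80),
  longitude = c(-74.00, -73.99, -73.95)
)

# Toy price ranges (values invented).
prices <- data.frame(
  name            = c("Shop A", "Shop C", "Shop D"),
  price_range_min = c(0, 25, 0),
  price_range_max = c(25, 40, 25)
)

# Keep only the places that appear in both tables.
ny_pizza <- inner_join(ratings, prices, by = "name")
print(ny_pizza)

# Draw one marker per place, with the name as a popup.
if (requireNamespace("leaflet", quietly = TRUE)) {
  pizza_map <- leaflet::leaflet(ny_pizza) %>%
    leaflet::addTiles() %>%
    leaflet::addCircleMarkers(lng = ~longitude, lat = ~latitude, popup = ~name)
}
```

An inner join drops every place that lacks a match, so a left join may be the better choice if we want to keep all rated places and accept missing prices.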
With this dataset, we plan to use some machine learning techniques to make our analysis more rigorous. Through regression models and cluster analysis, we will look at how ratings, pricing, and location interact in order to find the best pockets of pizza in the city.
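As a rough idea of what this could look like, the sketch below clusters places by location with k-means and fits a simple linear regression of score on price level, both from base R. The data are invented, the choice of two clusters is an assumption, and clustering on raw latitude and longitude only approximates real distance.

```r
set.seed(42)  # k-means starts from random centers

# Toy joined table (values invented).
pizza <- data.frame(
  latitude    = c(40.71, 40.72, 40.73, 40.80, 40.81, 40.82),
  longitude   = c(-74.00, -74.01, -73.99, -73.95, -73.96, -73.94),
  price_level = c(1, 2, 1, 2, 3, 2),
  score       = c(7.2, 7.8, 6.9, 8.1, 8.6, 8.0)
)

# Cluster analysis: group places by location (two clusters assumed).
clusters <- kmeans(pizza[, c("latitude", "longitude")], centers = 2, nstart = 10)
pizza$cluster <- factor(clusters$cluster)

# Regression: how does the score change with price level?
fit <- lm(score ~ price_level, data = pizza)
print(coef(fit))

# Average score per location cluster.
print(aggregate(score ~ cluster, data = pizza, FUN = mean))
```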
Here are some topics we hope to explore further in the final project, especially based on our New York pizza map: