Introduction

In the analysis of these 3 datasets, we are addressing average rating, location, and pricing information of various pizza establishments around the US. The goal is to assist the average American consumer in making informed decisions about which pizza restaurants they would like to patronize.

Data Introduction

This analysis is based on 3 datasets with different information about pizza restaurants in the US:

  • Barstool - Contains critic, public and the barstool staff’s rating as well as pricing, location and geolocation of the different pizza places.

  • Jared - Contains ratings for New York pizza restaurants on a 6-point likert scale.

  • Datafiniti - Contains price range, location and keywords for pizza restaurants around the US.

  • Region - Contains state by region from the US Census Bureau

To learn more about the datasets used, click here or view a more detailed explanation in the Data Preparation section below.

Approach

Our approach was to divide the analysis to focus on two groups of consumers: 1) those who are price-conscious and 2) those who are interested in finding the pizza place near them with the highest rating.

Packages

library(tidyverse) # data manipulation and cleaning
library(knitr) # data dictionary tables
library(kableExtra) # table formatting
library(readxl) # importing dataset descriptions
library(leaflet) # interactive map & geolocation

Data Preparation

Background

The pizza data was originally compiled on September 30th, 2019 as a part of a social data project called TidyTuesday. The original intent is for R users to practice their creativity and analytical techniques, specifically with the tidyverse package.

For those who are interested, there’s a 5-minute podcast reviewing the top entries and different visualization techniques for Week 7’s project called “Pizza Party”.

We created the regions dataset based on the definitions from the US Census Bureau.

Data Dictionary

Jared Dataset
variable class description
polla_qid integer Quiz ID
answer character Answer (likert scale)
votes integer Number of votes for that question/answer combo
pollq_id integer Poll Question ID
question character Question
place character Pizza Place
time integer Time of quiz
total_votes integer Total number of votes for that pizza place
percent double Vote percent of total for that pizza place
Barstool Dataset
variable class description
name character Pizza place name
address1 character Pizza place address
city character City
zip double Zip
country character Country
latitude double Latitude
longitude double Longitude
price_level double Price rating (smaller = cheaper)
provider_rating double Provider review score
provider_review_count double Provider review count
review_stats_all_average_score double Average Score
review_stats_all_count double Count of all reviews
review_stats_all_total_score double Review total score
review_stats_community_average_score double Community average score
review_stats_community_count double community review count
review_stats_community_total_score double community review total score
review_stats_critic_average_score double Critic average score
review_stats_critic_count double Critic review count
review_stats_critic_total_score double Critic total score
review_stats_dave_average_score double Dave (Barstool) average score
review_stats_dave_count double Dave review count
review_stats_dave_total_score double Dave total score
Datafiniti Dataset
variable class description
polla_qid integer Quiz ID
answer character Answer (likert scale)
votes integer Number of votes for that question/answer combo
pollq_id integer Poll Question ID
question character Question
place character Pizza Place
time integer Time of quiz
total_votes integer Total number of votes for that pizza place
percent double Vote percent of total for that pizza place

NA Values

colSums(is.na(barstool))
##                                 name                             address1 
##                                    0                                    0 
##                                 city                                  zip 
##                                    0                                    0 
##                              country                             latitude 
##                                    0                                    2 
##                            longitude                          price_level 
##                                    2                                    0 
##                      provider_rating                provider_review_count 
##                                    0                                    0 
##       review_stats_all_average_score               review_stats_all_count 
##                                    0                                    0 
##         review_stats_all_total_score review_stats_community_average_score 
##                                    0                                    0 
##         review_stats_community_count   review_stats_community_total_score 
##                                    0                                    0 
##    review_stats_critic_average_score            review_stats_critic_count 
##                                    0                                    0 
##      review_stats_critic_total_score      review_stats_dave_average_score 
##                                    0                                    0 
##              review_stats_dave_count        review_stats_dave_total_score 
##                                    0                                    0
colSums(is.na(jared))
##   polla_qid      answer       votes    pollq_id    question       place 
##           0           0           0           0           0           0 
##        time total_votes     percent 
##           0           0           5
colSums(is.na(datafiniti))
##            name         address            city         country 
##               0               0               0               0 
##        province        latitude       longitude      categories 
##               0               0               0               0 
## price_range_min price_range_max 
##               0               0
colSums(is.na(region))
##      State State Code     Region   Division 
##          0          0          0          0

Jared & Barstool Cleaning

Likert Scale Conversion

The jared dataset pizza ratings are on a 1 to 6 likert scale. However, barstool is on a 1-10 scale. We used a mutate function to convert the 1-6 scale to a 1-10 scale:

jared <- jared %>%
  mutate(answer = case_when(
    .$answer=="Never Again" ~ 0, 
    .$answer=="Poor" ~ 2,
    .$answer=="Fair" ~ 4,
    .$answer=="Average" ~ 6,
    .$answer=="Good"~ 8,
    .$answer=="Excellent" ~ 10))

jared$answer <- as.integer(jared$answer)
str(jared$answer)
##  int [1:375] 10 8 6 2 0 10 8 6 2 0 ...

Once the answers are converted to integers, a calculation is done to find the average rating for each pizza place:

jared <- mutate(jared,Weighted_Rating = answer*votes)


(Jared_Average <- jared %>%
    group_by(place) %>%
    summarise(avg_score = sum(Weighted_Rating)/sum(votes)))
## # A tibble: 56 x 2
##    place                       avg_score
##    <chr>                           <dbl>
##  1 5 Boroughs Pizza                 7.33
##  2 Artichoke Basille's Pizza        8   
##  3 Arturo's                         7.43
##  4 Bella Napoli                     7.07
##  5 Ben's of SoHo 14th Street        4.8 
##  6 Ben's of SoHo Spring Street      6.44
##  7 Big Slice Pizza                  6.27
##  8 Bleecker Street Pizza            8.29
##  9 Bravo Pizza                    NaN   
## 10 Cavallo's Pizza                  7.27
## # ... with 46 more rows
(Jared_Mean <- mean(Jared_Average$avg_score, na.rm = T))
## [1] 6.802782

Two columns are dropped for being insignificant:

jared$polla_qid = NULL
jared$time = NULL
head(jared)

This calculation takes the mean of average scores omiting the 0 values

Critic_average <- mean(NA^(barstool$review_stats_critic_average_score == 0)*barstool$review_stats_critic_average_score, na.rm = TRUE)
Daves_average <- mean(NA^(barstool$review_stats_dave_average_score == 0)*barstool$review_stats_dave_average_score, na.rm = TRUE)
Community_average <- mean(NA^(barstool$review_stats_community_average_score == 0)*barstool$review_stats_community_average_score, na.rm = TRUE)

The three calculations below allow to compare to average scores found in the barstool dataset (Critic, Community and Barstool Staff (Dave)) with the scores in the jared dataset.

Pizza_Averages <- c("Dave (Barstool)","Community","Critic","Jared")
y <- c(6.622,7.0846,7.256,6.802)
barplot(y,names.arg = Pizza_Averages,xlab = "Average Review",ylab = "Rating",col = "blue", main = "Pizza Averages")

Represents the values that are in both the barstool and jared datasets

##  [1] "Williamsburg Pizza"        "Little Italy Pizza"       
##  [3] "Steve's Pizza"             "Girello"                  
##  [5] "5 Boroughs Pizza"          "Saluggi's"                
##  [7] "Artichoke Basille's Pizza" "Bleecker Street Pizza"    
##  [9] "Joe's Pizza"               "Champion Pizza"           
## [11] "Prince Street Pizza"       "Kiss My Slice"            
## [13] "Arturo's"                  "Vinny Vincenz"            
## [15] "Stella's Pizza"            "Gotham Pizza"             
## [17] "Highline Pizza"            "Pizza Italia"             
## [19] "Rivoli Pizza"              "NY Pizza Suprema"         
## [21] "Previti Pizza"             "Rocco's Pizza Joint"

Datafiniti Cleaning

Since there weren’t any NA values in the datafiniti dataset, we simply removed duplicate values and changed the name of “province”" to “State Code” to align with the column name in the region dataset.

# Remove duplicates

datafiniti_unique <- unique(datafiniti)
datafiniti_unique # 2,285 unique observations
## # A tibble: 2,285 x 10
##    name  address city  country province latitude longitude categories
##    <chr> <chr>   <chr> <chr>   <chr>       <dbl>     <dbl> <chr>     
##  1 Shot~ 4203 E~ Sher~ US      AR           34.8     -92.2 Pizza,Res~
##  2 Sauc~ 25 E C~ Phoe~ US      AZ           33.5    -112.  Pizza,Piz~
##  3 Mios~ 3703 P~ Cinc~ US      OH           39.1     -84.4 Restauran~
##  4 Hung~ 30495 ~ Madi~ US      MI           42.5     -83.1 Pizza,Car~
##  5 Spar~ 3600 E~ Balt~ US      MD           39.3     -76.6 Pizza,Ame~
##  6 La V~ 1834 E~ Berk~ US      CA           37.9    -122.  Pizza Pla~
##  7 Bric~ 4819 K~ Tall~ US      FL           30.5     -84.2 Pizza Pla~
##  8 Carr~ 4061 2~ Gran~ US      MI           42.9     -85.6 Restauran~
##  9 Dome~ 146 N ~ Glen~ US      CA           34.1    -118.  Pizza,Res~
## 10 Litt~ 965 N ~ El P~ US      TX           31.9    -107.  Restauran~
## # ... with 2,275 more rows, and 2 more variables: price_range_min <dbl>,
## #   price_range_max <dbl>
# rename province

names(datafiniti_unique)[names(datafiniti_unique) == 'province'] <- 'State Code'

Our next task was to make sense of the categories column, which originally contained values in one column:

head(datafiniti_unique$categories)
## [1] "Pizza,Restaurant,American restaurants,Pizza Place,Restaurants"                  
## [2] "Pizza,Pizza Place,Restaurants"                                                  
## [3] "Restaurant,Pizza Place,Restaurants"                                             
## [4] "Pizza,Carry-out food,Pizza Place,Restaurants"                                   
## [5] "Pizza,American restaurants,Pizza Place,Pizza equipment and supplies,Restaurants"
## [6] "Pizza Place"

We accomplished this by:

  1. Creating a unique id for each pizza place
##   loc_id                name
## 1      1  Shotgun Dans Pizza
## 2      2    Sauce Pizza Wine
## 3      3       Mios Pizzeria
## 4      4 Hungry Howies Pizza
## 5      5    Spartan Pizzeria
## 6      6             La Vals
  1. Separating the different categories (separated by a comma) into individual columns
# Separate categories

datafiniti_cat <- datafiniti_unique %>% separate(categories,
                                                  c("Cat1", "Cat2", "Cat3", "Cat4", "Cat5","Cat6", 
                                                  "Cat7", "Cat8", "Cat9", "Cat10", "Cat11"), sep = ",")
  1. Creating a new dataframe with only the unique category id and separated categories
##   loc_id        Cat1                 Cat2                 Cat3
## 1      1       Pizza           Restaurant American restaurants
## 2      2       Pizza          Pizza Place          Restaurants
## 3      3  Restaurant          Pizza Place          Restaurants
## 4      4       Pizza       Carry-out food          Pizza Place
## 5      5       Pizza American restaurants          Pizza Place
## 6      6 Pizza Place                 <NA>                 <NA>
##                           Cat4        Cat5 Cat6 Cat7 Cat8 Cat9 Cat10
## 1                  Pizza Place Restaurants <NA> <NA> <NA> <NA>  <NA>
## 2                         <NA>        <NA> <NA> <NA> <NA> <NA>  <NA>
## 3                         <NA>        <NA> <NA> <NA> <NA> <NA>  <NA>
## 4                  Restaurants        <NA> <NA> <NA> <NA> <NA>  <NA>
## 5 Pizza equipment and supplies Restaurants <NA> <NA> <NA> <NA>  <NA>
## 6                         <NA>        <NA> <NA> <NA> <NA> <NA>  <NA>
  1. Gathering the categories into one column and removing NA values
# gather categories

datafiniti_cat <- 
  datafiniti_cat %>% 
  gather(Cat, Category, 2:11) %>% 
  arrange(loc_id)

# remove NA categories

datafiniti_cat <- datafiniti_cat[!is.na(datafiniti_cat$Category), ]

# remove Cat column

datafiniti_cat <- datafiniti_cat[ , c(1,3)]

head(datafiniti_cat)
##    loc_id             Category
## 1       1                Pizza
## 2       1           Restaurant
## 3       1 American restaurants
## 4       1          Pizza Place
## 5       1          Restaurants
## 11      2                Pizza
  1. Shrinking the list into unique values
# unique categories w\ count

datafiniti_cat %>% 
  group_by(Category) %>% 
  summarize(count = n()) %>% 
  arrange(desc(count))
## # A tibble: 308 x 2
##    Category                           count
##    <chr>                              <int>
##  1 Pizza Place                         2286
##  2 Restaurant                          1104
##  3 Pizza                                655
##  4 Restaurants                          650
##  5 Italian Restaurant                   225
##  6 Italian                               71
##  7 Caterers                              49
##  8 Italian Restaurants                   47
##  9 American restaurants                  45
## 10 Italian Restaurant and Pizza Place    43
## # ... with 298 more rows
  1. Cleaning and consolidating the categories (#### unique)
# Clean \ combine categories

datafiniti_cat$Category[datafiniti_cat$Category == "Restaurants"] <- "Restaurant"
datafiniti_cat$Category[datafiniti_cat$Category == "Italian Restaurants"] <- "Italian Restaurant"
datafiniti_cat$Category[datafiniti_cat$Category == "Italian"] <- "Italian Restaurant"
datafiniti_cat$Category[datafiniti_cat$Category == "Italian restaurants"] <- "Italian Restaurant"
datafiniti_cat$Category[datafiniti_cat$Category == " and Italian restaurant"] <- "Italian Restaurant"
datafiniti_cat$Category[datafiniti_cat$Category == "Pizza"] <- "Pizza Place"
datafiniti_cat$Category[datafiniti_cat$Category == "Pizza Restaurants"] <- "Pizza Place"
datafiniti_cat$Category[datafiniti_cat$Category == "Catering"] <- "Caterers"
datafiniti_cat$Category[datafiniti_cat$Category == "American restaurants"] <- "American Restaurant"
datafiniti_cat$Category[datafiniti_cat$Category == "American Restaurants"] <- "American Restaurant"
datafiniti_cat$Category[datafiniti_cat$Category == "wich Place"] <- "Sandwich Place"
datafiniti_cat$Category[datafiniti_cat$Category == " Sandwich Place"] <- "Sandwich Place"
datafiniti_cat$Category[datafiniti_cat$Category == "American restaurants"] <- "American Restaurant"
datafiniti_cat$Category[!str_detect(datafiniti_cat$Category, "^Pizza Place and") &
                          str_detect(datafiniti_cat$Category, "^Pizza Place ")] <- "Pizza Place"
datafiniti_cat$Category[datafiniti_cat$Category == "Pizza Place and Rock Club Charles North"] <- "Pizza Place and Rock Club"
datafiniti_cat$Category[datafiniti_cat$Category == "Bars"] <- "Bar"
datafiniti_cat$Category[datafiniti_cat$Category == " Bar"] <- "Bar"
datafiniti_cat$Category[datafiniti_cat$Category == " and Bar"] <- "Bar"
datafiniti_cat$Category[datafiniti_cat$Category == " and Pizza Place"] <- "Pizza Place"
datafiniti_cat$Category[datafiniti_cat$Category == " and Diner"] <- "Diner"
datafiniti_cat$Category[datafiniti_cat$Category == "Ne\\\\pizza Place"] <- "Pizza Place"
datafiniti_cat$Category[datafiniti_cat$Category == "Fl\\\\pizza Place"] <- "Pizza Place"
datafiniti_cat$Category[datafiniti_cat$Category == "Fl\\\\pizza Place"] <- "Pizza Place"

#rename category

names(datafiniti_cat)[names(datafiniti_cat) == 'Category'] <- 'new_cat'

datafiniti_cat %>% 
  group_by(new_cat) %>% 
  summarize(count = n()) %>% 
  arrange(desc(count))
## # A tibble: 158 x 2
##    new_cat                            count
##    <chr>                              <int>
##  1 Pizza Place                         3121
##  2 Restaurant                          1754
##  3 Italian Restaurant                   376
##  4 Caterers                              75
##  5 American Restaurant                   73
##  6 Italian Restaurant and Pizza Place    43
##  7 Bar                                   41
##  8 Pizza Place and Italian Restaurant    31
##  9 Take Out Restaurants                  29
## 10 Sandwich Place                        25
## # ... with 148 more rows
  1. Merging with the original dataset to end up with a row for each pizza place & category for analysis by category
# merge w\ new categories & delete original categories

datafiniti_new <- merge(datafiniti_unique, datafiniti_cat, by = 'loc_id') 
datafiniti_new <- datafiniti_new %>% 
  select(-categories)

head(datafiniti_new)
##   loc_id               name           address     city country State Code
## 1      1 Shotgun Dans Pizza  4203 E Kiehl Ave Sherwood      US         AR
## 2      1 Shotgun Dans Pizza  4203 E Kiehl Ave Sherwood      US         AR
## 3      1 Shotgun Dans Pizza  4203 E Kiehl Ave Sherwood      US         AR
## 4      1 Shotgun Dans Pizza  4203 E Kiehl Ave Sherwood      US         AR
## 5      1 Shotgun Dans Pizza  4203 E Kiehl Ave Sherwood      US         AR
## 6      2   Sauce Pizza Wine 25 E Camelback Rd  Phoenix      US         AZ
##   latitude longitude price_range_min price_range_max             new_cat
## 1 34.83230  -92.1838               0              25         Pizza Place
## 2 34.83230  -92.1838               0              25          Restaurant
## 3 34.83230  -92.1838               0              25 American Restaurant
## 4 34.83230  -92.1838               0              25         Pizza Place
## 5 34.83230  -92.1838               0              25          Restaurant
## 6 33.50927 -112.0730               0              25         Pizza Place

Proposed EDA

  • For the datafiniti data set, we’d like to look at

  • Histograms / Boxplots of the prices

  • Look at prices by region and state

  • Display results on a map (potentially including barstool)

  • Average prices by category (potentially including barstool)

  • For the barstool and jared data sets we’d like to look at

  • Descriptive statistics for ratings

  • Rating by location / state

  • Rating by category

  • Compare ratings by reviewer