Introduction

In the analysis of these 3 datasets, we are addressing average rating, location, and pricing information of various pizza establishments around the US. The goal is to assist the average American consumer in making informed decisions about which pizza restaurants they would like to patronize.

Data Introduction

This analysis is based on 3 datasets with different information about pizza restaurants in the US:

  • Barstool - Contains critic, public and the barstool staff’s rating as well as pricing, location and geolocation of the different pizza places.

  • Jared - Contains ratings for New York pizza restaurants on a 6-point likert scale.

  • Datafiniti - Contains price range, location and keywords for pizza restaurants around the US.

  • Region - Contains state by region from the US Census Bureau

To learn more about the datasets used, click here or view a more detailed explanation in the Data Preparation section below.

Approach

Our approach was to divide the analysis to focus on two groups of consumers: 1) those who are price-conscious and 2) those who are interested in finding the pizza place near them with the highest rating.

Packages

library(tidyverse) # data manipulation and cleaning
library(knitr) # data dictionary tables
library(kableExtra) # table formatting
library(readxl) # importing dataset descriptions
library(leaflet) # interactive map & geolocation

Data Dictionary

Background

The pizza data was originally compiled on September 30th, 2019 as a part of a social data project called TidyTuesday. The original intent is for R users to practice their creativity and analytical techniques, specifically with the tidyverse package.

For those who are interested, there’s a 5-minute podcast reviewing the top entries and different visualization techniques for Week 7’s project called “Pizza Party”.

We created the regions dataset based on the definitions from the US Census Bureau.

Data Dictionary

Jared Dataset

variable class description
polla_qid integer Quiz ID
answer character Answer (likert scale)
votes integer Number of votes for that question/answer combo
pollq_id integer Poll Question ID
question character Question
place character Pizza Place
time integer Time of quiz
total_votes integer Total number of votes for that pizza place
percent double Vote percent of total for that pizza place

Barstool Dataset

variable class description
name character Pizza place name
address1 character Pizza place address
city character City
zip double Zip
country character Country
latitude double Latitude
longitude double Longitude
price_level double Price rating (smaller = cheaper)
provider_rating double Provider review score
provider_review_count double Provider review count
review_stats_all_average_score double Average Score
review_stats_all_count double Count of all reviews
review_stats_all_total_score double Review total score
review_stats_community_average_score double Community average score
review_stats_community_count double community review count
review_stats_community_total_score double community review total score
review_stats_critic_average_score double Critic average score
review_stats_critic_count double Critic review count
review_stats_critic_total_score double Critic total score
review_stats_dave_average_score double Dave (Barstool) average score
review_stats_dave_count double Dave review count
review_stats_dave_total_score double Dave total score

Datafiniti Dataset

variable class description
polla_qid integer Quiz ID
answer character Answer (likert scale)
votes integer Number of votes for that question/answer combo
pollq_id integer Poll Question ID
question character Question
place character Pizza Place
time integer Time of quiz
total_votes integer Total number of votes for that pizza place
percent double Vote percent of total for that pizza place

Data Preparation

NA Values

colSums(is.na(barstool))
##                                 name                             address1 
##                                    0                                    0 
##                                 city                                  zip 
##                                    0                                    0 
##                              country                             latitude 
##                                    0                                    2 
##                            longitude                          price_level 
##                                    2                                    0 
##                      provider_rating                provider_review_count 
##                                    0                                    0 
##       review_stats_all_average_score               review_stats_all_count 
##                                    0                                    0 
##         review_stats_all_total_score review_stats_community_average_score 
##                                    0                                    0 
##         review_stats_community_count   review_stats_community_total_score 
##                                    0                                    0 
##    review_stats_critic_average_score            review_stats_critic_count 
##                                    0                                    0 
##      review_stats_critic_total_score      review_stats_dave_average_score 
##                                    0                                    0 
##              review_stats_dave_count        review_stats_dave_total_score 
##                                    0                                    0
colSums(is.na(jared))
##   polla_qid      answer       votes    pollq_id    question       place 
##           0           0           0           0           0           0 
##        time total_votes     percent 
##           0           0           5
colSums(is.na(datafiniti))
##            name         address            city         country 
##               0               0               0               0 
##        province        latitude       longitude      categories 
##               0               0               0               0 
## price_range_min price_range_max 
##               0               0
colSums(is.na(region))
##      State State Code     Region   Division 
##          0          0          0          0

Jared & Barstool Cleaning

Likert Scale Conversion

The jared dataset pizza ratings are on a 1 to 6 likert scale. However, barstool is on a 1-10 scale. We used a mutate function to convert the 1-6 scale to a 1-10 scale:

jared <- jared %>%
  mutate(answer = case_when(
    .$answer=="Never Again" ~ 0, 
    .$answer=="Poor" ~ 2,
    .$answer=="Fair" ~ 4,
    .$answer=="Average" ~ 6,
    .$answer=="Good"~ 8,
    .$answer=="Excellent" ~ 10))

jared$answer <- as.integer(jared$answer)
str(jared$answer)
##  int [1:375] 10 8 6 2 0 10 8 6 2 0 ...

Once the answers are converted to integers, a calculation is done to find the average rating for each pizza place:

jared <- mutate(jared,Weighted_Rating = answer*votes)


(Jared_Average <- jared %>%
    group_by(place) %>%
    summarise(avg_score = sum(Weighted_Rating)/sum(votes)))
## # A tibble: 56 x 2
##    place                       avg_score
##    <chr>                           <dbl>
##  1 5 Boroughs Pizza                 7.33
##  2 Artichoke Basille's Pizza        8   
##  3 Arturo's                         7.43
##  4 Bella Napoli                     7.07
##  5 Ben's of SoHo 14th Street        4.8 
##  6 Ben's of SoHo Spring Street      6.44
##  7 Big Slice Pizza                  6.27
##  8 Bleecker Street Pizza            8.29
##  9 Bravo Pizza                    NaN   
## 10 Cavallo's Pizza                  7.27
## # ... with 46 more rows
(Jared_Mean <- mean(Jared_Average$avg_score, na.rm = T))
## [1] 6.802782

Two columns are dropped for being insignificant:

jared$polla_qid = NULL
jared$time = NULL
head(jared)

This calculation takes the mean of average scores omiting the 0 values

Critic_average <- mean(NA^(barstool$review_stats_critic_average_score == 0)*barstool$review_stats_critic_average_score, na.rm = TRUE)
Daves_average <- mean(NA^(barstool$review_stats_dave_average_score == 0)*barstool$review_stats_dave_average_score, na.rm = TRUE)
Community_average <- mean(NA^(barstool$review_stats_community_average_score == 0)*barstool$review_stats_community_average_score, na.rm = TRUE)

The three calculations below allow to compare to average scores found in the barstool dataset (Critic, Community and Barstool Staff (Dave)) with the scores in the jared dataset.

Represents the values that are in both the barstool and jared datasets

##  [1] "Williamsburg Pizza"        "Little Italy Pizza"       
##  [3] "Steve's Pizza"             "Girello"                  
##  [5] "5 Boroughs Pizza"          "Saluggi's"                
##  [7] "Artichoke Basille's Pizza" "Bleecker Street Pizza"    
##  [9] "Joe's Pizza"               "Champion Pizza"           
## [11] "Prince Street Pizza"       "Kiss My Slice"            
## [13] "Arturo's"                  "Vinny Vincenz"            
## [15] "Stella's Pizza"            "Gotham Pizza"             
## [17] "Highline Pizza"            "Pizza Italia"             
## [19] "Rivoli Pizza"              "NY Pizza Suprema"         
## [21] "Previti Pizza"             "Rocco's Pizza Joint"

Datafiniti Cleaning

Since there weren’t any NA values in the datafiniti dataset, we simply removed duplicate values and changed the name of “province”" to “State Code” to align with the column name in the region dataset.

# Remove duplicates

datafiniti_unique <- unique(datafiniti)
datafiniti_unique # 2,285 unique observations
## # A tibble: 2,285 x 10
##    name  address city  country province latitude longitude categories
##    <chr> <chr>   <chr> <chr>   <chr>       <dbl>     <dbl> <chr>     
##  1 Shot~ 4203 E~ Sher~ US      AR           34.8     -92.2 Pizza,Res~
##  2 Sauc~ 25 E C~ Phoe~ US      AZ           33.5    -112.  Pizza,Piz~
##  3 Mios~ 3703 P~ Cinc~ US      OH           39.1     -84.4 Restauran~
##  4 Hung~ 30495 ~ Madi~ US      MI           42.5     -83.1 Pizza,Car~
##  5 Spar~ 3600 E~ Balt~ US      MD           39.3     -76.6 Pizza,Ame~
##  6 La V~ 1834 E~ Berk~ US      CA           37.9    -122.  Pizza Pla~
##  7 Bric~ 4819 K~ Tall~ US      FL           30.5     -84.2 Pizza Pla~
##  8 Carr~ 4061 2~ Gran~ US      MI           42.9     -85.6 Restauran~
##  9 Dome~ 146 N ~ Glen~ US      CA           34.1    -118.  Pizza,Res~
## 10 Litt~ 965 N ~ El P~ US      TX           31.9    -107.  Restauran~
## # ... with 2,275 more rows, and 2 more variables: price_range_min <dbl>,
## #   price_range_max <dbl>
# rename province

names(datafiniti_unique)[names(datafiniti_unique) == 'province'] <- 'State Code'

Our next task was to make sense of the categories column, which originally contained values in one column:

head(datafiniti_unique$categories)
## [1] "Pizza,Restaurant,American restaurants,Pizza Place,Restaurants"                  
## [2] "Pizza,Pizza Place,Restaurants"                                                  
## [3] "Restaurant,Pizza Place,Restaurants"                                             
## [4] "Pizza,Carry-out food,Pizza Place,Restaurants"                                   
## [5] "Pizza,American restaurants,Pizza Place,Pizza equipment and supplies,Restaurants"
## [6] "Pizza Place"

We accomplished this by:

  1. Creating a unique id for each pizza place
##   loc_id                name
## 1      1  Shotgun Dans Pizza
## 2      2    Sauce Pizza Wine
## 3      3       Mios Pizzeria
## 4      4 Hungry Howies Pizza
## 5      5    Spartan Pizzeria
## 6      6             La Vals
  1. Separating the different categories (separated by a comma) into individual columns
# Separate categories

datafiniti_cat <- datafiniti_unique %>% separate(categories, c("Cat1", "Cat2", "Cat3", "Cat4","Cat5",
                                                               "Cat6", "Cat7", "Cat8", "Cat9", "Cat10", 
                                                               "Cat11"), sep = ",")
  1. Creating a new dataframe with only the unique category id and separated categories
##   loc_id        Cat1                 Cat2                 Cat3
## 1      1       Pizza           Restaurant American restaurants
## 2      2       Pizza          Pizza Place          Restaurants
## 3      3  Restaurant          Pizza Place          Restaurants
## 4      4       Pizza       Carry-out food          Pizza Place
## 5      5       Pizza American restaurants          Pizza Place
## 6      6 Pizza Place                 <NA>                 <NA>
##                           Cat4        Cat5 Cat6 Cat7 Cat8 Cat9 Cat10
## 1                  Pizza Place Restaurants <NA> <NA> <NA> <NA>  <NA>
## 2                         <NA>        <NA> <NA> <NA> <NA> <NA>  <NA>
## 3                         <NA>        <NA> <NA> <NA> <NA> <NA>  <NA>
## 4                  Restaurants        <NA> <NA> <NA> <NA> <NA>  <NA>
## 5 Pizza equipment and supplies Restaurants <NA> <NA> <NA> <NA>  <NA>
## 6                         <NA>        <NA> <NA> <NA> <NA> <NA>  <NA>
  1. Gathering the categories into one column and removing NA values
##    loc_id             Category
## 1       1                Pizza
## 2       1           Restaurant
## 3       1 American restaurants
## 4       1          Pizza Place
## 5       1          Restaurants
## 11      2                Pizza
  1. Shrinking the list into unique values
# unique categories w/ count

datafiniti_cat %>% 
  group_by(Category) %>% 
  summarize(count = n()) %>% 
  arrange(desc(count))
## # A tibble: 308 x 2
##    Category                           count
##    <chr>                              <int>
##  1 Pizza Place                         2286
##  2 Restaurant                          1104
##  3 Pizza                                655
##  4 Restaurants                          650
##  5 Italian Restaurant                   225
##  6 Italian                               71
##  7 Caterers                              49
##  8 Italian Restaurants                   47
##  9 American restaurants                  45
## 10 Italian Restaurant and Pizza Place    43
## # ... with 298 more rows
  1. Repeating steps 2-5, separating this time on the word “and”. This allowed us to further separate categories such as “Italian Restaurant and Pizza Place” into their own values.
## # A tibble: 312 x 2
##    Category            count
##    <chr>               <int>
##  1 Pizza Place          2286
##  2 Restaurant           1104
##  3 Pizza                 655
##  4 Restaurants           650
##  5 Italian Restaurant    225
##  6 "Pizza Place "         81
##  7 Italian                71
##  8 " Pizza Place"         53
##  9 Caterers               49
## 10 Italian Restaurants    47
## # ... with 302 more rows
  1. Cleaning and consolidating into 10 main categories and an “Other” category containing descriptions with < 10 records each.
Clean_cat count
Pizza Place 3228
Restaurant 1821
Italian 462
Other 153
American 93
Bar / Pub 83
Caterer 75
Pizza Restaurant 46
Sandwich Place 37
Delivery / Carry-out 23
Karaoke 10
loc_id Category Clean_cat
1 Pizza Pizza Place
1 Restaurant Restaurant
1 American restaurants American
1 Pizza Place Pizza Place
1 Restaurants Restaurant
2 Pizza Pizza Place
2 Pizza Place Pizza Place
2 Restaurants Restaurant
3 Restaurant Restaurant
3 Pizza Place Pizza Place
  1. Merging with the original dataset to end up with a row for each pizza place & category for analysis by category
# merge w\ new categories & delete original categories

datafiniti_new <- merge(datafiniti_unique, datafiniti_cat, by = 'loc_id') 
datafiniti_new <- datafiniti_new %>% 
  select(-c(categories, Category))

datafiniti_new <- unique(datafiniti_new)

kable(head(select(datafiniti_new, c(loc_id, name, new_cat)), 10)) %>% 
   kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
loc_id name new_cat
1 1 Shotgun Dans Pizza Pizza Place
2 1 Shotgun Dans Pizza Restaurant
3 1 Shotgun Dans Pizza American
6 2 Sauce Pizza Wine Pizza Place
8 2 Sauce Pizza Wine Restaurant
9 3 Mios Pizzeria Restaurant
10 3 Mios Pizzeria Pizza Place
12 4 Hungry Howies Pizza Pizza Place
13 4 Hungry Howies Pizza Delivery / Carry-out
15 4 Hungry Howies Pizza Restaurant

Exploratory Analysis

First, we looked at a brief overview of prices by category:

datafiniti_new %>% 
  group_by(new_cat) %>% 
  summarize(Avg_Min_Price = mean(price_range_min), Avg_Max_Price = mean(price_range_max)) %>% 
  mutate(Range = Avg_Max_Price - Avg_Min_Price) %>% 
  arrange(Avg_Min_Price)
## # A tibble: 11 x 4
##    new_cat              Avg_Min_Price Avg_Max_Price Range
##    <chr>                        <dbl>         <dbl> <dbl>
##  1 Sandwich Place                2             26.2  24.2
##  2 Pizza Restaurant              2.52          26.6  24.1
##  3 Caterer                       4.3           27.6  23.3
##  4 Restaurant                    4.41          27.6  23.2
##  5 Delivery / Carry-out          4.55          27.7  23.2
##  6 Pizza Place                   4.67          27.8  23.1
##  7 Karaoke                       5             28    23  
##  8 Other                         6.16          28.6  22.4
##  9 American                      7.57          29.4  21.8
## 10 Italian                       9.74          30.9  21.2
## 11 Bar / Pub                    10.8           31.5  20.7

Next, we delved a little deeper into the pricing structure:

# Price Exploration -------------------------------------------------------

# min \ max summary

summary(datafiniti_unique$price_range_min) # 3rd quartile is still 0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00    4.67    0.00   50.00
summary(datafiniti_unique$price_range_max) # ranges from 7 to 55
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     7.0    25.0    25.0    27.8    25.0    55.0
# create spread variable

datafiniti_unique <- datafiniti_unique %>% 
  mutate(Range = price_range_max - price_range_min)

datafiniti_new <- datafiniti_new %>% 
  mutate(Range = price_range_max - price_range_min)

summary(datafiniti_unique$Range)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   25.00   25.00   23.13   25.00   34.00
summary(datafiniti_new$Range)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   25.00   25.00   22.95   25.00   34.00
# histogram for each

datafiniti_unique %>% 
  ggplot(aes(x = price_range_min)) +
  geom_histogram(binwidth = 2) +
  coord_cartesian(xlim = c(0,40))  

datafiniti_unique %>% 
  ggplot(aes(x = price_range_max)) +
  geom_histogram(binwidth = 2) +
  coord_cartesian(xlim = c(20,55))

datafiniti_unique %>% 
  ggplot(aes(x = Range)) +
  geom_histogram(binwidth = 2) +
  coord_cartesian(xlim = c(10,30))

# boxplot for each

datafiniti_new %>% 
  ggplot(aes(x = new_cat, y = price_range_min)) +
  geom_boxplot() +
  coord_flip()

datafiniti_new %>% 
  ggplot(aes(x = new_cat, y = price_range_max)) +
  geom_boxplot() +
  coord_flip()

datafiniti_new %>% 
  ggplot(aes(x = new_cat, y = Range)) +
  geom_boxplot() +
  coord_flip()

# Price by Region / State ------------------------------------------------------------------

# add Region to datafiniti_unique

datafiniti_unique <- 
  datafiniti_unique %>% 
  merge(region, by = "State Code") %>% 
  select(-(Division))

# Min / max by region

datafiniti_unique %>% 
  group_by(Region) %>% 
  summarize(Avg_min_price = mean(price_range_min),
            Avg_max_price = mean(price_range_max))
## # A tibble: 4 x 3
##   Region    Avg_min_price Avg_max_price
##   <chr>             <dbl>         <dbl>
## 1 Midwest            4.35          27.6
## 2 Northeast          3.53          27.0
## 3 South              5.14          28.1
## 4 West               5.75          28.5
# Min / max by state

datafiniti_unique %>% 
  group_by(State) %>% 
  summarize(Avg_max_price = mean(price_range_max)) %>% 
  arrange(desc(Avg_max_price))
## # A tibble: 44 x 2
##    State         Avg_max_price
##    <chr>                 <dbl>
##  1 Connecticut            40  
##  2 South Dakota           32.5
##  3 Iowa                   31  
##  4 Alabama                30  
##  5 Nebraska               30  
##  6 Arizona                29.9
##  7 West Virginia          29.3
##  8 Maryland               29.2
##  9 Louisiana              29.1
## 10 Oregon                 29.1
## # ... with 34 more rows
datafiniti_unique %>% 
  group_by(State) %>% 
  summarize(Avg_min_price = mean(price_range_min)) %>% 
  arrange(Avg_min_price)
## # A tibble: 44 x 2
##    State          Avg_min_price
##    <chr>                  <dbl>
##  1 Alaska                  0   
##  2 Delaware                0   
##  3 Hawaii                  0   
##  4 Montana                 0   
##  5 New Jersey              0   
##  6 North Dakota            0   
##  7 Nevada                  1.25
##  8 South Carolina          1.85
##  9 Arkansas                1.92
## 10 Kansas                  2.08
## # ... with 34 more rows

Barstool Restaurant Locations

Locations of the restaurants in the barstool dataset

m <- leaflet(barstool) %>% addTiles()
m %>% addCircles(lng = ~ barstool$longitude, lat = ~ barstool$latitude, popup = barstool$nameL, weight = 8, color = "#fb3004", stroke = TRUE)

This map shows the barstool restaurants which have been reviewed. The circles make it easy to portray a visual of where most reviews are from. For example, zooming in on New York will show hundreds of restaurants.

Datafiniti Point Plot

ggplot(data = datafiniti, aes(x=price_range_max,y=price_range_min)) + 
  geom_point(color = "blue", size=2,shape=17) 

The three calculations below allow to compare to average scores found in the barstool dataset (Critic, Community and Barstool Staff (Dave)) with the scores in the jared dataset.

Pizza_Averages <- c("Dave (Barstool)","Community","Critic","Jared")
y <- c(6.622,7.0846,7.256,6.802)
barplot(y,names.arg = Pizza_Averages,xlab = "Average Review",ylab = "Rating",col = "blue", main = "Pizza Averages")

Summary

Our goal was to educate consumers on how they want to spend their money in the pizza industry. We looked at ratings between the barstool and jared datasets, which concluded that the jared pizza average was higher than barstool.The barstool map was shown to see where most restaraunts are. In the datafiniti set we looked at price range and location, which concludes the West has the highest price value. In terms of states, Conneticut has the highest average max price. The median price range is $25.