Deep (dish) Data Dive

Introduction

In the analysis of these 3 datasets, we are addressing average rating, location, and pricing information of various pizza establishments around the US. The goal is to assist the average American consumer in making informed decisions about which pizza restaurants they would like to patronize.

Data Introduction

This analysis is based on 3 datasets with different information about pizza restaurants in the US:

Barstool - Contains critic, public and the barstool staff’s rating as well as pricing, location and geolocation of the different pizza places.
Jared - Contains ratings for New York pizza restaurants on a 6-point likert scale.
Datafiniti - Contains price range, location and keywords for pizza restaurants around the US.
Region - Contains state by region from the US Census Bureau

To learn more about the datasets used, click here or view a more detailed explanation in the Data Preparation section below.

Approach

Our approach was to divide the analysis to focus on two groups of consumers: 1) those who are price-conscious and 2) those who are interested in finding the pizza place near them with the highest rating.

Packages

library(tidyverse) # data manipulation and cleaning
library(knitr) # data dictionary tables
library(kableExtra) # table formatting
library(readxl) # importing dataset descriptions
library(leaflet) # interactive map & geolocation

Data Dictionary

Background

The pizza data was originally compiled on September 30th, 2019 as a part of a social data project called TidyTuesday. The original intent is for R users to practice their creativity and analytical techniques, specifically with the tidyverse package.

For those who are interested, there’s a 5-minute podcast reviewing the top entries and different visualization techniques for Week 7’s project called “Pizza Party”.

We created the regions dataset based on the definitions from the US Census Bureau.

Data Dictionary

Jared Dataset

variable	class	description
polla_qid	integer	Quiz ID
answer	character	Answer (likert scale)
votes	integer	Number of votes for that question/answer combo
pollq_id	integer	Poll Question ID
question	character	Question
place	character	Pizza Place
time	integer	Time of quiz
total_votes	integer	Total number of votes for that pizza place
percent	double	Vote percent of total for that pizza place

Barstool Dataset

variable	class	description
name	character	Pizza place name
address1	character	Pizza place address
city	character	City
zip	double	Zip
country	character	Country
latitude	double	Latitude
longitude	double	Longitude
price_level	double	Price rating (smaller = cheaper)
provider_rating	double	Provider review score
provider_review_count	double	Provider review count
review_stats_all_average_score	double	Average Score
review_stats_all_count	double	Count of all reviews
review_stats_all_total_score	double	Review total score
review_stats_community_average_score	double	Community average score
review_stats_community_count	double	community review count
review_stats_community_total_score	double	community review total score
review_stats_critic_average_score	double	Critic average score
review_stats_critic_count	double	Critic review count
review_stats_critic_total_score	double	Critic total score
review_stats_dave_average_score	double	Dave (Barstool) average score
review_stats_dave_count	double	Dave review count
review_stats_dave_total_score	double	Dave total score

Datafiniti Dataset

variable	class	description
polla_qid	integer	Quiz ID
answer	character	Answer (likert scale)
votes	integer	Number of votes for that question/answer combo
pollq_id	integer	Poll Question ID
question	character	Question
place	character	Pizza Place
time	integer	Time of quiz
total_votes	integer	Total number of votes for that pizza place
percent	double	Vote percent of total for that pizza place

Data Preparation

NA Values

colSums(is.na(barstool))

##                                 name                             address1 
##                                    0                                    0 
##                                 city                                  zip 
##                                    0                                    0 
##                              country                             latitude 
##                                    0                                    2 
##                            longitude                          price_level 
##                                    2                                    0 
##                      provider_rating                provider_review_count 
##                                    0                                    0 
##       review_stats_all_average_score               review_stats_all_count 
##                                    0                                    0 
##         review_stats_all_total_score review_stats_community_average_score 
##                                    0                                    0 
##         review_stats_community_count   review_stats_community_total_score 
##                                    0                                    0 
##    review_stats_critic_average_score            review_stats_critic_count 
##                                    0                                    0 
##      review_stats_critic_total_score      review_stats_dave_average_score 
##                                    0                                    0 
##              review_stats_dave_count        review_stats_dave_total_score 
##                                    0                                    0

colSums(is.na(jared))

##   polla_qid      answer       votes    pollq_id    question       place 
##           0           0           0           0           0           0 
##        time total_votes     percent 
##           0           0           5

colSums(is.na(datafiniti))

##            name         address            city         country 
##               0               0               0               0 
##        province        latitude       longitude      categories 
##               0               0               0               0 
## price_range_min price_range_max 
##               0               0

colSums(is.na(region))

##      State State Code     Region   Division 
##          0          0          0          0

Jared & Barstool Cleaning

Likert Scale Conversion

The jared dataset pizza ratings are on a 1 to 6 likert scale. However, barstool is on a 1-10 scale. We used a mutate function to convert the 1-6 scale to a 1-10 scale:

jared <- jared %>%
  mutate(answer = case_when(
    .$answer=="Never Again" ~ 0, 
    .$answer=="Poor" ~ 2,
    .$answer=="Fair" ~ 4,
    .$answer=="Average" ~ 6,
    .$answer=="Good"~ 8,
    .$answer=="Excellent" ~ 10))

jared$answer <- as.integer(jared$answer)
str(jared$answer)

##  int [1:375] 10 8 6 2 0 10 8 6 2 0 ...

Once the answers are converted to integers, a calculation is done to find the average rating for each pizza place:

jared <- mutate(jared,Weighted_Rating = answer*votes)


(Jared_Average <- jared %>%
    group_by(place) %>%
    summarise(avg_score = sum(Weighted_Rating)/sum(votes)))

## # A tibble: 56 x 2
##    place                       avg_score
##    <chr>                           <dbl>
##  1 5 Boroughs Pizza                 7.33
##  2 Artichoke Basille's Pizza        8   
##  3 Arturo's                         7.43
##  4 Bella Napoli                     7.07
##  5 Ben's of SoHo 14th Street        4.8 
##  6 Ben's of SoHo Spring Street      6.44
##  7 Big Slice Pizza                  6.27
##  8 Bleecker Street Pizza            8.29
##  9 Bravo Pizza                    NaN   
## 10 Cavallo's Pizza                  7.27
## # ... with 46 more rows

(Jared_Mean <- mean(Jared_Average$avg_score, na.rm = T))

## [1] 6.802782

Two columns are dropped for being insignificant:

jared$polla_qid = NULL
jared$time = NULL
head(jared)

This calculation takes the mean of average scores omiting the 0 values

Critic_average <- mean(NA^(barstool$review_stats_critic_average_score == 0)*barstool$review_stats_critic_average_score, na.rm = TRUE)
Daves_average <- mean(NA^(barstool$review_stats_dave_average_score == 0)*barstool$review_stats_dave_average_score, na.rm = TRUE)
Community_average <- mean(NA^(barstool$review_stats_community_average_score == 0)*barstool$review_stats_community_average_score, na.rm = TRUE)

The three calculations below allow to compare to average scores found in the barstool dataset (Critic, Community and Barstool Staff (Dave)) with the scores in the jared dataset.

Represents the values that are in both the barstool and jared datasets

##  [1] "Williamsburg Pizza"        "Little Italy Pizza"       
##  [3] "Steve's Pizza"             "Girello"                  
##  [5] "5 Boroughs Pizza"          "Saluggi's"                
##  [7] "Artichoke Basille's Pizza" "Bleecker Street Pizza"    
##  [9] "Joe's Pizza"               "Champion Pizza"           
## [11] "Prince Street Pizza"       "Kiss My Slice"            
## [13] "Arturo's"                  "Vinny Vincenz"            
## [15] "Stella's Pizza"            "Gotham Pizza"             
## [17] "Highline Pizza"            "Pizza Italia"             
## [19] "Rivoli Pizza"              "NY Pizza Suprema"         
## [21] "Previti Pizza"             "Rocco's Pizza Joint"

Datafiniti Cleaning

Since there weren’t any NA values in the datafiniti dataset, we simply removed duplicate values and changed the name of “province”" to “State Code” to align with the column name in the region dataset.

# Remove duplicates

datafiniti_unique <- unique(datafiniti)
datafiniti_unique # 2,285 unique observations

## # A tibble: 2,285 x 10
##    name  address city  country province latitude longitude categories
##    <chr> <chr>   <chr> <chr>   <chr>       <dbl>     <dbl> <chr>     
##  1 Shot~ 4203 E~ Sher~ US      AR           34.8     -92.2 Pizza,Res~
##  2 Sauc~ 25 E C~ Phoe~ US      AZ           33.5    -112.  Pizza,Piz~
##  3 Mios~ 3703 P~ Cinc~ US      OH           39.1     -84.4 Restauran~
##  4 Hung~ 30495 ~ Madi~ US      MI           42.5     -83.1 Pizza,Car~
##  5 Spar~ 3600 E~ Balt~ US      MD           39.3     -76.6 Pizza,Ame~
##  6 La V~ 1834 E~ Berk~ US      CA           37.9    -122.  Pizza Pla~
##  7 Bric~ 4819 K~ Tall~ US      FL           30.5     -84.2 Pizza Pla~
##  8 Carr~ 4061 2~ Gran~ US      MI           42.9     -85.6 Restauran~
##  9 Dome~ 146 N ~ Glen~ US      CA           34.1    -118.  Pizza,Res~
## 10 Litt~ 965 N ~ El P~ US      TX           31.9    -107.  Restauran~
## # ... with 2,275 more rows, and 2 more variables: price_range_min <dbl>,
## #   price_range_max <dbl>

# rename province

names(datafiniti_unique)[names(datafiniti_unique) == 'province'] <- 'State Code'

Our next task was to make sense of the categories column, which originally contained values in one column:

head(datafiniti_unique$categories)

## [1] "Pizza,Restaurant,American restaurants,Pizza Place,Restaurants"                  
## [2] "Pizza,Pizza Place,Restaurants"                                                  
## [3] "Restaurant,Pizza Place,Restaurants"                                             
## [4] "Pizza,Carry-out food,Pizza Place,Restaurants"                                   
## [5] "Pizza,American restaurants,Pizza Place,Pizza equipment and supplies,Restaurants"
## [6] "Pizza Place"

We accomplished this by:

Creating a unique id for each pizza place

##   loc_id                name
## 1      1  Shotgun Dans Pizza
## 2      2    Sauce Pizza Wine
## 3      3       Mios Pizzeria
## 4      4 Hungry Howies Pizza
## 5      5    Spartan Pizzeria
## 6      6             La Vals

Separating the different categories (separated by a comma) into individual columns

# Separate categories

datafiniti_cat <- datafiniti_unique %>% separate(categories, c("Cat1", "Cat2", "Cat3", "Cat4","Cat5",
                                                               "Cat6", "Cat7", "Cat8", "Cat9", "Cat10", 
                                                               "Cat11"), sep = ",")

Creating a new dataframe with only the unique category id and separated categories

##   loc_id        Cat1                 Cat2                 Cat3
## 1      1       Pizza           Restaurant American restaurants
## 2      2       Pizza          Pizza Place          Restaurants
## 3      3  Restaurant          Pizza Place          Restaurants
## 4      4       Pizza       Carry-out food          Pizza Place
## 5      5       Pizza American restaurants          Pizza Place
## 6      6 Pizza Place                 <NA>                 <NA>
##                           Cat4        Cat5 Cat6 Cat7 Cat8 Cat9 Cat10
## 1                  Pizza Place Restaurants <NA> <NA> <NA> <NA>  <NA>
## 2                         <NA>        <NA> <NA> <NA> <NA> <NA>  <NA>
## 3                         <NA>        <NA> <NA> <NA> <NA> <NA>  <NA>
## 4                  Restaurants        <NA> <NA> <NA> <NA> <NA>  <NA>
## 5 Pizza equipment and supplies Restaurants <NA> <NA> <NA> <NA>  <NA>
## 6                         <NA>        <NA> <NA> <NA> <NA> <NA>  <NA>

Gathering the categories into one column and removing NA values

##    loc_id             Category
## 1       1                Pizza
## 2       1           Restaurant
## 3       1 American restaurants
## 4       1          Pizza Place
## 5       1          Restaurants
## 11      2                Pizza

Shrinking the list into unique values

# unique categories w/ count

datafiniti_cat %>% 
  group_by(Category) %>% 
  summarize(count = n()) %>% 
  arrange(desc(count))

## # A tibble: 308 x 2
##    Category                           count
##    <chr>                              <int>
##  1 Pizza Place                         2286
##  2 Restaurant                          1104
##  3 Pizza                                655
##  4 Restaurants                          650
##  5 Italian Restaurant                   225
##  6 Italian                               71
##  7 Caterers                              49
##  8 Italian Restaurants                   47
##  9 American restaurants                  45
## 10 Italian Restaurant and Pizza Place    43
## # ... with 298 more rows

Repeating steps 2-5, separating this time on the word “and”. This allowed us to further separate categories such as “Italian Restaurant and Pizza Place” into their own values.

## # A tibble: 312 x 2
##    Category            count
##    <chr>               <int>
##  1 Pizza Place          2286
##  2 Restaurant           1104
##  3 Pizza                 655
##  4 Restaurants           650
##  5 Italian Restaurant    225
##  6 "Pizza Place "         81
##  7 Italian                71
##  8 " Pizza Place"         53
##  9 Caterers               49
## 10 Italian Restaurants    47
## # ... with 302 more rows

Cleaning and consolidating into 10 main categories and an “Other” category containing descriptions with < 10 records each.

Clean_cat	count
Pizza Place	3228
Restaurant	1821
Italian	462
Other	153
American	93
Bar / Pub	83
Caterer	75
Pizza Restaurant	46
Sandwich Place	37
Delivery / Carry-out	23
Karaoke	10

loc_id	Category	Clean_cat
1	Pizza	Pizza Place
1	Restaurant	Restaurant
1	American restaurants	American
1	Pizza Place	Pizza Place
1	Restaurants	Restaurant
2	Pizza	Pizza Place
2	Pizza Place	Pizza Place
2	Restaurants	Restaurant
3	Restaurant	Restaurant
3	Pizza Place	Pizza Place

Merging with the original dataset to end up with a row for each pizza place & category for analysis by category

# merge w\ new categories & delete original categories

datafiniti_new <- merge(datafiniti_unique, datafiniti_cat, by = 'loc_id') 
datafiniti_new <- datafiniti_new %>% 
  select(-c(categories, Category))

datafiniti_new <- unique(datafiniti_new)

kable(head(select(datafiniti_new, c(loc_id, name, new_cat)), 10)) %>% 
   kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

	loc_id	name	new_cat
1	1	Shotgun Dans Pizza	Pizza Place
2	1	Shotgun Dans Pizza	Restaurant
3	1	Shotgun Dans Pizza	American
6	2	Sauce Pizza Wine	Pizza Place
8	2	Sauce Pizza Wine	Restaurant
9	3	Mios Pizzeria	Restaurant
10	3	Mios Pizzeria	Pizza Place
12	4	Hungry Howies Pizza	Pizza Place
13	4	Hungry Howies Pizza	Delivery / Carry-out
15	4	Hungry Howies Pizza	Restaurant

Exploratory Analysis

First, we looked at a brief overview of prices by category:

datafiniti_new %>% 
  group_by(new_cat) %>% 
  summarize(Avg_Min_Price = mean(price_range_min), Avg_Max_Price = mean(price_range_max)) %>% 
  mutate(Range = Avg_Max_Price - Avg_Min_Price) %>% 
  arrange(Avg_Min_Price)

## # A tibble: 11 x 4
##    new_cat              Avg_Min_Price Avg_Max_Price Range
##    <chr>                        <dbl>         <dbl> <dbl>
##  1 Sandwich Place                2             26.2  24.2
##  2 Pizza Restaurant              2.52          26.6  24.1
##  3 Caterer                       4.3           27.6  23.3
##  4 Restaurant                    4.41          27.6  23.2
##  5 Delivery / Carry-out          4.55          27.7  23.2
##  6 Pizza Place                   4.67          27.8  23.1
##  7 Karaoke                       5             28    23  
##  8 Other                         6.16          28.6  22.4
##  9 American                      7.57          29.4  21.8
## 10 Italian                       9.74          30.9  21.2
## 11 Bar / Pub                    10.8           31.5  20.7

Next, we delved a little deeper into the pricing structure:

# Price Exploration -------------------------------------------------------

# min \ max summary

summary(datafiniti_unique$price_range_min) # 3rd quartile is still 0

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00    4.67    0.00   50.00

summary(datafiniti_unique$price_range_max) # ranges from 7 to 55

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     7.0    25.0    25.0    27.8    25.0    55.0

# create spread variable

datafiniti_unique <- datafiniti_unique %>% 
  mutate(Range = price_range_max - price_range_min)

datafiniti_new <- datafiniti_new %>% 
  mutate(Range = price_range_max - price_range_min)

summary(datafiniti_unique$Range)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   25.00   25.00   23.13   25.00   34.00

summary(datafiniti_new$Range)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   25.00   25.00   22.95   25.00   34.00

# histogram for each

datafiniti_unique %>% 
  ggplot(aes(x = price_range_min)) +
  geom_histogram(binwidth = 2) +
  coord_cartesian(xlim = c(0,40))

datafiniti_unique %>% 
  ggplot(aes(x = price_range_max)) +
  geom_histogram(binwidth = 2) +
  coord_cartesian(xlim = c(20,55))

datafiniti_unique %>% 
  ggplot(aes(x = Range)) +
  geom_histogram(binwidth = 2) +
  coord_cartesian(xlim = c(10,30))

# boxplot for each

datafiniti_new %>% 
  ggplot(aes(x = new_cat, y = price_range_min)) +
  geom_boxplot() +
  coord_flip()

datafiniti_new %>% 
  ggplot(aes(x = new_cat, y = price_range_max)) +
  geom_boxplot() +
  coord_flip()

datafiniti_new %>% 
  ggplot(aes(x = new_cat, y = Range)) +
  geom_boxplot() +
  coord_flip()

# Price by Region / State ------------------------------------------------------------------

# add Region to datafiniti_unique

datafiniti_unique <- 
  datafiniti_unique %>% 
  merge(region, by = "State Code") %>% 
  select(-(Division))

# Min / max by region

datafiniti_unique %>% 
  group_by(Region) %>% 
  summarize(Avg_min_price = mean(price_range_min),
            Avg_max_price = mean(price_range_max))

## # A tibble: 4 x 3
##   Region    Avg_min_price Avg_max_price
##   <chr>             <dbl>         <dbl>
## 1 Midwest            4.35          27.6
## 2 Northeast          3.53          27.0
## 3 South              5.14          28.1
## 4 West               5.75          28.5

# Min / max by state

datafiniti_unique %>% 
  group_by(State) %>% 
  summarize(Avg_max_price = mean(price_range_max)) %>% 
  arrange(desc(Avg_max_price))

## # A tibble: 44 x 2
##    State         Avg_max_price
##    <chr>                 <dbl>
##  1 Connecticut            40  
##  2 South Dakota           32.5
##  3 Iowa                   31  
##  4 Alabama                30  
##  5 Nebraska               30  
##  6 Arizona                29.9
##  7 West Virginia          29.3
##  8 Maryland               29.2
##  9 Louisiana              29.1
## 10 Oregon                 29.1
## # ... with 34 more rows

datafiniti_unique %>% 
  group_by(State) %>% 
  summarize(Avg_min_price = mean(price_range_min)) %>% 
  arrange(Avg_min_price)

## # A tibble: 44 x 2
##    State          Avg_min_price
##    <chr>                  <dbl>
##  1 Alaska                  0   
##  2 Delaware                0   
##  3 Hawaii                  0   
##  4 Montana                 0   
##  5 New Jersey              0   
##  6 North Dakota            0   
##  7 Nevada                  1.25
##  8 South Carolina          1.85
##  9 Arkansas                1.92
## 10 Kansas                  2.08
## # ... with 34 more rows

Barstool Restaurant Locations

Locations of the restaurants in the barstool dataset

m <- leaflet(barstool) %>% addTiles()
m %>% addCircles(lng = ~ barstool$longitude, lat = ~ barstool$latitude, popup = barstool$nameL, weight = 8, color = "#fb3004", stroke = TRUE)

This map shows the barstool restaurants which have been reviewed. The circles make it easy to portray a visual of where most reviews are from. For example, zooming in on New York will show hundreds of restaurants.

Datafiniti Point Plot

ggplot(data = datafiniti, aes(x=price_range_max,y=price_range_min)) + 
  geom_point(color = "blue", size=2,shape=17)

The three calculations below allow to compare to average scores found in the barstool dataset (Critic, Community and Barstool Staff (Dave)) with the scores in the jared dataset.

Pizza_Averages <- c("Dave (Barstool)","Community","Critic","Jared")
y <- c(6.622,7.0846,7.256,6.802)
barplot(y,names.arg = Pizza_Averages,xlab = "Average Review",ylab = "Rating",col = "blue", main = "Pizza Averages")

Summary

Our goal was to educate consumers on how they want to spend their money in the pizza industry. We looked at ratings between the barstool and jared datasets, which concluded that the jared pizza average was higher than barstool.The barstool map was shown to see where most restaraunts are. In the datafiniti set we looked at price range and location, which concludes the West has the highest price value. In terms of states, Conneticut has the highest average max price. The median price range is $25.

Deep (dish) Data Dive

2019-12-03

Introduction

Data Introduction

Approach

Packages

Data Dictionary

Background

Data Dictionary

Jared Dataset

Barstool Dataset

Datafiniti Dataset

Data Preparation

NA Values

Jared & Barstool Cleaning

Likert Scale Conversion

Datafiniti Cleaning

Exploratory Analysis

Barstool Restaurant Locations

Datafiniti Point Plot

Summary