Deep (dish) Data Dive

Data Preparation

Background

The pizza data was originally compiled on September 30th, 2019 as a part of a social data project called TidyTuesday. The original intent is for R users to practice their creativity and analytical techniques, specifically with the tidyverse package.

For those who are interested, there’s a 5-minute podcast reviewing the top entries and different visualization techniques for Week 7’s project called “Pizza Party”.

We created the regions dataset based on the definitions from the US Census Bureau.

Data Dictionary

Jared Dataset

variable	class	description
polla_qid	integer	Quiz ID
answer	character	Answer (likert scale)
votes	integer	Number of votes for that question/answer combo
pollq_id	integer	Poll Question ID
question	character	Question
place	character	Pizza Place
time	integer	Time of quiz
total_votes	integer	Total number of votes for that pizza place
percent	double	Vote percent of total for that pizza place

Barstool Dataset

variable	class	description
name	character	Pizza place name
address1	character	Pizza place address
city	character	City
zip	double	Zip
country	character	Country
latitude	double	Latitude
longitude	double	Longitude
price_level	double	Price rating (smaller = cheaper)
provider_rating	double	Provider review score
provider_review_count	double	Provider review count
review_stats_all_average_score	double	Average Score
review_stats_all_count	double	Count of all reviews
review_stats_all_total_score	double	Review total score
review_stats_community_average_score	double	Community average score
review_stats_community_count	double	community review count
review_stats_community_total_score	double	community review total score
review_stats_critic_average_score	double	Critic average score
review_stats_critic_count	double	Critic review count
review_stats_critic_total_score	double	Critic total score
review_stats_dave_average_score	double	Dave (Barstool) average score
review_stats_dave_count	double	Dave review count
review_stats_dave_total_score	double	Dave total score

Datafiniti Dataset

variable	class	description
polla_qid	integer	Quiz ID
answer	character	Answer (likert scale)
votes	integer	Number of votes for that question/answer combo
pollq_id	integer	Poll Question ID
question	character	Question
place	character	Pizza Place
time	integer	Time of quiz
total_votes	integer	Total number of votes for that pizza place
percent	double	Vote percent of total for that pizza place

NA Values

colSums(is.na(barstool))

##                                 name                             address1 
##                                    0                                    0 
##                                 city                                  zip 
##                                    0                                    0 
##                              country                             latitude 
##                                    0                                    2 
##                            longitude                          price_level 
##                                    2                                    0 
##                      provider_rating                provider_review_count 
##                                    0                                    0 
##       review_stats_all_average_score               review_stats_all_count 
##                                    0                                    0 
##         review_stats_all_total_score review_stats_community_average_score 
##                                    0                                    0 
##         review_stats_community_count   review_stats_community_total_score 
##                                    0                                    0 
##    review_stats_critic_average_score            review_stats_critic_count 
##                                    0                                    0 
##      review_stats_critic_total_score      review_stats_dave_average_score 
##                                    0                                    0 
##              review_stats_dave_count        review_stats_dave_total_score 
##                                    0                                    0

colSums(is.na(jared))

##   polla_qid      answer       votes    pollq_id    question       place 
##           0           0           0           0           0           0 
##        time total_votes     percent 
##           0           0           5

colSums(is.na(datafiniti))

##            name         address            city         country 
##               0               0               0               0 
##        province        latitude       longitude      categories 
##               0               0               0               0 
## price_range_min price_range_max 
##               0               0

colSums(is.na(region))

##      State State Code     Region   Division 
##          0          0          0          0

Jared & Barstool Cleaning

Likert Scale Conversion

The jared dataset pizza ratings are on a 1 to 6 likert scale. However, barstool is on a 1-10 scale. We used a mutate function to convert the 1-6 scale to a 1-10 scale:

jared <- jared %>%
  mutate(answer = case_when(
    .$answer=="Never Again" ~ 0, 
    .$answer=="Poor" ~ 2,
    .$answer=="Fair" ~ 4,
    .$answer=="Average" ~ 6,
    .$answer=="Good"~ 8,
    .$answer=="Excellent" ~ 10))

jared$answer <- as.integer(jared$answer)
str(jared$answer)

##  int [1:375] 10 8 6 2 0 10 8 6 2 0 ...

Once the answers are converted to integers, a calculation is done to find the average rating for each pizza place:

jared <- mutate(jared,Weighted_Rating = answer*votes)


(Jared_Average <- jared %>%
    group_by(place) %>%
    summarise(avg_score = sum(Weighted_Rating)/sum(votes)))

## # A tibble: 56 x 2
##    place                       avg_score
##    <chr>                           <dbl>
##  1 5 Boroughs Pizza                 7.33
##  2 Artichoke Basille's Pizza        8   
##  3 Arturo's                         7.43
##  4 Bella Napoli                     7.07
##  5 Ben's of SoHo 14th Street        4.8 
##  6 Ben's of SoHo Spring Street      6.44
##  7 Big Slice Pizza                  6.27
##  8 Bleecker Street Pizza            8.29
##  9 Bravo Pizza                    NaN   
## 10 Cavallo's Pizza                  7.27
## # ... with 46 more rows

(Jared_Mean <- mean(Jared_Average$avg_score, na.rm = T))

## [1] 6.802782

Two columns are dropped for being insignificant:

jared$polla_qid = NULL
jared$time = NULL
head(jared)

This calculation takes the mean of average scores omiting the 0 values

Critic_average <- mean(NA^(barstool$review_stats_critic_average_score == 0)*barstool$review_stats_critic_average_score, na.rm = TRUE)
Daves_average <- mean(NA^(barstool$review_stats_dave_average_score == 0)*barstool$review_stats_dave_average_score, na.rm = TRUE)
Community_average <- mean(NA^(barstool$review_stats_community_average_score == 0)*barstool$review_stats_community_average_score, na.rm = TRUE)

The three calculations below allow to compare to average scores found in the barstool dataset (Critic, Community and Barstool Staff (Dave)) with the scores in the jared dataset.

Pizza_Averages <- c("Dave (Barstool)","Community","Critic","Jared")
y <- c(6.622,7.0846,7.256,6.802)
barplot(y,names.arg = Pizza_Averages,xlab = "Average Review",ylab = "Rating",col = "blue", main = "Pizza Averages")

Represents the values that are in both the barstool and jared datasets

##  [1] "Williamsburg Pizza"        "Little Italy Pizza"       
##  [3] "Steve's Pizza"             "Girello"                  
##  [5] "5 Boroughs Pizza"          "Saluggi's"                
##  [7] "Artichoke Basille's Pizza" "Bleecker Street Pizza"    
##  [9] "Joe's Pizza"               "Champion Pizza"           
## [11] "Prince Street Pizza"       "Kiss My Slice"            
## [13] "Arturo's"                  "Vinny Vincenz"            
## [15] "Stella's Pizza"            "Gotham Pizza"             
## [17] "Highline Pizza"            "Pizza Italia"             
## [19] "Rivoli Pizza"              "NY Pizza Suprema"         
## [21] "Previti Pizza"             "Rocco's Pizza Joint"

Datafiniti Cleaning

Since there weren’t any NA values in the datafiniti dataset, we simply removed duplicate values and changed the name of “province”" to “State Code” to align with the column name in the region dataset.

# Remove duplicates

datafiniti_unique <- unique(datafiniti)
datafiniti_unique # 2,285 unique observations

## # A tibble: 2,285 x 10
##    name  address city  country province latitude longitude categories
##    <chr> <chr>   <chr> <chr>   <chr>       <dbl>     <dbl> <chr>     
##  1 Shot~ 4203 E~ Sher~ US      AR           34.8     -92.2 Pizza,Res~
##  2 Sauc~ 25 E C~ Phoe~ US      AZ           33.5    -112.  Pizza,Piz~
##  3 Mios~ 3703 P~ Cinc~ US      OH           39.1     -84.4 Restauran~
##  4 Hung~ 30495 ~ Madi~ US      MI           42.5     -83.1 Pizza,Car~
##  5 Spar~ 3600 E~ Balt~ US      MD           39.3     -76.6 Pizza,Ame~
##  6 La V~ 1834 E~ Berk~ US      CA           37.9    -122.  Pizza Pla~
##  7 Bric~ 4819 K~ Tall~ US      FL           30.5     -84.2 Pizza Pla~
##  8 Carr~ 4061 2~ Gran~ US      MI           42.9     -85.6 Restauran~
##  9 Dome~ 146 N ~ Glen~ US      CA           34.1    -118.  Pizza,Res~
## 10 Litt~ 965 N ~ El P~ US      TX           31.9    -107.  Restauran~
## # ... with 2,275 more rows, and 2 more variables: price_range_min <dbl>,
## #   price_range_max <dbl>

# rename province

names(datafiniti_unique)[names(datafiniti_unique) == 'province'] <- 'State Code'

Our next task was to make sense of the categories column, which originally contained values in one column:

head(datafiniti_unique$categories)

## [1] "Pizza,Restaurant,American restaurants,Pizza Place,Restaurants"                  
## [2] "Pizza,Pizza Place,Restaurants"                                                  
## [3] "Restaurant,Pizza Place,Restaurants"                                             
## [4] "Pizza,Carry-out food,Pizza Place,Restaurants"                                   
## [5] "Pizza,American restaurants,Pizza Place,Pizza equipment and supplies,Restaurants"
## [6] "Pizza Place"

We accomplished this by:

Creating a unique id for each pizza place

##   loc_id                name
## 1      1  Shotgun Dans Pizza
## 2      2    Sauce Pizza Wine
## 3      3       Mios Pizzeria
## 4      4 Hungry Howies Pizza
## 5      5    Spartan Pizzeria
## 6      6             La Vals

Separating the different categories (separated by a comma) into individual columns

# Separate categories

datafiniti_cat <- datafiniti_unique %>% separate(categories,
                                                  c("Cat1", "Cat2", "Cat3", "Cat4", "Cat5","Cat6", 
                                                  "Cat7", "Cat8", "Cat9", "Cat10", "Cat11"), sep = ",")

Creating a new dataframe with only the unique category id and separated categories

##   loc_id        Cat1                 Cat2                 Cat3
## 1      1       Pizza           Restaurant American restaurants
## 2      2       Pizza          Pizza Place          Restaurants
## 3      3  Restaurant          Pizza Place          Restaurants
## 4      4       Pizza       Carry-out food          Pizza Place
## 5      5       Pizza American restaurants          Pizza Place
## 6      6 Pizza Place                 <NA>                 <NA>
##                           Cat4        Cat5 Cat6 Cat7 Cat8 Cat9 Cat10
## 1                  Pizza Place Restaurants <NA> <NA> <NA> <NA>  <NA>
## 2                         <NA>        <NA> <NA> <NA> <NA> <NA>  <NA>
## 3                         <NA>        <NA> <NA> <NA> <NA> <NA>  <NA>
## 4                  Restaurants        <NA> <NA> <NA> <NA> <NA>  <NA>
## 5 Pizza equipment and supplies Restaurants <NA> <NA> <NA> <NA>  <NA>
## 6                         <NA>        <NA> <NA> <NA> <NA> <NA>  <NA>

Gathering the categories into one column and removing NA values

# gather categories

datafiniti_cat <- 
  datafiniti_cat %>% 
  gather(Cat, Category, 2:11) %>% 
  arrange(loc_id)

# remove NA categories

datafiniti_cat <- datafiniti_cat[!is.na(datafiniti_cat$Category), ]

# remove Cat column

datafiniti_cat <- datafiniti_cat[ , c(1,3)]

head(datafiniti_cat)

##    loc_id             Category
## 1       1                Pizza
## 2       1           Restaurant
## 3       1 American restaurants
## 4       1          Pizza Place
## 5       1          Restaurants
## 11      2                Pizza

Shrinking the list into unique values

# unique categories w\ count

datafiniti_cat %>% 
  group_by(Category) %>% 
  summarize(count = n()) %>% 
  arrange(desc(count))

## # A tibble: 308 x 2
##    Category                           count
##    <chr>                              <int>
##  1 Pizza Place                         2286
##  2 Restaurant                          1104
##  3 Pizza                                655
##  4 Restaurants                          650
##  5 Italian Restaurant                   225
##  6 Italian                               71
##  7 Caterers                              49
##  8 Italian Restaurants                   47
##  9 American restaurants                  45
## 10 Italian Restaurant and Pizza Place    43
## # ... with 298 more rows

Cleaning and consolidating the categories (#### unique)

# Clean \ combine categories

datafiniti_cat$Category[datafiniti_cat$Category == "Restaurants"] <- "Restaurant"
datafiniti_cat$Category[datafiniti_cat$Category == "Italian Restaurants"] <- "Italian Restaurant"
datafiniti_cat$Category[datafiniti_cat$Category == "Italian"] <- "Italian Restaurant"
datafiniti_cat$Category[datafiniti_cat$Category == "Italian restaurants"] <- "Italian Restaurant"
datafiniti_cat$Category[datafiniti_cat$Category == " and Italian restaurant"] <- "Italian Restaurant"
datafiniti_cat$Category[datafiniti_cat$Category == "Pizza"] <- "Pizza Place"
datafiniti_cat$Category[datafiniti_cat$Category == "Pizza Restaurants"] <- "Pizza Place"
datafiniti_cat$Category[datafiniti_cat$Category == "Catering"] <- "Caterers"
datafiniti_cat$Category[datafiniti_cat$Category == "American restaurants"] <- "American Restaurant"
datafiniti_cat$Category[datafiniti_cat$Category == "American Restaurants"] <- "American Restaurant"
datafiniti_cat$Category[datafiniti_cat$Category == "wich Place"] <- "Sandwich Place"
datafiniti_cat$Category[datafiniti_cat$Category == " Sandwich Place"] <- "Sandwich Place"
datafiniti_cat$Category[datafiniti_cat$Category == "American restaurants"] <- "American Restaurant"
datafiniti_cat$Category[!str_detect(datafiniti_cat$Category, "^Pizza Place and") &
                          str_detect(datafiniti_cat$Category, "^Pizza Place ")] <- "Pizza Place"
datafiniti_cat$Category[datafiniti_cat$Category == "Pizza Place and Rock Club Charles North"] <- "Pizza Place and Rock Club"
datafiniti_cat$Category[datafiniti_cat$Category == "Bars"] <- "Bar"
datafiniti_cat$Category[datafiniti_cat$Category == " Bar"] <- "Bar"
datafiniti_cat$Category[datafiniti_cat$Category == " and Bar"] <- "Bar"
datafiniti_cat$Category[datafiniti_cat$Category == " and Pizza Place"] <- "Pizza Place"
datafiniti_cat$Category[datafiniti_cat$Category == " and Diner"] <- "Diner"
datafiniti_cat$Category[datafiniti_cat$Category == "Ne\\\\pizza Place"] <- "Pizza Place"
datafiniti_cat$Category[datafiniti_cat$Category == "Fl\\\\pizza Place"] <- "Pizza Place"
datafiniti_cat$Category[datafiniti_cat$Category == "Fl\\\\pizza Place"] <- "Pizza Place"

#rename category

names(datafiniti_cat)[names(datafiniti_cat) == 'Category'] <- 'new_cat'

datafiniti_cat %>% 
  group_by(new_cat) %>% 
  summarize(count = n()) %>% 
  arrange(desc(count))

## # A tibble: 158 x 2
##    new_cat                            count
##    <chr>                              <int>
##  1 Pizza Place                         3121
##  2 Restaurant                          1754
##  3 Italian Restaurant                   376
##  4 Caterers                              75
##  5 American Restaurant                   73
##  6 Italian Restaurant and Pizza Place    43
##  7 Bar                                   41
##  8 Pizza Place and Italian Restaurant    31
##  9 Take Out Restaurants                  29
## 10 Sandwich Place                        25
## # ... with 148 more rows

Merging with the original dataset to end up with a row for each pizza place & category for analysis by category

# merge w\ new categories & delete original categories

datafiniti_new <- merge(datafiniti_unique, datafiniti_cat, by = 'loc_id') 
datafiniti_new <- datafiniti_new %>% 
  select(-categories)

head(datafiniti_new)

##   loc_id               name           address     city country State Code
## 1      1 Shotgun Dans Pizza  4203 E Kiehl Ave Sherwood      US         AR
## 2      1 Shotgun Dans Pizza  4203 E Kiehl Ave Sherwood      US         AR
## 3      1 Shotgun Dans Pizza  4203 E Kiehl Ave Sherwood      US         AR
## 4      1 Shotgun Dans Pizza  4203 E Kiehl Ave Sherwood      US         AR
## 5      1 Shotgun Dans Pizza  4203 E Kiehl Ave Sherwood      US         AR
## 6      2   Sauce Pizza Wine 25 E Camelback Rd  Phoenix      US         AZ
##   latitude longitude price_range_min price_range_max             new_cat
## 1 34.83230  -92.1838               0              25         Pizza Place
## 2 34.83230  -92.1838               0              25          Restaurant
## 3 34.83230  -92.1838               0              25 American Restaurant
## 4 34.83230  -92.1838               0              25         Pizza Place
## 5 34.83230  -92.1838               0              25          Restaurant
## 6 33.50927 -112.0730               0              25         Pizza Place

Proposed EDA

For the datafiniti data set, we’d like to look at
Histograms / Boxplots of the prices
Look at prices by region and state
Display results on a map (potentially including barstool)
Average prices by category (potentially including barstool)
For the barstool and jared data sets we’d like to look at
Descriptive statistics for ratings
Rating by location / state
Rating by category
Compare ratings by reviewer