Introduction

In the analysis of these 3 datasets, we are addressing average rating, location, and pricing information of various pizza establishments around the US. The goal is to assist the average American consumer in making informed decisions about which pizza restaurants they would like to patronize.

Data Introduction

This analysis is based on 3 datasets with different information about pizza restaurants in the US:

  • Barstool - Contains critic, public and the barstool staff’s rating as well as pricing, location and geolocation of the different pizza places.

  • Jared - Contains ratings for New York pizza restaurants on a 6-point likert scale.

  • Datafiniti - Contains price range, location and keywords for pizza restaurants around the US.

  • Region - Contains state by region from the US Census Bureau

To learn more about the datasets used, click here or view a more detailed explanation in the Data Preparation section below.

Approach

Our approach was to divide the analysis to focus on two groups of consumers: 1) those who are price-conscious and 2) those who are interested in finding the pizza place near them with the highest rating.

Packages

library(tidyverse) # data manipulation and cleaning
library(knitr) # data dictionary tables
library(kableExtra) # table formatting
library(readxl) # importing dataset descriptions
library(leaflet) # interactive map & geolocation

Data Dictionary

Background

The pizza data was originally compiled on September 30th, 2019 as a part of a social data project called TidyTuesday. The original intent is for R users to practice their creativity and analytical techniques, specifically with the tidyverse package.

For those who are interested, there’s a 5-minute podcast reviewing the top entries and different visualization techniques for Week 7’s project called “Pizza Party”.

We created the regions dataset based on the definitions from the US Census Bureau.

Data Dictionary

Jared Dataset

variable class description
polla_qid integer Quiz ID
answer character Answer (likert scale)
votes integer Number of votes for that question/answer combo
pollq_id integer Poll Question ID
question character Question
place character Pizza Place
time integer Time of quiz
total_votes integer Total number of votes for that pizza place
percent double Vote percent of total for that pizza place

Barstool Dataset

variable class description
name character Pizza place name
address1 character Pizza place address
city character City
zip double Zip
country character Country
latitude double Latitude
longitude double Longitude
price_level double Price rating (smaller = cheaper)
provider_rating double Provider review score
provider_review_count double Provider review count
review_stats_all_average_score double Average Score
review_stats_all_count double Count of all reviews
review_stats_all_total_score double Review total score
review_stats_community_average_score double Community average score
review_stats_community_count double community review count
review_stats_community_total_score double community review total score
review_stats_critic_average_score double Critic average score
review_stats_critic_count double Critic review count
review_stats_critic_total_score double Critic total score
review_stats_dave_average_score double Dave (Barstool) average score
review_stats_dave_count double Dave review count
review_stats_dave_total_score double Dave total score

Datafiniti Dataset

variable class description
polla_qid integer Quiz ID
answer character Answer (likert scale)
votes integer Number of votes for that question/answer combo
pollq_id integer Poll Question ID
question character Question
place character Pizza Place
time integer Time of quiz
total_votes integer Total number of votes for that pizza place
percent double Vote percent of total for that pizza place

Data Preparation

NA Values

colSums(is.na(barstool))
##                                 name                             address1 
##                                    0                                    0 
##                                 city                                  zip 
##                                    0                                    0 
##                              country                             latitude 
##                                    0                                    2 
##                            longitude                          price_level 
##                                    2                                    0 
##                      provider_rating                provider_review_count 
##                                    0                                    0 
##       review_stats_all_average_score               review_stats_all_count 
##                                    0                                    0 
##         review_stats_all_total_score review_stats_community_average_score 
##                                    0                                    0 
##         review_stats_community_count   review_stats_community_total_score 
##                                    0                                    0 
##    review_stats_critic_average_score            review_stats_critic_count 
##                                    0                                    0 
##      review_stats_critic_total_score      review_stats_dave_average_score 
##                                    0                                    0 
##              review_stats_dave_count        review_stats_dave_total_score 
##                                    0                                    0
colSums(is.na(jared))
##   polla_qid      answer       votes    pollq_id    question       place 
##           0           0           0           0           0           0 
##        time total_votes     percent 
##           0           0           5
colSums(is.na(datafiniti))
##            name         address            city         country 
##               0               0               0               0 
##        province        latitude       longitude      categories 
##               0               0               0               0 
## price_range_min price_range_max 
##               0               0
colSums(is.na(region))
##      State State Code     Region   Division 
##          0          0          0          0

Jared & Barstool Cleaning

Likert Scale Conversion

The jared dataset pizza ratings are on a 1 to 6 likert scale. However, barstool is on a 1-10 scale. We used a mutate function to convert the 1-6 scale to a 1-10 scale:

jared <- jared %>%
  mutate(answer = case_when(
    .$answer=="Never Again" ~ 0, 
    .$answer=="Poor" ~ 2,
    .$answer=="Fair" ~ 4,
    .$answer=="Average" ~ 6,
    .$answer=="Good"~ 8,
    .$answer=="Excellent" ~ 10))

jared$answer <- as.integer(jared$answer)
str(jared$answer)
##  int [1:375] 10 8 6 2 0 10 8 6 2 0 ...

Once the answers are converted to integers, a calculation is done to find the average rating for each pizza place:

jared <- mutate(jared,Weighted_Rating = answer*votes)


(Jared_Average <- jared %>%
    group_by(place) %>%
    summarise(avg_score = sum(Weighted_Rating)/sum(votes)))
## # A tibble: 56 x 2
##    place                       avg_score
##    <chr>                           <dbl>
##  1 5 Boroughs Pizza                 7.33
##  2 Artichoke Basille's Pizza        8   
##  3 Arturo's                         7.43
##  4 Bella Napoli                     7.07
##  5 Ben's of SoHo 14th Street        4.8 
##  6 Ben's of SoHo Spring Street      6.44
##  7 Big Slice Pizza                  6.27
##  8 Bleecker Street Pizza            8.29
##  9 Bravo Pizza                    NaN   
## 10 Cavallo's Pizza                  7.27
## # ... with 46 more rows
(Jared_Mean <- mean(Jared_Average$avg_score, na.rm = T))
## [1] 6.802782

Two columns are dropped for being insignificant:

jared$polla_qid = NULL
jared$time = NULL
head(jared)

This calculation takes the mean of average scores omiting the 0 values

Critic_average <- mean(NA^(barstool$review_stats_critic_average_score == 0)*barstool$review_stats_critic_average_score, na.rm = TRUE)
Daves_average <- mean(NA^(barstool$review_stats_dave_average_score == 0)*barstool$review_stats_dave_average_score, na.rm = TRUE)
Community_average <- mean(NA^(barstool$review_stats_community_average_score == 0)*barstool$review_stats_community_average_score, na.rm = TRUE)

The three calculations below allow to compare to average scores found in the barstool dataset (Critic, Community and Barstool Staff (Dave)) with the scores in the jared dataset.

Represents the values that are in both the barstool and jared datasets

##  [1] "Williamsburg Pizza"        "Little Italy Pizza"       
##  [3] "Steve's Pizza"             "Girello"                  
##  [5] "5 Boroughs Pizza"          "Saluggi's"                
##  [7] "Artichoke Basille's Pizza" "Bleecker Street Pizza"    
##  [9] "Joe's Pizza"               "Champion Pizza"           
## [11] "Prince Street Pizza"       "Kiss My Slice"            
## [13] "Arturo's"                  "Vinny Vincenz"            
## [15] "Stella's Pizza"            "Gotham Pizza"             
## [17] "Highline Pizza"            "Pizza Italia"             
## [19] "Rivoli Pizza"              "NY Pizza Suprema"         
## [21] "Previti Pizza"             "Rocco's Pizza Joint"

Datafiniti Cleaning

Since there weren’t any NA values in the datafiniti dataset, we simply removed duplicate values and changed the name of “province”" to “State Code” to align with the column name in the region dataset.

# Remove duplicates

datafiniti_unique <- unique(datafiniti)
datafiniti_unique # 2,285 unique observations
## # A tibble: 2,285 x 10
##    name  address city  country province latitude longitude categories
##    <chr> <chr>   <chr> <chr>   <chr>       <dbl>     <dbl> <chr>     
##  1 Shot~ 4203 E~ Sher~ US      AR           34.8     -92.2 Pizza,Res~
##  2 Sauc~ 25 E C~ Phoe~ US      AZ           33.5    -112.  Pizza,Piz~
##  3 Mios~ 3703 P~ Cinc~ US      OH           39.1     -84.4 Restauran~
##  4 Hung~ 30495 ~ Madi~ US      MI           42.5     -83.1 Pizza,Car~
##  5 Spar~ 3600 E~ Balt~ US      MD           39.3     -76.6 Pizza,Ame~
##  6 La V~ 1834 E~ Berk~ US      CA           37.9    -122.  Pizza Pla~
##  7 Bric~ 4819 K~ Tall~ US      FL           30.5     -84.2 Pizza Pla~
##  8 Carr~ 4061 2~ Gran~ US      MI           42.9     -85.6 Restauran~
##  9 Dome~ 146 N ~ Glen~ US      CA           34.1    -118.  Pizza,Res~
## 10 Litt~ 965 N ~ El P~ US      TX           31.9    -107.  Restauran~
## # ... with 2,275 more rows, and 2 more variables: price_range_min <dbl>,
## #   price_range_max <dbl>
# rename province

names(datafiniti_unique)[names(datafiniti_unique) == 'province'] <- 'State Code'

Our next task was to make sense of the categories column, which originally contained values in one column:

head(datafiniti_unique$categories)
## [1] "Pizza,Restaurant,American restaurants,Pizza Place,Restaurants"                  
## [2] "Pizza,Pizza Place,Restaurants"                                                  
## [3] "Restaurant,Pizza Place,Restaurants"                                             
## [4] "Pizza,Carry-out food,Pizza Place,Restaurants"                                   
## [5] "Pizza,American restaurants,Pizza Place,Pizza equipment and supplies,Restaurants"
## [6] "Pizza Place"

We accomplished this by:

  1. Creating a unique id for each pizza place
##   loc_id                name
## 1      1  Shotgun Dans Pizza
## 2      2    Sauce Pizza Wine
## 3      3       Mios Pizzeria
## 4      4 Hungry Howies Pizza
## 5      5    Spartan Pizzeria
## 6      6             La Vals
  1. Separating the different categories (separated by a comma) into individual columns
# Separate categories

datafiniti_cat <- datafiniti_unique %>% separate(categories, c("Cat1", "Cat2", "Cat3", "Cat4","Cat5",
                                                               "Cat6", "Cat7", "Cat8", "Cat9", "Cat10", 
                                                               "Cat11"), sep = ",")
  1. Creating a new dataframe with only the unique category id and separated categories
##   loc_id        Cat1                 Cat2                 Cat3
## 1      1       Pizza           Restaurant American restaurants
## 2      2       Pizza          Pizza Place          Restaurants
## 3      3  Restaurant          Pizza Place          Restaurants
## 4      4       Pizza       Carry-out food          Pizza Place
## 5      5       Pizza American restaurants          Pizza Place
## 6      6 Pizza Place                 <NA>                 <NA>
##                           Cat4        Cat5 Cat6 Cat7 Cat8 Cat9 Cat10
## 1                  Pizza Place Restaurants <NA> <NA> <NA> <NA>  <NA>
## 2                         <NA>        <NA> <NA> <NA> <NA> <NA>  <NA>
## 3                         <NA>        <NA> <NA> <NA> <NA> <NA>  <NA>
## 4                  Restaurants        <NA> <NA> <NA> <NA> <NA>  <NA>
## 5 Pizza equipment and supplies Restaurants <NA> <NA> <NA> <NA>  <NA>
## 6                         <NA>        <NA> <NA> <NA> <NA> <NA>  <NA>
  1. Gathering the categories into one column and removing NA values
##    loc_id             Category
## 1       1                Pizza
## 2       1           Restaurant
## 3       1 American restaurants
## 4       1          Pizza Place
## 5       1          Restaurants
## 11      2                Pizza
  1. Shrinking the list into unique values
# unique categories w/ count

datafiniti_cat %>% 
  group_by(Category) %>% 
  summarize(count = n()) %>% 
  arrange(desc(count))
## # A tibble: 308 x 2
##    Category                           count
##    <chr>                              <int>
##  1 Pizza Place                         2286
##  2 Restaurant                          1104
##  3 Pizza                                655
##  4 Restaurants                          650
##  5 Italian Restaurant                   225
##  6 Italian                               71
##  7 Caterers                              49
##  8 Italian Restaurants                   47
##  9 American restaurants                  45
## 10 Italian Restaurant and Pizza Place    43
## # ... with 298 more rows
  1. Repeating steps 2-5, separating this time on the word “and”. This allowed us to further separate categories such as “Italian Restaurant and Pizza Place” into their own values.
## # A tibble: 312 x 2
##    Category            count
##    <chr>               <int>
##  1 Pizza Place          2286
##  2 Restaurant           1104
##  3 Pizza                 655
##  4 Restaurants           650
##  5 Italian Restaurant    225
##  6 "Pizza Place "         81
##  7 Italian                71
##  8 " Pizza Place"         53
##  9 Caterers               49
## 10 Italian Restaurants    47
## # ... with 302 more rows
  1. Cleaning and consolidating into 10 main categories and an “Other” category containing descriptions with < 10 records each.
Clean_cat count
Pizza Place 3228
Restaurant 1821
Italian 462
Other 153
American 93
Bar / Pub 83
Caterer 75
Pizza Restaurant 46
Sandwich Place 37
Delivery / Carry-out 23
Karaoke 10
loc_id Category Clean_cat
1 Pizza Pizza Place
1 Restaurant Restaurant
1 American restaurants American
1 Pizza Place Pizza Place
1 Restaurants Restaurant
2 Pizza Pizza Place
2 Pizza Place Pizza Place
2 Restaurants Restaurant
3 Restaurant Restaurant
3 Pizza Place Pizza Place
  1. Merging with the original dataset to end up with a row for each pizza place & category for analysis by category
# merge w\ new categories & delete original categories

datafiniti_new <- merge(datafiniti_unique, datafiniti_cat, by = 'loc_id') 
datafiniti_new <- datafiniti_new %>% 
  select(-c(categories, Category))

datafiniti_new <- unique(datafiniti_new)

kable(head(select(datafiniti_new, c(loc_id, name, new_cat)), 10)) %>% 
   kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
loc_id name new_cat
1 1 Shotgun Dans Pizza Pizza Place
2 1 Shotgun Dans Pizza Restaurant
3 1 Shotgun Dans Pizza American
6 2 Sauce Pizza Wine Pizza Place
8 2 Sauce Pizza Wine Restaurant
9 3 Mios Pizzeria Restaurant
10 3 Mios Pizzeria Pizza Place
12 4 Hungry Howies Pizza Pizza Place
13 4 Hungry Howies Pizza Delivery / Carry-out
15 4 Hungry Howies Pizza Restaurant

Exploratory Analysis

First, we looked at a brief overview of prices by category:

datafiniti_new %>% 
  group_by(new_cat) %>% 
  summarize(Avg_Min_Price = mean(price_range_min), Avg_Max_Price = mean(price_range_max)) %>% 
  mutate(Range = Avg_Max_Price - Avg_Min_Price) %>% 
  arrange(Avg_Min_Price)
## # A tibble: 11 x 4
##    new_cat              Avg_Min_Price Avg_Max_Price Range
##    <chr>                        <dbl>         <dbl> <dbl>
##  1 Sandwich Place                2             26.2  24.2
##  2 Pizza Restaurant              2.52          26.6  24.1
##  3 Caterer                       4.3           27.6  23.3
##  4 Restaurant                    4.41          27.6  23.2
##  5 Delivery / Carry-out          4.55          27.7  23.2
##  6 Pizza Place                   4.67          27.8  23.1
##  7 Karaoke                       5             28    23  
##  8 Other                         6.16          28.6  22.4
##  9 American                      7.57          29.4  21.8
## 10 Italian                       9.74          30.9  21.2
## 11 Bar / Pub                    10.8           31.5  20.7

Next, we delved a little deeper into the pricing structure:

# Price Exploration -------------------------------------------------------

# min \ max summary

summary(datafiniti_unique$price_range_min) # 3rd quartile is still 0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00    4.67    0.00   50.00
summary(datafiniti_unique$price_range_max) # ranges from 7 to 55
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     7.0    25.0    25.0    27.8    25.0    55.0
# create spread variable

datafiniti_unique <- datafiniti_unique %>% 
  mutate(Range = price_range_max - price_range_min)

datafiniti_new <- datafiniti_new %>% 
  mutate(Range = price_range_max - price_range_min)

summary(datafiniti_unique$Range)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   25.00   25.00   23.13   25.00   34.00
summary(datafiniti_new$Range)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   25.00   25.00   22.95   25.00   34.00
# histogram for each

datafiniti_unique %>% 
  ggplot(aes(x = price_range_min)) +
  geom_histogram(binwidth = 2) +
  coord_cartesian(xlim = c(0,40))  

datafiniti_unique %>% 
  ggplot(aes(x = price_range_max)) +
  geom_histogram(binwidth = 2) +
  coord_cartesian(xlim = c(20,55))

datafiniti_unique %>% 
  ggplot(aes(x = Range)) +
  geom_histogram(binwidth = 2) +
  coord_cartesian(xlim = c(10,30))

# boxplot for each

datafiniti_new %>% 
  ggplot(aes(x = new_cat, y = price_range_min)) +
  geom_boxplot() +
  coord_flip()

datafiniti_new %>% 
  ggplot(aes(x = new_cat, y = price_range_max)) +
  geom_boxplot() +
  coord_flip()

datafiniti_new %>% 
  ggplot(aes(x = new_cat, y = Range)) +
  geom_boxplot() +
  coord_flip()

# Price by Region / State ------------------------------------------------------------------

# add Region to datafiniti_unique

datafiniti_unique <- 
  datafiniti_unique %>% 
  merge(region, by = "State Code") %>% 
  select(-(Division))

# Min / max by region

datafiniti_unique %>% 
  group_by(Region) %>% 
  summarize(Avg_min_price = mean(price_range_min),
            Avg_max_price = mean(price_range_max))
## # A tibble: 4 x 3
##   Region    Avg_min_price Avg_max_price
##   <chr>             <dbl>         <dbl>
## 1 Midwest            4.35          27.6
## 2 Northeast          3.53          27.0
## 3 South              5.14          28.1
## 4 West               5.75          28.5
# Min / max by state

datafiniti_unique %>% 
  group_by(State) %>% 
  summarize(Avg_max_price = mean(price_range_max)) %>% 
  arrange(desc(Avg_max_price))
## # A tibble: 44 x 2
##    State         Avg_max_price
##    <chr>                 <dbl>
##  1 Connecticut            40  
##  2 South Dakota           32.5
##  3 Iowa                   31  
##  4 Alabama                30  
##  5 Nebraska               30  
##  6 Arizona                29.9
##  7 West Virginia          29.3
##  8 Maryland               29.2
##  9 Louisiana              29.1
## 10 Oregon                 29.1
## # ... with 34 more rows
datafiniti_unique %>% 
  group_by(State) %>% 
  summarize(Avg_min_price = mean(price_range_min)) %>% 
  arrange(Avg_min_price)
## # A tibble: 44 x 2
##    State          Avg_min_price
##    <chr>                  <dbl>
##  1 Alaska                  0   
##  2 Delaware                0   
##  3 Hawaii                  0   
##  4 Montana                 0   
##  5 New Jersey              0   
##  6 North Dakota            0   
##  7 Nevada                  1.25
##  8 South Carolina          1.85
##  9 Arkansas                1.92
## 10 Kansas                  2.08
## # ... with 34 more rows

Barstool Restaurant Locations

Locations of the restaurants in the barstool dataset

m <- leaflet(barstool) %>% addTiles()
m %>% addCircles(lng = ~ barstool$longitude, lat = ~ barstool$latitude, popup = barstool$nameL, weight = 8, color = "#fb3004", stroke = TRUE)

This map shows the barstool restaurants which have been reviewed. The circles make it easy to portray a visual of where most reviews are from. For example, zooming in on New York will show hundreds of restaurants.

Datafiniti Point Plot

ggplot(data = datafiniti, aes(x=price_range_max,y=price_range_min)) + 
  geom_point(color = "blue", size=2,shape=17) 

The three calculations below allow to compare to average scores found in the barstool dataset (Critic, Community and Barstool Staff (Dave)) with the scores in the jared dataset.

Pizza_Averages <- c("Dave (Barstool)","Community","Critic","Jared")
y <- c(6.622,7.0846,7.256,6.802)
barplot(y,names.arg = Pizza_Averages,xlab = "Average Review",ylab = "Rating",col = "blue", main = "Pizza Averages")

Modeling

For the second part of my capstone project I wanted to take a “deep” dive into the data and look at different models, relationships, ratings, and mathematical values of the different datasets.

Scatter plots are a great way of examining two numerical variables in their relationship to each other. The first plot I examined was the barstool dataset which included price level and the provider ratings. It is interesting to see that as price increases, not all the ratings do. However, all ratings are in between are a 3 or 4 out of a 1-5 scale. More expensive pizza eliminates any one or two star ratings, but does not gurantee the best pizza.

plot(barstool$price_level, barstool$provider_rating, xlab = "Price", ylab = "Rating")

Modeling is extremely useful in understanding different variable relationships, however simple functions can still show valuable information. I would first like to show the mean from the averages in the community, Dave, critic reviews, and all average scores.

barstool_table <- matrix(c('6.45729','6.876439','6.622721','0.9716847'),ncol = 4, byrow = TRUE)
colnames(barstool_table) <- c('Community','All','Dave','Critic')
barstool_table
##      Community All        Dave       Critic     
## [1,] "6.45729" "6.876439" "6.622721" "0.9716847"

After reviewing the table, the critic average score is extremely low. This is due to a large number of NA replies from the same restaurants. Perhaps a histogram would be more beneficial to see the distribution. Once the histogram is shown, if you ignore the 0s, most responses are in the 6-7 range, consistent with the other reviewers.

hist(barstool$review_stats_critic_average_score)

I next want to go ahead and build a GLM model. To first do this, I want to check all my continuous variables.

library(dplyr)
continuous <- select_if(barstool,is.numeric)
summary(continuous)
##       zip           latitude       longitude        price_level  
##  Min.   : 1748   Min.   :25.79   Min.   :-122.41   Min.   :0.00  
##  1st Qu.:10009   1st Qu.:40.72   1st Qu.: -74.09   1st Qu.:1.00  
##  Median :10019   Median :40.75   Median : -73.99   Median :1.00  
##  Mean   :18531   Mean   :40.19   Mean   : -77.44   Mean   :1.46  
##  3rd Qu.:11234   3rd Qu.:40.78   3rd Qu.: -73.97   3rd Qu.:2.00  
##  Max.   :94133   Max.   :45.00   Max.   : -70.09   Max.   :3.00  
##                  NA's   :2       NA's   :2                       
##  provider_rating provider_review_count review_stats_all_average_score
##  Min.   :2.000   Min.   :   2.0        Min.   :0.100                 
##  1st Qu.:3.500   1st Qu.:  74.0        1st Qu.:6.240                 
##  Median :3.500   Median : 169.0        Median :7.162                 
##  Mean   :3.671   Mean   : 386.1        Mean   :6.876                 
##  3rd Qu.:4.000   3rd Qu.: 392.0        3rd Qu.:7.809                 
##  Max.   :5.000   Max.   :5797.0        Max.   :9.079                 
##                                                                      
##  review_stats_all_count review_stats_all_total_score
##  Min.   :  1.00         Min.   :   0.10             
##  1st Qu.:  4.00         1st Qu.:  23.65             
##  Median :  8.00         Median :  54.10             
##  Mean   : 19.02         Mean   : 149.93             
##  3rd Qu.: 19.00         3rd Qu.: 140.20             
##  Max.   :568.00         Max.   :5045.60             
##                                                     
##  review_stats_community_average_score review_stats_community_count
##  Min.   : 0.000                       Min.   :  0.00              
##  1st Qu.: 6.075                       1st Qu.:  3.00              
##  Median : 7.225                       Median :  7.00              
##  Mean   : 6.457                       Mean   : 17.87              
##  3rd Qu.: 7.873                       3rd Qu.: 18.00              
##  Max.   :10.000                       Max.   :567.00              
##                                                                   
##  review_stats_community_total_score review_stats_critic_average_score
##  Min.   :   0.00                    Min.   : 0.0000                  
##  1st Qu.:  15.65                    1st Qu.: 0.0000                  
##  Median :  47.30                    Median : 0.0000                  
##  Mean   : 142.28                    Mean   : 0.9717                  
##  3rd Qu.: 135.10                    3rd Qu.: 0.0000                  
##  Max.   :5036.30                    Max.   :11.0000                  
##                                                                      
##  review_stats_critic_count review_stats_critic_total_score
##  Min.   :0.0000            Min.   : 0.000                 
##  1st Qu.:0.0000            1st Qu.: 0.000                 
##  Median :0.0000            Median : 0.000                 
##  Mean   :0.1425            Mean   : 1.023                 
##  3rd Qu.:0.0000            3rd Qu.: 0.000                 
##  Max.   :5.0000            Max.   :29.800                 
##                                                           
##  review_stats_dave_average_score review_stats_dave_count
##  Min.   : 0.080                  Min.   :1              
##  1st Qu.: 6.200                  1st Qu.:1              
##  Median : 7.100                  Median :1              
##  Mean   : 6.623                  Mean   :1              
##  3rd Qu.: 7.800                  3rd Qu.:1              
##  Max.   :10.000                  Max.   :1              
##                                                         
##  review_stats_dave_total_score
##  Min.   : 0.080               
##  1st Qu.: 6.200               
##  Median : 7.100               
##  Mean   : 6.623               
##  3rd Qu.: 7.800               
##  Max.   :10.000               
## 

Next, I want to go ahead and check all factor variables in my barstool dataset.

factor <- data.frame(select_if(barstool, is.factor))
ncol(factor)
## [1] 0

The barstool dataset also contained a variety of different different reviews, which contained providers, crtics, the reviewer named “Dave,” etc. I next wanted to build a model to see how the price compared to all these different reviews. I built a logistic regression model which split my data to training (70%) and testing (30%). To make things easier, I created a barstool2 dataset in which I eliminated all the character variables.

barstool2 <- read_csv("pizza_barstool2.csv")
## Parsed with column specification:
## cols(
##   latitude = col_double(),
##   longitude = col_double(),
##   price_level = col_double(),
##   provider_rating = col_double(),
##   provider_review_count = col_double(),
##   review_stats_all_average_score = col_double(),
##   review_stats_all_count = col_double(),
##   review_stats_all_total_score = col_double(),
##   review_stats_community_average_score = col_double(),
##   review_stats_community_count = col_double(),
##   review_stats_community_total_score = col_double(),
##   review_stats_critic_average_score = col_double(),
##   review_stats_critic_count = col_double(),
##   review_stats_critic_total_score = col_double(),
##   review_stats_dave_average_score = col_double(),
##   review_stats_dave_count = col_double(),
##   review_stats_dave_total_score = col_double()
## )
nrow(barstool2)
## [1] 463
index <- sample(nrow(barstool2),463,nrow(barstool2)*70)
barstool.train <- barstool2[index,]
barstool.test <- barstool2[-index,]

Next, I am going to train a logistic regression model with a poisson distribution. I am prediciting an outcome variable with count data and review data. I first used Daves score as my independent variable to see the relationship it had with different reviewers.

barstool.glm0 <- glm(review_stats_dave_average_score~. , family = poisson, data = barstool.train)
summary(barstool.glm0)
## 
## Call:
## glm(formula = review_stats_dave_average_score ~ ., family = poisson, 
##     data = barstool.train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.66630  -0.07488   0.03486   0.10274   0.28923  
## 
## Coefficients: (3 not defined because of singularities)
##                                        Estimate Std. Error z value
## (Intercept)                           3.730e-01  4.134e-01   0.902
## latitude                             -7.349e-05  4.927e-03  -0.015
## longitude                            -6.018e-04  2.051e-03  -0.293
## price_level                          -7.953e-03  3.228e-02  -0.246
## provider_rating                      -2.200e-03  4.211e-02  -0.052
## provider_review_count                 5.180e-06  3.181e-05   0.163
## review_stats_all_average_score        1.674e-02  3.090e-02   0.542
## review_stats_all_count                1.073e-01  2.563e-01   0.419
## review_stats_all_total_score          1.897e-01  2.087e-02   9.089
## review_stats_community_average_score -4.597e-03  1.097e-02  -0.419
## review_stats_community_count         -1.050e-01  2.552e-01  -0.411
## review_stats_community_total_score   -1.901e-01  2.079e-02  -9.143
## review_stats_critic_average_score     6.009e-03  2.228e-02   0.270
## review_stats_critic_count                    NA         NA      NA
## review_stats_critic_total_score      -2.069e-01  5.216e-02  -3.966
## review_stats_dave_count                      NA         NA      NA
## review_stats_dave_total_score                NA         NA      NA
##                                      Pr(>|z|)    
## (Intercept)                             0.367    
## latitude                                0.988    
## longitude                               0.769    
## price_level                             0.805    
## provider_rating                         0.958    
## provider_review_count                   0.871    
## review_stats_all_average_score          0.588    
## review_stats_all_count                  0.676    
## review_stats_all_total_score          < 2e-16 ***
## review_stats_community_average_score    0.675    
## review_stats_community_count            0.681    
## review_stats_community_total_score    < 2e-16 ***
## review_stats_critic_average_score       0.787    
## review_stats_critic_count                  NA    
## review_stats_critic_total_score      7.32e-05 ***
## review_stats_dave_count                    NA    
## review_stats_dave_total_score              NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 275.445  on 462  degrees of freedom
## Residual deviance:  23.508  on 449  degrees of freedom
## AIC: Inf
## 
## Number of Fisher Scoring iterations: 4

I then built another model examining the relationship to the testing dataset.

barstool.glm1 <- glm(review_stats_dave_average_score~. , family = poisson, data = barstool.test)
summary(barstool.glm1)
## 
## Call:
## glm(formula = review_stats_dave_average_score ~ ., family = poisson, 
##     data = barstool.test)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.35941  -0.09083   0.02952   0.13309   0.30522  
## 
## Coefficients: (4 not defined because of singularities)
##                                        Estimate Std. Error z value
## (Intercept)                           2.821e-01  6.427e-01   0.439
## latitude                             -1.267e-03  8.099e-03  -0.156
## longitude                             5.723e-05  2.858e-03   0.020
## price_level                          -1.198e-02  5.766e-02  -0.208
## provider_rating                      -2.895e-02  6.396e-02  -0.453
## provider_review_count                 2.894e-06  5.375e-05   0.054
## review_stats_all_average_score        3.882e-02  6.297e-02   0.617
## review_stats_all_count                2.521e-01  4.278e-01   0.589
## review_stats_all_total_score          1.937e-01  4.061e-02   4.771
## review_stats_community_average_score -9.409e-03  2.374e-02  -0.396
## review_stats_community_count         -2.574e-01  4.295e-01  -0.599
## review_stats_community_total_score   -1.932e-01  3.986e-02  -4.847
## review_stats_critic_average_score    -2.233e-01  6.994e-02  -3.192
## review_stats_critic_count                    NA         NA      NA
## review_stats_critic_total_score              NA         NA      NA
## review_stats_dave_count                      NA         NA      NA
## review_stats_dave_total_score                NA         NA      NA
##                                      Pr(>|z|)    
## (Intercept)                           0.66075    
## latitude                              0.87565    
## longitude                             0.98402    
## price_level                           0.83535    
## provider_rating                       0.65080    
## provider_review_count                 0.95707    
## review_stats_all_average_score        0.53754    
## review_stats_all_count                0.55574    
## review_stats_all_total_score         1.84e-06 ***
## review_stats_community_average_score  0.69185    
## review_stats_community_count          0.54892    
## review_stats_community_total_score   1.25e-06 ***
## review_stats_critic_average_score     0.00141 ** 
## review_stats_critic_count                  NA    
## review_stats_critic_total_score            NA    
## review_stats_dave_count                    NA    
## review_stats_dave_total_score              NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 121.0690  on 171  degrees of freedom
## Residual deviance:   9.4795  on 159  degrees of freedom
## AIC: Inf
## 
## Number of Fisher Scoring iterations: 4

The last model I built had the indepedent variable being price. It is important to look at price and the relationship to different reviews. This model contains an AIC of 1158.

barstool.glm3 <- glm(price_level~. , family = poisson, data = barstool.train)
barstool.glm3
## 
## Call:  glm(formula = price_level ~ ., family = poisson, data = barstool.train)
## 
## Coefficients:
##                          (Intercept)  
##                           -1.524e-01  
##                             latitude  
##                           -3.925e-03  
##                            longitude  
##                           -7.283e-03  
##                      provider_rating  
##                           -1.734e-01  
##                provider_review_count  
##                            1.544e-04  
##       review_stats_all_average_score  
##                            1.184e-01  
##               review_stats_all_count  
##                           -1.237e-01  
##         review_stats_all_total_score  
##                            1.623e-02  
## review_stats_community_average_score  
##                           -2.956e-05  
##         review_stats_community_count  
##                            6.731e-02  
##   review_stats_community_total_score  
##                           -1.009e-02  
##    review_stats_critic_average_score  
##                           -9.887e-03  
##            review_stats_critic_count  
##                                   NA  
##      review_stats_critic_total_score  
##                            1.484e-02  
##      review_stats_dave_average_score  
##                                   NA  
##              review_stats_dave_count  
##                                   NA  
##        review_stats_dave_total_score  
##                                   NA  
## 
## Degrees of Freedom: 462 Total (i.e. Null);  450 Residual
## Null Deviance:       157.5 
## Residual Deviance: 127.6     AIC: 1166

I then compared this model to the testing dataset. The testing dataset contains a lower AIC which shows it would be a better fit for our model.

barstool.glm4 <- glm(price_level~. , family = poisson, data = barstool.test)
barstool.glm4
## 
## Call:  glm(formula = price_level ~ ., family = poisson, data = barstool.test)
## 
## Coefficients:
##                          (Intercept)  
##                           -0.5738806  
##                             latitude  
##                           -0.0005977  
##                            longitude  
##                           -0.0033421  
##                      provider_rating  
##                           -0.0218471  
##                provider_review_count  
##                            0.0001125  
##       review_stats_all_average_score  
##                            0.1098824  
##               review_stats_all_count  
##                            0.2117543  
##         review_stats_all_total_score  
##                            0.0264099  
## review_stats_community_average_score  
##                           -0.0466337  
##         review_stats_community_count  
##                           -0.2659260  
##   review_stats_community_total_score  
##                           -0.0202715  
##    review_stats_critic_average_score  
##                           -0.0468521  
##            review_stats_critic_count  
##                                   NA  
##      review_stats_critic_total_score  
##                                   NA  
##      review_stats_dave_average_score  
##                                   NA  
##              review_stats_dave_count  
##                                   NA  
##        review_stats_dave_total_score  
##                                   NA  
## 
## Degrees of Freedom: 171 Total (i.e. Null);  160 Residual
## Null Deviance:       52.51 
## Residual Deviance: 43.53     AIC: 451.8

Next I examine BIC on the model containing price level as independent variable, which is consistent with the AIC values above.

BIC(barstool.glm3)
## [1] 1219.986
BIC(barstool.glm4)
## [1] 489.5277