In the analysis of these 3 datasets, we are addressing average rating, location, and pricing information of various pizza establishments around the US. The goal is to assist the average American consumer in making informed decisions about which pizza restaurants they would like to patronize.
This analysis is based on 3 datasets with different information about pizza restaurants in the US:
Barstool - Contains critic, public and the barstool staff’s rating as well as pricing, location and geolocation of the different pizza places.
Jared - Contains ratings for New York pizza restaurants on a 6-point likert scale.
Datafiniti - Contains price range, location and keywords for pizza restaurants around the US.
Region - Contains state by region from the US Census Bureau
To learn more about the datasets used, click here or view a more detailed explanation in the Data Preparation section below.
Our approach was to divide the analysis to focus on two groups of consumers: 1) those who are price-conscious and 2) those who are interested in finding the pizza place near them with the highest rating.
library(tidyverse) # data manipulation and cleaning
library(knitr) # data dictionary tables
library(kableExtra) # table formatting
library(readxl) # importing dataset descriptions
library(leaflet) # interactive map & geolocation
The pizza data was originally compiled on September 30th, 2019 as a part of a social data project called TidyTuesday. The original intent is for R users to practice their creativity and analytical techniques, specifically with the tidyverse package.
For those who are interested, there’s a 5-minute podcast reviewing the top entries and different visualization techniques for Week 7’s project called “Pizza Party”.
We created the regions dataset based on the definitions from the US Census Bureau.
| variable | class | description |
|---|---|---|
| polla_qid | integer | Quiz ID |
| answer | character | Answer (likert scale) |
| votes | integer | Number of votes for that question/answer combo |
| pollq_id | integer | Poll Question ID |
| question | character | Question |
| place | character | Pizza Place |
| time | integer | Time of quiz |
| total_votes | integer | Total number of votes for that pizza place |
| percent | double | Vote percent of total for that pizza place |
| variable | class | description |
|---|---|---|
| name | character | Pizza place name |
| address1 | character | Pizza place address |
| city | character | City |
| zip | double | Zip |
| country | character | Country |
| latitude | double | Latitude |
| longitude | double | Longitude |
| price_level | double | Price rating (smaller = cheaper) |
| provider_rating | double | Provider review score |
| provider_review_count | double | Provider review count |
| review_stats_all_average_score | double | Average Score |
| review_stats_all_count | double | Count of all reviews |
| review_stats_all_total_score | double | Review total score |
| review_stats_community_average_score | double | Community average score |
| review_stats_community_count | double | community review count |
| review_stats_community_total_score | double | community review total score |
| review_stats_critic_average_score | double | Critic average score |
| review_stats_critic_count | double | Critic review count |
| review_stats_critic_total_score | double | Critic total score |
| review_stats_dave_average_score | double | Dave (Barstool) average score |
| review_stats_dave_count | double | Dave review count |
| review_stats_dave_total_score | double | Dave total score |
| variable | class | description |
|---|---|---|
| polla_qid | integer | Quiz ID |
| answer | character | Answer (likert scale) |
| votes | integer | Number of votes for that question/answer combo |
| pollq_id | integer | Poll Question ID |
| question | character | Question |
| place | character | Pizza Place |
| time | integer | Time of quiz |
| total_votes | integer | Total number of votes for that pizza place |
| percent | double | Vote percent of total for that pizza place |
colSums(is.na(barstool))
## name address1
## 0 0
## city zip
## 0 0
## country latitude
## 0 2
## longitude price_level
## 2 0
## provider_rating provider_review_count
## 0 0
## review_stats_all_average_score review_stats_all_count
## 0 0
## review_stats_all_total_score review_stats_community_average_score
## 0 0
## review_stats_community_count review_stats_community_total_score
## 0 0
## review_stats_critic_average_score review_stats_critic_count
## 0 0
## review_stats_critic_total_score review_stats_dave_average_score
## 0 0
## review_stats_dave_count review_stats_dave_total_score
## 0 0
colSums(is.na(jared))
## polla_qid answer votes pollq_id question place
## 0 0 0 0 0 0
## time total_votes percent
## 0 0 5
colSums(is.na(datafiniti))
## name address city country
## 0 0 0 0
## province latitude longitude categories
## 0 0 0 0
## price_range_min price_range_max
## 0 0
colSums(is.na(region))
## State State Code Region Division
## 0 0 0 0
The jared dataset pizza ratings are on a 1 to 6 likert scale. However, barstool is on a 1-10 scale. We used a mutate function to convert the 1-6 scale to a 1-10 scale:
jared <- jared %>%
mutate(answer = case_when(
.$answer=="Never Again" ~ 0,
.$answer=="Poor" ~ 2,
.$answer=="Fair" ~ 4,
.$answer=="Average" ~ 6,
.$answer=="Good"~ 8,
.$answer=="Excellent" ~ 10))
jared$answer <- as.integer(jared$answer)
str(jared$answer)
## int [1:375] 10 8 6 2 0 10 8 6 2 0 ...
Once the answers are converted to integers, a calculation is done to find the average rating for each pizza place:
jared <- mutate(jared,Weighted_Rating = answer*votes)
(Jared_Average <- jared %>%
group_by(place) %>%
summarise(avg_score = sum(Weighted_Rating)/sum(votes)))
## # A tibble: 56 x 2
## place avg_score
## <chr> <dbl>
## 1 5 Boroughs Pizza 7.33
## 2 Artichoke Basille's Pizza 8
## 3 Arturo's 7.43
## 4 Bella Napoli 7.07
## 5 Ben's of SoHo 14th Street 4.8
## 6 Ben's of SoHo Spring Street 6.44
## 7 Big Slice Pizza 6.27
## 8 Bleecker Street Pizza 8.29
## 9 Bravo Pizza NaN
## 10 Cavallo's Pizza 7.27
## # ... with 46 more rows
(Jared_Mean <- mean(Jared_Average$avg_score, na.rm = T))
## [1] 6.802782
Two columns are dropped for being insignificant:
jared$polla_qid = NULL
jared$time = NULL
head(jared)
This calculation takes the mean of average scores omiting the 0 values
Critic_average <- mean(NA^(barstool$review_stats_critic_average_score == 0)*barstool$review_stats_critic_average_score, na.rm = TRUE)
Daves_average <- mean(NA^(barstool$review_stats_dave_average_score == 0)*barstool$review_stats_dave_average_score, na.rm = TRUE)
Community_average <- mean(NA^(barstool$review_stats_community_average_score == 0)*barstool$review_stats_community_average_score, na.rm = TRUE)
The three calculations below allow to compare to average scores found in the barstool dataset (Critic, Community and Barstool Staff (Dave)) with the scores in the jared dataset.
Represents the values that are in both the barstool and jared datasets
## [1] "Williamsburg Pizza" "Little Italy Pizza"
## [3] "Steve's Pizza" "Girello"
## [5] "5 Boroughs Pizza" "Saluggi's"
## [7] "Artichoke Basille's Pizza" "Bleecker Street Pizza"
## [9] "Joe's Pizza" "Champion Pizza"
## [11] "Prince Street Pizza" "Kiss My Slice"
## [13] "Arturo's" "Vinny Vincenz"
## [15] "Stella's Pizza" "Gotham Pizza"
## [17] "Highline Pizza" "Pizza Italia"
## [19] "Rivoli Pizza" "NY Pizza Suprema"
## [21] "Previti Pizza" "Rocco's Pizza Joint"
Since there weren’t any NA values in the datafiniti dataset, we simply removed duplicate values and changed the name of “province”" to “State Code” to align with the column name in the region dataset.
# Remove duplicates
datafiniti_unique <- unique(datafiniti)
datafiniti_unique # 2,285 unique observations
## # A tibble: 2,285 x 10
## name address city country province latitude longitude categories
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 Shot~ 4203 E~ Sher~ US AR 34.8 -92.2 Pizza,Res~
## 2 Sauc~ 25 E C~ Phoe~ US AZ 33.5 -112. Pizza,Piz~
## 3 Mios~ 3703 P~ Cinc~ US OH 39.1 -84.4 Restauran~
## 4 Hung~ 30495 ~ Madi~ US MI 42.5 -83.1 Pizza,Car~
## 5 Spar~ 3600 E~ Balt~ US MD 39.3 -76.6 Pizza,Ame~
## 6 La V~ 1834 E~ Berk~ US CA 37.9 -122. Pizza Pla~
## 7 Bric~ 4819 K~ Tall~ US FL 30.5 -84.2 Pizza Pla~
## 8 Carr~ 4061 2~ Gran~ US MI 42.9 -85.6 Restauran~
## 9 Dome~ 146 N ~ Glen~ US CA 34.1 -118. Pizza,Res~
## 10 Litt~ 965 N ~ El P~ US TX 31.9 -107. Restauran~
## # ... with 2,275 more rows, and 2 more variables: price_range_min <dbl>,
## # price_range_max <dbl>
# rename province
names(datafiniti_unique)[names(datafiniti_unique) == 'province'] <- 'State Code'
Our next task was to make sense of the categories column, which originally contained values in one column:
head(datafiniti_unique$categories)
## [1] "Pizza,Restaurant,American restaurants,Pizza Place,Restaurants"
## [2] "Pizza,Pizza Place,Restaurants"
## [3] "Restaurant,Pizza Place,Restaurants"
## [4] "Pizza,Carry-out food,Pizza Place,Restaurants"
## [5] "Pizza,American restaurants,Pizza Place,Pizza equipment and supplies,Restaurants"
## [6] "Pizza Place"
We accomplished this by:
## loc_id name
## 1 1 Shotgun Dans Pizza
## 2 2 Sauce Pizza Wine
## 3 3 Mios Pizzeria
## 4 4 Hungry Howies Pizza
## 5 5 Spartan Pizzeria
## 6 6 La Vals
# Separate categories
datafiniti_cat <- datafiniti_unique %>% separate(categories, c("Cat1", "Cat2", "Cat3", "Cat4","Cat5",
"Cat6", "Cat7", "Cat8", "Cat9", "Cat10",
"Cat11"), sep = ",")
## loc_id Cat1 Cat2 Cat3
## 1 1 Pizza Restaurant American restaurants
## 2 2 Pizza Pizza Place Restaurants
## 3 3 Restaurant Pizza Place Restaurants
## 4 4 Pizza Carry-out food Pizza Place
## 5 5 Pizza American restaurants Pizza Place
## 6 6 Pizza Place <NA> <NA>
## Cat4 Cat5 Cat6 Cat7 Cat8 Cat9 Cat10
## 1 Pizza Place Restaurants <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 4 Restaurants <NA> <NA> <NA> <NA> <NA> <NA>
## 5 Pizza equipment and supplies Restaurants <NA> <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## loc_id Category
## 1 1 Pizza
## 2 1 Restaurant
## 3 1 American restaurants
## 4 1 Pizza Place
## 5 1 Restaurants
## 11 2 Pizza
# unique categories w/ count
datafiniti_cat %>%
group_by(Category) %>%
summarize(count = n()) %>%
arrange(desc(count))
## # A tibble: 308 x 2
## Category count
## <chr> <int>
## 1 Pizza Place 2286
## 2 Restaurant 1104
## 3 Pizza 655
## 4 Restaurants 650
## 5 Italian Restaurant 225
## 6 Italian 71
## 7 Caterers 49
## 8 Italian Restaurants 47
## 9 American restaurants 45
## 10 Italian Restaurant and Pizza Place 43
## # ... with 298 more rows
## # A tibble: 312 x 2
## Category count
## <chr> <int>
## 1 Pizza Place 2286
## 2 Restaurant 1104
## 3 Pizza 655
## 4 Restaurants 650
## 5 Italian Restaurant 225
## 6 "Pizza Place " 81
## 7 Italian 71
## 8 " Pizza Place" 53
## 9 Caterers 49
## 10 Italian Restaurants 47
## # ... with 302 more rows
| Clean_cat | count |
|---|---|
| Pizza Place | 3228 |
| Restaurant | 1821 |
| Italian | 462 |
| Other | 153 |
| American | 93 |
| Bar / Pub | 83 |
| Caterer | 75 |
| Pizza Restaurant | 46 |
| Sandwich Place | 37 |
| Delivery / Carry-out | 23 |
| Karaoke | 10 |
| loc_id | Category | Clean_cat |
|---|---|---|
| 1 | Pizza | Pizza Place |
| 1 | Restaurant | Restaurant |
| 1 | American restaurants | American |
| 1 | Pizza Place | Pizza Place |
| 1 | Restaurants | Restaurant |
| 2 | Pizza | Pizza Place |
| 2 | Pizza Place | Pizza Place |
| 2 | Restaurants | Restaurant |
| 3 | Restaurant | Restaurant |
| 3 | Pizza Place | Pizza Place |
# merge w\ new categories & delete original categories
datafiniti_new <- merge(datafiniti_unique, datafiniti_cat, by = 'loc_id')
datafiniti_new <- datafiniti_new %>%
select(-c(categories, Category))
datafiniti_new <- unique(datafiniti_new)
kable(head(select(datafiniti_new, c(loc_id, name, new_cat)), 10)) %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
| loc_id | name | new_cat | |
|---|---|---|---|
| 1 | 1 | Shotgun Dans Pizza | Pizza Place |
| 2 | 1 | Shotgun Dans Pizza | Restaurant |
| 3 | 1 | Shotgun Dans Pizza | American |
| 6 | 2 | Sauce Pizza Wine | Pizza Place |
| 8 | 2 | Sauce Pizza Wine | Restaurant |
| 9 | 3 | Mios Pizzeria | Restaurant |
| 10 | 3 | Mios Pizzeria | Pizza Place |
| 12 | 4 | Hungry Howies Pizza | Pizza Place |
| 13 | 4 | Hungry Howies Pizza | Delivery / Carry-out |
| 15 | 4 | Hungry Howies Pizza | Restaurant |
First, we looked at a brief overview of prices by category:
datafiniti_new %>%
group_by(new_cat) %>%
summarize(Avg_Min_Price = mean(price_range_min), Avg_Max_Price = mean(price_range_max)) %>%
mutate(Range = Avg_Max_Price - Avg_Min_Price) %>%
arrange(Avg_Min_Price)
## # A tibble: 11 x 4
## new_cat Avg_Min_Price Avg_Max_Price Range
## <chr> <dbl> <dbl> <dbl>
## 1 Sandwich Place 2 26.2 24.2
## 2 Pizza Restaurant 2.52 26.6 24.1
## 3 Caterer 4.3 27.6 23.3
## 4 Restaurant 4.41 27.6 23.2
## 5 Delivery / Carry-out 4.55 27.7 23.2
## 6 Pizza Place 4.67 27.8 23.1
## 7 Karaoke 5 28 23
## 8 Other 6.16 28.6 22.4
## 9 American 7.57 29.4 21.8
## 10 Italian 9.74 30.9 21.2
## 11 Bar / Pub 10.8 31.5 20.7
Next, we delved a little deeper into the pricing structure:
# Price Exploration -------------------------------------------------------
# min \ max summary
summary(datafiniti_unique$price_range_min) # 3rd quartile is still 0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 4.67 0.00 50.00
summary(datafiniti_unique$price_range_max) # ranges from 7 to 55
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.0 25.0 25.0 27.8 25.0 55.0
# create spread variable
datafiniti_unique <- datafiniti_unique %>%
mutate(Range = price_range_max - price_range_min)
datafiniti_new <- datafiniti_new %>%
mutate(Range = price_range_max - price_range_min)
summary(datafiniti_unique$Range)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 25.00 25.00 23.13 25.00 34.00
summary(datafiniti_new$Range)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 25.00 25.00 22.95 25.00 34.00
# histogram for each
datafiniti_unique %>%
ggplot(aes(x = price_range_min)) +
geom_histogram(binwidth = 2) +
coord_cartesian(xlim = c(0,40))
datafiniti_unique %>%
ggplot(aes(x = price_range_max)) +
geom_histogram(binwidth = 2) +
coord_cartesian(xlim = c(20,55))
datafiniti_unique %>%
ggplot(aes(x = Range)) +
geom_histogram(binwidth = 2) +
coord_cartesian(xlim = c(10,30))
# boxplot for each
datafiniti_new %>%
ggplot(aes(x = new_cat, y = price_range_min)) +
geom_boxplot() +
coord_flip()
datafiniti_new %>%
ggplot(aes(x = new_cat, y = price_range_max)) +
geom_boxplot() +
coord_flip()
datafiniti_new %>%
ggplot(aes(x = new_cat, y = Range)) +
geom_boxplot() +
coord_flip()
# Price by Region / State ------------------------------------------------------------------
# add Region to datafiniti_unique
datafiniti_unique <-
datafiniti_unique %>%
merge(region, by = "State Code") %>%
select(-(Division))
# Min / max by region
datafiniti_unique %>%
group_by(Region) %>%
summarize(Avg_min_price = mean(price_range_min),
Avg_max_price = mean(price_range_max))
## # A tibble: 4 x 3
## Region Avg_min_price Avg_max_price
## <chr> <dbl> <dbl>
## 1 Midwest 4.35 27.6
## 2 Northeast 3.53 27.0
## 3 South 5.14 28.1
## 4 West 5.75 28.5
# Min / max by state
datafiniti_unique %>%
group_by(State) %>%
summarize(Avg_max_price = mean(price_range_max)) %>%
arrange(desc(Avg_max_price))
## # A tibble: 44 x 2
## State Avg_max_price
## <chr> <dbl>
## 1 Connecticut 40
## 2 South Dakota 32.5
## 3 Iowa 31
## 4 Alabama 30
## 5 Nebraska 30
## 6 Arizona 29.9
## 7 West Virginia 29.3
## 8 Maryland 29.2
## 9 Louisiana 29.1
## 10 Oregon 29.1
## # ... with 34 more rows
datafiniti_unique %>%
group_by(State) %>%
summarize(Avg_min_price = mean(price_range_min)) %>%
arrange(Avg_min_price)
## # A tibble: 44 x 2
## State Avg_min_price
## <chr> <dbl>
## 1 Alaska 0
## 2 Delaware 0
## 3 Hawaii 0
## 4 Montana 0
## 5 New Jersey 0
## 6 North Dakota 0
## 7 Nevada 1.25
## 8 South Carolina 1.85
## 9 Arkansas 1.92
## 10 Kansas 2.08
## # ... with 34 more rows
Locations of the restaurants in the barstool dataset
m <- leaflet(barstool) %>% addTiles()
m %>% addCircles(lng = ~ barstool$longitude, lat = ~ barstool$latitude, popup = barstool$nameL, weight = 8, color = "#fb3004", stroke = TRUE)
This map shows the barstool restaurants which have been reviewed. The circles make it easy to portray a visual of where most reviews are from. For example, zooming in on New York will show hundreds of restaurants.
ggplot(data = datafiniti, aes(x=price_range_max,y=price_range_min)) +
geom_point(color = "blue", size=2,shape=17)
The three calculations below allow to compare to average scores found in the barstool dataset (Critic, Community and Barstool Staff (Dave)) with the scores in the jared dataset.
Pizza_Averages <- c("Dave (Barstool)","Community","Critic","Jared")
y <- c(6.622,7.0846,7.256,6.802)
barplot(y,names.arg = Pizza_Averages,xlab = "Average Review",ylab = "Rating",col = "blue", main = "Pizza Averages")
For the second part of my capstone project I wanted to take a “deep” dive into the data and look at different models, relationships, ratings, and mathematical values of the different datasets.
Scatter plots are a great way of examining two numerical variables in their relationship to each other. The first plot I examined was the barstool dataset which included price level and the provider ratings. It is interesting to see that as price increases, not all the ratings do. However, all ratings are in between are a 3 or 4 out of a 1-5 scale. More expensive pizza eliminates any one or two star ratings, but does not gurantee the best pizza.
plot(barstool$price_level, barstool$provider_rating, xlab = "Price", ylab = "Rating")
Modeling is extremely useful in understanding different variable relationships, however simple functions can still show valuable information. I would first like to show the mean from the averages in the community, Dave, critic reviews, and all average scores.
barstool_table <- matrix(c('6.45729','6.876439','6.622721','0.9716847'),ncol = 4, byrow = TRUE)
colnames(barstool_table) <- c('Community','All','Dave','Critic')
barstool_table
## Community All Dave Critic
## [1,] "6.45729" "6.876439" "6.622721" "0.9716847"
After reviewing the table, the critic average score is extremely low. This is due to a large number of NA replies from the same restaurants. Perhaps a histogram would be more beneficial to see the distribution. Once the histogram is shown, if you ignore the 0s, most responses are in the 6-7 range, consistent with the other reviewers.
hist(barstool$review_stats_critic_average_score)
I next want to go ahead and build a GLM model. To first do this, I want to check all my continuous variables.
library(dplyr)
continuous <- select_if(barstool,is.numeric)
summary(continuous)
## zip latitude longitude price_level
## Min. : 1748 Min. :25.79 Min. :-122.41 Min. :0.00
## 1st Qu.:10009 1st Qu.:40.72 1st Qu.: -74.09 1st Qu.:1.00
## Median :10019 Median :40.75 Median : -73.99 Median :1.00
## Mean :18531 Mean :40.19 Mean : -77.44 Mean :1.46
## 3rd Qu.:11234 3rd Qu.:40.78 3rd Qu.: -73.97 3rd Qu.:2.00
## Max. :94133 Max. :45.00 Max. : -70.09 Max. :3.00
## NA's :2 NA's :2
## provider_rating provider_review_count review_stats_all_average_score
## Min. :2.000 Min. : 2.0 Min. :0.100
## 1st Qu.:3.500 1st Qu.: 74.0 1st Qu.:6.240
## Median :3.500 Median : 169.0 Median :7.162
## Mean :3.671 Mean : 386.1 Mean :6.876
## 3rd Qu.:4.000 3rd Qu.: 392.0 3rd Qu.:7.809
## Max. :5.000 Max. :5797.0 Max. :9.079
##
## review_stats_all_count review_stats_all_total_score
## Min. : 1.00 Min. : 0.10
## 1st Qu.: 4.00 1st Qu.: 23.65
## Median : 8.00 Median : 54.10
## Mean : 19.02 Mean : 149.93
## 3rd Qu.: 19.00 3rd Qu.: 140.20
## Max. :568.00 Max. :5045.60
##
## review_stats_community_average_score review_stats_community_count
## Min. : 0.000 Min. : 0.00
## 1st Qu.: 6.075 1st Qu.: 3.00
## Median : 7.225 Median : 7.00
## Mean : 6.457 Mean : 17.87
## 3rd Qu.: 7.873 3rd Qu.: 18.00
## Max. :10.000 Max. :567.00
##
## review_stats_community_total_score review_stats_critic_average_score
## Min. : 0.00 Min. : 0.0000
## 1st Qu.: 15.65 1st Qu.: 0.0000
## Median : 47.30 Median : 0.0000
## Mean : 142.28 Mean : 0.9717
## 3rd Qu.: 135.10 3rd Qu.: 0.0000
## Max. :5036.30 Max. :11.0000
##
## review_stats_critic_count review_stats_critic_total_score
## Min. :0.0000 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.: 0.000
## Median :0.0000 Median : 0.000
## Mean :0.1425 Mean : 1.023
## 3rd Qu.:0.0000 3rd Qu.: 0.000
## Max. :5.0000 Max. :29.800
##
## review_stats_dave_average_score review_stats_dave_count
## Min. : 0.080 Min. :1
## 1st Qu.: 6.200 1st Qu.:1
## Median : 7.100 Median :1
## Mean : 6.623 Mean :1
## 3rd Qu.: 7.800 3rd Qu.:1
## Max. :10.000 Max. :1
##
## review_stats_dave_total_score
## Min. : 0.080
## 1st Qu.: 6.200
## Median : 7.100
## Mean : 6.623
## 3rd Qu.: 7.800
## Max. :10.000
##
Next, I want to go ahead and check all factor variables in my barstool dataset.
factor <- data.frame(select_if(barstool, is.factor))
ncol(factor)
## [1] 0
The barstool dataset also contained a variety of different different reviews, which contained providers, crtics, the reviewer named “Dave,” etc. I next wanted to build a model to see how the price compared to all these different reviews. I built a logistic regression model which split my data to training (70%) and testing (30%). To make things easier, I created a barstool2 dataset in which I eliminated all the character variables.
barstool2 <- read_csv("pizza_barstool2.csv")
## Parsed with column specification:
## cols(
## latitude = col_double(),
## longitude = col_double(),
## price_level = col_double(),
## provider_rating = col_double(),
## provider_review_count = col_double(),
## review_stats_all_average_score = col_double(),
## review_stats_all_count = col_double(),
## review_stats_all_total_score = col_double(),
## review_stats_community_average_score = col_double(),
## review_stats_community_count = col_double(),
## review_stats_community_total_score = col_double(),
## review_stats_critic_average_score = col_double(),
## review_stats_critic_count = col_double(),
## review_stats_critic_total_score = col_double(),
## review_stats_dave_average_score = col_double(),
## review_stats_dave_count = col_double(),
## review_stats_dave_total_score = col_double()
## )
nrow(barstool2)
## [1] 463
index <- sample(nrow(barstool2),463,nrow(barstool2)*70)
barstool.train <- barstool2[index,]
barstool.test <- barstool2[-index,]
Next, I am going to train a logistic regression model with a poisson distribution. I am prediciting an outcome variable with count data and review data. I first used Daves score as my independent variable to see the relationship it had with different reviewers.
barstool.glm0 <- glm(review_stats_dave_average_score~. , family = poisson, data = barstool.train)
summary(barstool.glm0)
##
## Call:
## glm(formula = review_stats_dave_average_score ~ ., family = poisson,
## data = barstool.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.66630 -0.07488 0.03486 0.10274 0.28923
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error z value
## (Intercept) 3.730e-01 4.134e-01 0.902
## latitude -7.349e-05 4.927e-03 -0.015
## longitude -6.018e-04 2.051e-03 -0.293
## price_level -7.953e-03 3.228e-02 -0.246
## provider_rating -2.200e-03 4.211e-02 -0.052
## provider_review_count 5.180e-06 3.181e-05 0.163
## review_stats_all_average_score 1.674e-02 3.090e-02 0.542
## review_stats_all_count 1.073e-01 2.563e-01 0.419
## review_stats_all_total_score 1.897e-01 2.087e-02 9.089
## review_stats_community_average_score -4.597e-03 1.097e-02 -0.419
## review_stats_community_count -1.050e-01 2.552e-01 -0.411
## review_stats_community_total_score -1.901e-01 2.079e-02 -9.143
## review_stats_critic_average_score 6.009e-03 2.228e-02 0.270
## review_stats_critic_count NA NA NA
## review_stats_critic_total_score -2.069e-01 5.216e-02 -3.966
## review_stats_dave_count NA NA NA
## review_stats_dave_total_score NA NA NA
## Pr(>|z|)
## (Intercept) 0.367
## latitude 0.988
## longitude 0.769
## price_level 0.805
## provider_rating 0.958
## provider_review_count 0.871
## review_stats_all_average_score 0.588
## review_stats_all_count 0.676
## review_stats_all_total_score < 2e-16 ***
## review_stats_community_average_score 0.675
## review_stats_community_count 0.681
## review_stats_community_total_score < 2e-16 ***
## review_stats_critic_average_score 0.787
## review_stats_critic_count NA
## review_stats_critic_total_score 7.32e-05 ***
## review_stats_dave_count NA
## review_stats_dave_total_score NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 275.445 on 462 degrees of freedom
## Residual deviance: 23.508 on 449 degrees of freedom
## AIC: Inf
##
## Number of Fisher Scoring iterations: 4
I then built another model examining the relationship to the testing dataset.
barstool.glm1 <- glm(review_stats_dave_average_score~. , family = poisson, data = barstool.test)
summary(barstool.glm1)
##
## Call:
## glm(formula = review_stats_dave_average_score ~ ., family = poisson,
## data = barstool.test)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.35941 -0.09083 0.02952 0.13309 0.30522
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error z value
## (Intercept) 2.821e-01 6.427e-01 0.439
## latitude -1.267e-03 8.099e-03 -0.156
## longitude 5.723e-05 2.858e-03 0.020
## price_level -1.198e-02 5.766e-02 -0.208
## provider_rating -2.895e-02 6.396e-02 -0.453
## provider_review_count 2.894e-06 5.375e-05 0.054
## review_stats_all_average_score 3.882e-02 6.297e-02 0.617
## review_stats_all_count 2.521e-01 4.278e-01 0.589
## review_stats_all_total_score 1.937e-01 4.061e-02 4.771
## review_stats_community_average_score -9.409e-03 2.374e-02 -0.396
## review_stats_community_count -2.574e-01 4.295e-01 -0.599
## review_stats_community_total_score -1.932e-01 3.986e-02 -4.847
## review_stats_critic_average_score -2.233e-01 6.994e-02 -3.192
## review_stats_critic_count NA NA NA
## review_stats_critic_total_score NA NA NA
## review_stats_dave_count NA NA NA
## review_stats_dave_total_score NA NA NA
## Pr(>|z|)
## (Intercept) 0.66075
## latitude 0.87565
## longitude 0.98402
## price_level 0.83535
## provider_rating 0.65080
## provider_review_count 0.95707
## review_stats_all_average_score 0.53754
## review_stats_all_count 0.55574
## review_stats_all_total_score 1.84e-06 ***
## review_stats_community_average_score 0.69185
## review_stats_community_count 0.54892
## review_stats_community_total_score 1.25e-06 ***
## review_stats_critic_average_score 0.00141 **
## review_stats_critic_count NA
## review_stats_critic_total_score NA
## review_stats_dave_count NA
## review_stats_dave_total_score NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 121.0690 on 171 degrees of freedom
## Residual deviance: 9.4795 on 159 degrees of freedom
## AIC: Inf
##
## Number of Fisher Scoring iterations: 4
The last model I built had the indepedent variable being price. It is important to look at price and the relationship to different reviews. This model contains an AIC of 1158.
barstool.glm3 <- glm(price_level~. , family = poisson, data = barstool.train)
barstool.glm3
##
## Call: glm(formula = price_level ~ ., family = poisson, data = barstool.train)
##
## Coefficients:
## (Intercept)
## -1.524e-01
## latitude
## -3.925e-03
## longitude
## -7.283e-03
## provider_rating
## -1.734e-01
## provider_review_count
## 1.544e-04
## review_stats_all_average_score
## 1.184e-01
## review_stats_all_count
## -1.237e-01
## review_stats_all_total_score
## 1.623e-02
## review_stats_community_average_score
## -2.956e-05
## review_stats_community_count
## 6.731e-02
## review_stats_community_total_score
## -1.009e-02
## review_stats_critic_average_score
## -9.887e-03
## review_stats_critic_count
## NA
## review_stats_critic_total_score
## 1.484e-02
## review_stats_dave_average_score
## NA
## review_stats_dave_count
## NA
## review_stats_dave_total_score
## NA
##
## Degrees of Freedom: 462 Total (i.e. Null); 450 Residual
## Null Deviance: 157.5
## Residual Deviance: 127.6 AIC: 1166
I then compared this model to the testing dataset. The testing dataset contains a lower AIC which shows it would be a better fit for our model.
barstool.glm4 <- glm(price_level~. , family = poisson, data = barstool.test)
barstool.glm4
##
## Call: glm(formula = price_level ~ ., family = poisson, data = barstool.test)
##
## Coefficients:
## (Intercept)
## -0.5738806
## latitude
## -0.0005977
## longitude
## -0.0033421
## provider_rating
## -0.0218471
## provider_review_count
## 0.0001125
## review_stats_all_average_score
## 0.1098824
## review_stats_all_count
## 0.2117543
## review_stats_all_total_score
## 0.0264099
## review_stats_community_average_score
## -0.0466337
## review_stats_community_count
## -0.2659260
## review_stats_community_total_score
## -0.0202715
## review_stats_critic_average_score
## -0.0468521
## review_stats_critic_count
## NA
## review_stats_critic_total_score
## NA
## review_stats_dave_average_score
## NA
## review_stats_dave_count
## NA
## review_stats_dave_total_score
## NA
##
## Degrees of Freedom: 171 Total (i.e. Null); 160 Residual
## Null Deviance: 52.51
## Residual Deviance: 43.53 AIC: 451.8
Next I examine BIC on the model containing price level as independent variable, which is consistent with the AIC values above.
BIC(barstool.glm3)
## [1] 1219.986
BIC(barstool.glm4)
## [1] 489.5277