In the final project, we wanted to explore the relationship between customer review and sanitary rating. We obtained our data from two sources: web page and CSV by utilizing Yelp API for customer reviews and NYC Open Data for inspection scores and grades with location info. We were interested to find out if cleanliness and safety could potentially affect customers’ reviews.
Reference: http://rpubs.com/jasano/inspex Score instruction: https://www1.nyc.gov/assets/doh/downloads/pdf/rii/how-we-score-grade.pdf blog: https://nycdatascience.com/blog/student-works/nyc-restaurants-reviews-and-inspection-scores/
library(RSocrata)
## Warning: package 'RSocrata' was built under R version 3.5.2
library(data.table)
## Warning: package 'data.table' was built under R version 3.5.2
library(httr)
library(keyring)
library(rjson)
library(RJSONIO)
##
## Attaching package: 'RJSONIO'
## The following objects are masked from 'package:rjson':
##
## fromJSON, toJSON
library(jsonlite)
##
## Attaching package: 'jsonlite'
## The following objects are masked from 'package:RJSONIO':
##
## fromJSON, toJSON
## The following objects are masked from 'package:rjson':
##
## fromJSON, toJSON
library(curl)
## Warning: package 'curl' was built under R version 3.5.2
##
## Attaching package: 'curl'
## The following object is masked from 'package:httr':
##
## handle_reset
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.3.2
## ✔ tibble 2.1.1 ✔ dplyr 0.7.8
## ✔ tidyr 0.8.2 ✔ stringr 1.3.1
## ✔ readr 1.3.0 ✔ forcats 0.3.0
## Warning: package 'tibble' was built under R version 3.5.2
## Warning: package 'purrr' was built under R version 3.5.2
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between() masks data.table::between()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::first() masks data.table::first()
## ✖ purrr::flatten() masks jsonlite::flatten()
## ✖ jsonlite::fromJSON() masks RJSONIO::fromJSON(), rjson::fromJSON()
## ✖ curl::handle_reset() masks httr::handle_reset()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::last() masks data.table::last()
## ✖ readr::parse_date() masks curl::parse_date()
## ✖ jsonlite::toJSON() masks RJSONIO::toJSON(), rjson::toJSON()
## ✖ purrr::transpose() masks data.table::transpose()
library(yelpr)
library(stringr)
library(ggmap)
## Warning: package 'ggmap' was built under R version 3.5.2
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
all_df<- read.socrata("https://data.cityofnewyork.us/resource/43nn-pn8j.json")
Glimpse function will give you a snapshot of your dataset.glimpse(all_df)
## Observations: 384,150
## Variables: 18
## $ action <chr> "Violations were cited in the following ar…
## $ boro <chr> "QUEENS", "BROOKLYN", "QUEENS", "MANHATTAN…
## $ building <chr> "13110", "224", "205", "38", "204", "97597…
## $ camis <chr> "50084802", "50056581", "50017217", "41585…
## $ critical_flag <chr> "Critical", "Critical", "Critical", "Not C…
## $ cuisine_description <chr> "Indian", "Cajun", "American", "Thai", "Th…
## $ dba <chr> "NAMASTE", "THE GUMBO BROS", "SHERWOODS KE…
## $ grade <chr> "A", "A", NA, "A", "A", "A", NA, "A", "A",…
## $ grade_date <dttm> 2019-01-10, 2017-11-01, NA, 2018-02-15, 2…
## $ inspection_date <dttm> 2019-01-10, 2017-11-01, 2019-04-03, 2018-…
## $ inspection_type <chr> "Pre-permit (Operational) / Re-inspection"…
## $ phone <chr> "7186746780", "9179091471", "7183810400", …
## $ record_date <dttm> 2019-05-11 06:08:52, 2019-05-11 06:08:52,…
## $ score <chr> "13", "11", "14", "11", "11", "12", "42", …
## $ street <chr> "ROCKAWAY BLVD", "ATLANTIC AVE", "CYPRESS …
## $ violation_code <chr> "02G", "06D", "04C", "09C", "02G", "04H", …
## $ violation_description <chr> "Cold food item held above 41º F (smoked …
## $ zipcode <chr> "11420", "11201", "11385", "10004", "11201…
all_df$grade_date <- as.Date(all_df$grade_date, "%Y-%m-%d")
## Warning in as.POSIXlt.POSIXct(x, tz = tz): unknown timezone '%Y-%m-%d'
all_df$inspection_date <- as.Date(all_df$inspection_date, "%Y-%m-%d")
## Warning in as.POSIXlt.POSIXct(x, tz = tz): unknown timezone '%Y-%m-%d'
all_df$record_date <- as.Date(all_df$record_date, "%Y-%m-%d %h:%m:%s")
## Warning in as.POSIXlt.POSIXct(x, tz = tz): unknown timezone '%Y-%m-%d %h:
## %m:%s'
There were 27050 rows of data and we could only retrieve 5000 at a time, thus we had 6 csv files to contain all the data. These 6 csv files were uploaded to Github. For more detail, please go to this rpub:http://rpubs.com/weizhou2273/495496
yelp1 <- read.csv("https://raw.githubusercontent.com/weizhou2273/data607_final/master/yelp_data/yelp1_5000.csv")
yelp2 <- read.csv("https://raw.githubusercontent.com/weizhou2273/data607_final/master/yelp_data/yelp5001_10000.csv")
yelp3 <- read.csv("https://raw.githubusercontent.com/weizhou2273/data607_final/master/yelp_data/yelp10001_15000.csv")
yelp4 <- read.csv("https://raw.githubusercontent.com/weizhou2273/data607_final/master/yelp_data/yelp15001_19946.csv")
yelp5 <- read.csv("https://raw.githubusercontent.com/weizhou2273/data607_final/master/yelp_data/yelp19947_24946.csv")
yelp6 <- read.csv("https://raw.githubusercontent.com/weizhou2273/data607_final/master/yelp_data/yelp24947_27050.csv")
yelp_df <- as.data.frame(rbind(yelp1, yelp2, yelp3, yelp4, yelp5, yelp6))
glimpse(yelp_df)
## Observations: 27,050
## Variables: 9
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ camis_ls <int> 9435, 20808, 16467, 18587, 10407, 311, 15376, 23021…
## $ phone_ls <fct> +17184748038, +17188506221, +18442253369, +17183596…
## $ id <fct> ELB1RgCv39PBewQx5gmW7Q, lhfgFU9cQIquGaLpSi4r6g, xxd…
## $ rating <dbl> 3.0, 3.5, 4.0, 4.0, 2.5, 3.5, 2.0, 2.5, 4.0, 4.5, 3…
## $ review_count <int> 10, 9, 276, 197, 17, 111, 1, 12, 2375, 31, 440, 4, …
## $ lat <dbl> 40.58489, 40.69349, 40.74693, 40.76206, 40.66953, 4…
## $ lon <dbl> -73.81948, -73.82920, -73.98337, -73.81486, -73.782…
## $ price <fct> $, $$, $$, $$, $$, NA, $, $$, $, $$, $$, $$, $$$, $…
yelp_df$phone_ls <- as.character(yelp_df$phone_ls)
yelp_df$phone_ls <- str_replace_all(yelp_df$phone_ls, "^\\+1", "")
yelp_df <- yelp_df %>% rename(phone = phone_ls)
all_df$phone <- str_trim(all_df$phone, side = "both")
yelp_df$phone <- str_trim(yelp_df$phone, side = "both")
restaurant <- inner_join(all_df, yelp_df)
## Joining, by = "phone"
glimpse(restaurant)
## Observations: 457,999
## Variables: 26
## $ action <chr> "Violations were cited in the following ar…
## $ boro <chr> "QUEENS", "BROOKLYN", "QUEENS", "MANHATTAN…
## $ building <chr> "13110", "224", "205", "38", "204", "97597…
## $ camis <chr> "50084802", "50056581", "50017217", "41585…
## $ critical_flag <chr> "Critical", "Critical", "Critical", "Not C…
## $ cuisine_description <chr> "Indian", "Cajun", "American", "Thai", "Th…
## $ dba <chr> "NAMASTE", "THE GUMBO BROS", "SHERWOODS KE…
## $ grade <chr> "A", "A", NA, "A", "A", "A", NA, "A", "A",…
## $ grade_date <date> 2019-01-10, 2017-11-01, NA, 2018-02-15, 2…
## $ inspection_date <date> 2019-01-10, 2017-11-01, 2019-04-03, 2018-…
## $ inspection_type <chr> "Pre-permit (Operational) / Re-inspection"…
## $ phone <chr> "7186746780", "9179091471", "7183810400", …
## $ record_date <date> 2019-05-11, 2019-05-11, 2019-05-11, 2019-…
## $ score <chr> "13", "11", "14", "11", "11", "12", "42", …
## $ street <chr> "ROCKAWAY BLVD", "ATLANTIC AVE", "CYPRESS …
## $ violation_code <chr> "02G", "06D", "04C", "09C", "02G", "04H", …
## $ violation_description <chr> "Cold food item held above 41º F (smoked …
## $ zipcode <chr> "11420", "11201", "11385", "10004", "11201…
## $ X <int> 2645, 944, 3575, 365, 3135, 3351, 882, 162…
## $ camis_ls <int> 24651, 18187, 13772, 8364, 15898, 485, 109…
## $ id <fct> dG3_Ny7SO0w3vAah0U51RA, VUpzAbKlKblpu2Z1lm…
## $ rating <dbl> 4.0, 4.0, 1.5, 2.5, 3.5, 4.5, 4.0, 3.0, 3.…
## $ review_count <int> 56, 111, 5, 86, 32, 19, 27, 208, 81, 64, 5…
## $ lat <dbl> 40.64161, 40.77113, 40.71103, 40.57540, 40…
## $ lon <dbl> -74.00279, -73.73528, -73.79174, -73.96268…
## $ price <fct> $, $$, $$, $, $$, $$, $$, $$, $$, $$, $$, …
restaurant$grade <- as.factor(restaurant$grade)
restaurant$score <- as.numeric(restaurant$score)
table(restaurant$grade)
##
## A B C G Not Yet Graded
## 204295 25145 8473 5 3013
## P Z
## 2489 3613
204295/(204295+25145+8473+3013+2489+3613)
## [1] 0.8270115
Approximately 83% of restaurants received an inspection grade of A.
summary(restaurant$rating)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 3.500 3.518 4.000 5.000
The average rating is 3.518 and median rating is 3.5 on the scale from 1 to 5.
api_key = "Your Google Map API Key"
# register_google(key = api_key)
#
# nyc_map <- get_map(c(lon = -74.00, lat = 40.71),zoom=12)
#
# ggmap(nyc_map)+
# stat_density2d(data = restaurant, aes(x = lon, y = lat, fill = ..level.., alpha = ..level..),
# geom = "polygon", size = 0.01, bins = 16) +
# scale_fill_gradient(low = "blue", high = "red") +
# scale_alpha(range = c(0, 0.3), guide = FALSE)
Caption for the picture.
restaurant %>%
filter(grade == c("A", "B", "C")) %>%
ggplot(aes(x=grade, y=score, color=grade)) +
geom_boxplot()
## Warning in `==.default`(grade, c("A", "B", "C")): longer object length is
## not a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
Lower violation scores lead to better grade. From the boxplot we saw that there are a few outliners for grade
A restaurants which have high violation scores.
restaurant %>%
filter(grade == c("A", "B", "C")) %>%
ggplot(aes(x=grade, y=rating, color=grade)) +
geom_boxplot()
## Warning in `==.default`(grade, c("A", "B", "C")): longer object length is
## not a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
Interesting, Yelp ratings have almost identical averages and spreads among all grades.
critical VS not critical conditionsrestaurant %>%
filter(grade == c("A", "B", "C"), critical_flag == c("Critical", "Not Critical")) %>%
ggplot(aes(x=grade, y=rating, color=grade)) +
geom_jitter() +
facet_grid(.~critical_flag)
## Warning in `==.default`(grade, c("A", "B", "C")): longer object length is
## not a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in critical_flag == c("Critical", "Not Critical"): longer object
## length is not a multiple of shorter object length
Critical conditions doesn’t seem to affect the ratings of restaurants.
restaurant %>%
filter(grade == c("A", "B", "C"), !price == "NA") %>%
ggplot(aes(x=grade, y=price, color=grade)) +
geom_jitter()
## Warning in `==.default`(grade, c("A", "B", "C")): longer object length is
## not a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
Expensive restaurants seem to have better inspection grades.
Approximately 83% of restaurants received an inspection grade of A.
The average rating of a Yelp customer is 3.518 and median is 3.5 on the scale from 1 to 5.
Bad grades and critical conditions flagged by the Department of Health don’t seem to affect the ratings of restaurants. That being said, a Yelp customer’s review is not likely effected by the cleanliness and safety of a restaurant.