In the final project, we wanted to explore the relationship between customer review and sanitary rating. We obtained our data from two sources: web page and CSV by utilizing Yelp API for customer reviews and NYC Open Data for inspection scores and grades with location info. We were interested to find out if cleanliness and safety could potentially affect customers’ reviews.


Reference: http://rpubs.com/jasano/inspex Score instruction: https://www1.nyc.gov/assets/doh/downloads/pdf/rii/how-we-score-grade.pdf blog: https://nycdatascience.com/blog/student-works/nyc-restaurants-reviews-and-inspection-scores/


Acquiring resturant data from NYC Open Data using API

library(RSocrata)
## Warning: package 'RSocrata' was built under R version 3.5.2
library(data.table)
## Warning: package 'data.table' was built under R version 3.5.2
library(httr)
library(keyring)
library(rjson)
library(RJSONIO)
## 
## Attaching package: 'RJSONIO'
## The following objects are masked from 'package:rjson':
## 
##     fromJSON, toJSON
library(jsonlite)
## 
## Attaching package: 'jsonlite'
## The following objects are masked from 'package:RJSONIO':
## 
##     fromJSON, toJSON
## The following objects are masked from 'package:rjson':
## 
##     fromJSON, toJSON
library(curl)
## Warning: package 'curl' was built under R version 3.5.2
## 
## Attaching package: 'curl'
## The following object is masked from 'package:httr':
## 
##     handle_reset
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.3.2
## ✔ tibble  2.1.1     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.3.0     ✔ forcats 0.3.0
## Warning: package 'tibble' was built under R version 3.5.2
## Warning: package 'purrr' was built under R version 3.5.2
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between()     masks data.table::between()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ dplyr::first()       masks data.table::first()
## ✖ purrr::flatten()     masks jsonlite::flatten()
## ✖ jsonlite::fromJSON() masks RJSONIO::fromJSON(), rjson::fromJSON()
## ✖ curl::handle_reset() masks httr::handle_reset()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ dplyr::last()        masks data.table::last()
## ✖ readr::parse_date()  masks curl::parse_date()
## ✖ jsonlite::toJSON()   masks RJSONIO::toJSON(), rjson::toJSON()
## ✖ purrr::transpose()   masks data.table::transpose()
library(yelpr)
library(stringr)
library(ggmap)
## Warning: package 'ggmap' was built under R version 3.5.2
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.

Read in data

Acquire Data using NYC open data API

all_df<- read.socrata("https://data.cityofnewyork.us/resource/43nn-pn8j.json")

It is always good to know what your data looks like. Using Glimpse function will give you a snapshot of your dataset.

glimpse(all_df)
## Observations: 384,150
## Variables: 18
## $ action                <chr> "Violations were cited in the following ar…
## $ boro                  <chr> "QUEENS", "BROOKLYN", "QUEENS", "MANHATTAN…
## $ building              <chr> "13110", "224", "205", "38", "204", "97597…
## $ camis                 <chr> "50084802", "50056581", "50017217", "41585…
## $ critical_flag         <chr> "Critical", "Critical", "Critical", "Not C…
## $ cuisine_description   <chr> "Indian", "Cajun", "American", "Thai", "Th…
## $ dba                   <chr> "NAMASTE", "THE GUMBO BROS", "SHERWOODS KE…
## $ grade                 <chr> "A", "A", NA, "A", "A", "A", NA, "A", "A",…
## $ grade_date            <dttm> 2019-01-10, 2017-11-01, NA, 2018-02-15, 2…
## $ inspection_date       <dttm> 2019-01-10, 2017-11-01, 2019-04-03, 2018-…
## $ inspection_type       <chr> "Pre-permit (Operational) / Re-inspection"…
## $ phone                 <chr> "7186746780", "9179091471", "7183810400", …
## $ record_date           <dttm> 2019-05-11 06:08:52, 2019-05-11 06:08:52,…
## $ score                 <chr> "13", "11", "14", "11", "11", "12", "42", …
## $ street                <chr> "ROCKAWAY BLVD", "ATLANTIC AVE", "CYPRESS …
## $ violation_code        <chr> "02G", "06D", "04C", "09C", "02G", "04H", …
## $ violation_description <chr> "Cold food item held above 41º F (smoked …
## $ zipcode               <chr> "11420", "11201", "11385", "10004", "11201…

Tidy up data

Convert date information to date class

all_df$grade_date <- as.Date(all_df$grade_date, "%Y-%m-%d")
## Warning in as.POSIXlt.POSIXct(x, tz = tz): unknown timezone '%Y-%m-%d'
all_df$inspection_date <- as.Date(all_df$inspection_date, "%Y-%m-%d")
## Warning in as.POSIXlt.POSIXct(x, tz = tz): unknown timezone '%Y-%m-%d'
all_df$record_date <- as.Date(all_df$record_date, "%Y-%m-%d %h:%m:%s")
## Warning in as.POSIXlt.POSIXct(x, tz = tz): unknown timezone '%Y-%m-%d %h:
## %m:%s'

Yelp Fusion API

Acquire Yelp business data.

There were 27050 rows of data and we could only retrieve 5000 at a time, thus we had 6 csv files to contain all the data. These 6 csv files were uploaded to Github. For more detail, please go to this rpub:http://rpubs.com/weizhou2273/495496

Read csv files from github

yelp1 <- read.csv("https://raw.githubusercontent.com/weizhou2273/data607_final/master/yelp_data/yelp1_5000.csv")
yelp2 <- read.csv("https://raw.githubusercontent.com/weizhou2273/data607_final/master/yelp_data/yelp5001_10000.csv")
yelp3 <- read.csv("https://raw.githubusercontent.com/weizhou2273/data607_final/master/yelp_data/yelp10001_15000.csv")
yelp4 <- read.csv("https://raw.githubusercontent.com/weizhou2273/data607_final/master/yelp_data/yelp15001_19946.csv")
yelp5 <- read.csv("https://raw.githubusercontent.com/weizhou2273/data607_final/master/yelp_data/yelp19947_24946.csv")
yelp6 <- read.csv("https://raw.githubusercontent.com/weizhou2273/data607_final/master/yelp_data/yelp24947_27050.csv")

Combine all yelp dataframes as a single dataframe

yelp_df <- as.data.frame(rbind(yelp1, yelp2, yelp3, yelp4, yelp5, yelp6))
glimpse(yelp_df)
## Observations: 27,050
## Variables: 9
## $ X            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ camis_ls     <int> 9435, 20808, 16467, 18587, 10407, 311, 15376, 23021…
## $ phone_ls     <fct> +17184748038, +17188506221, +18442253369, +17183596…
## $ id           <fct> ELB1RgCv39PBewQx5gmW7Q, lhfgFU9cQIquGaLpSi4r6g, xxd…
## $ rating       <dbl> 3.0, 3.5, 4.0, 4.0, 2.5, 3.5, 2.0, 2.5, 4.0, 4.5, 3…
## $ review_count <int> 10, 9, 276, 197, 17, 111, 1, 12, 2375, 31, 440, 4, …
## $ lat          <dbl> 40.58489, 40.69349, 40.74693, 40.76206, 40.66953, 4…
## $ lon          <dbl> -73.81948, -73.82920, -73.98337, -73.81486, -73.782…
## $ price        <fct> $, $$, $$, $$, $$, NA, $, $$, $, $$, $$, $$, $$$, $…
yelp_df$phone_ls <- as.character(yelp_df$phone_ls)
yelp_df$phone_ls <- str_replace_all(yelp_df$phone_ls, "^\\+1", "")
yelp_df <- yelp_df %>% rename(phone = phone_ls)
all_df$phone <- str_trim(all_df$phone, side = "both")
yelp_df$phone <- str_trim(yelp_df$phone, side = "both")

Merge data from two sources into one dataframe

restaurant <- inner_join(all_df, yelp_df)
## Joining, by = "phone"
glimpse(restaurant)
## Observations: 457,999
## Variables: 26
## $ action                <chr> "Violations were cited in the following ar…
## $ boro                  <chr> "QUEENS", "BROOKLYN", "QUEENS", "MANHATTAN…
## $ building              <chr> "13110", "224", "205", "38", "204", "97597…
## $ camis                 <chr> "50084802", "50056581", "50017217", "41585…
## $ critical_flag         <chr> "Critical", "Critical", "Critical", "Not C…
## $ cuisine_description   <chr> "Indian", "Cajun", "American", "Thai", "Th…
## $ dba                   <chr> "NAMASTE", "THE GUMBO BROS", "SHERWOODS KE…
## $ grade                 <chr> "A", "A", NA, "A", "A", "A", NA, "A", "A",…
## $ grade_date            <date> 2019-01-10, 2017-11-01, NA, 2018-02-15, 2…
## $ inspection_date       <date> 2019-01-10, 2017-11-01, 2019-04-03, 2018-…
## $ inspection_type       <chr> "Pre-permit (Operational) / Re-inspection"…
## $ phone                 <chr> "7186746780", "9179091471", "7183810400", …
## $ record_date           <date> 2019-05-11, 2019-05-11, 2019-05-11, 2019-…
## $ score                 <chr> "13", "11", "14", "11", "11", "12", "42", …
## $ street                <chr> "ROCKAWAY BLVD", "ATLANTIC AVE", "CYPRESS …
## $ violation_code        <chr> "02G", "06D", "04C", "09C", "02G", "04H", …
## $ violation_description <chr> "Cold food item held above 41º F (smoked …
## $ zipcode               <chr> "11420", "11201", "11385", "10004", "11201…
## $ X                     <int> 2645, 944, 3575, 365, 3135, 3351, 882, 162…
## $ camis_ls              <int> 24651, 18187, 13772, 8364, 15898, 485, 109…
## $ id                    <fct> dG3_Ny7SO0w3vAah0U51RA, VUpzAbKlKblpu2Z1lm…
## $ rating                <dbl> 4.0, 4.0, 1.5, 2.5, 3.5, 4.5, 4.0, 3.0, 3.…
## $ review_count          <int> 56, 111, 5, 86, 32, 19, 27, 208, 81, 64, 5…
## $ lat                   <dbl> 40.64161, 40.77113, 40.71103, 40.57540, 40…
## $ lon                   <dbl> -74.00279, -73.73528, -73.79174, -73.96268…
## $ price                 <fct> $, $$, $$, $, $$, $$, $$, $$, $$, $$, $$, …
restaurant$grade <- as.factor(restaurant$grade)
restaurant$score <- as.numeric(restaurant$score)

Inspection grades distribution

table(restaurant$grade)
## 
##              A              B              C              G Not Yet Graded 
##         204295          25145           8473              5           3013 
##              P              Z 
##           2489           3613
204295/(204295+25145+8473+3013+2489+3613)
## [1] 0.8270115

Approximately 83% of restaurants received an inspection grade of A.

Statistical summary of Yelp customer rating

summary(restaurant$rating)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   3.500   3.518   4.000   5.000

The average rating is 3.518 and median rating is 3.5 on the scale from 1 to 5.

Visualization

Distribution of restaurant.

Heatmap of Restaurant location: Manhattan Middle town and downtown has the highest density of restaurant.
api_key = "Your Google Map API Key"
# register_google(key = api_key)
# 
# nyc_map <- get_map(c(lon = -74.00, lat = 40.71),zoom=12)
# 
# ggmap(nyc_map)+
# stat_density2d(data = restaurant, aes(x = lon, y = lat, fill = ..level.., alpha = ..level..),
#                geom = "polygon", size = 0.01, bins = 16) +
# scale_fill_gradient(low = "blue", high = "red") +
# scale_alpha(range = c(0, 0.3), guide = FALSE)
Caption for the picture.

Caption for the picture.

Violation score distribution by Inspection grades

restaurant %>% 
  filter(grade == c("A", "B", "C")) %>%
  ggplot(aes(x=grade, y=score, color=grade)) + 
    geom_boxplot()
## Warning in `==.default`(grade, c("A", "B", "C")): longer object length is
## not a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length

Lower violation scores lead to better grade. From the boxplot we saw that there are a few outliners for grade A restaurants which have high violation scores.

Yelp review rating distribution by Inspection grades

restaurant %>% 
  filter(grade == c("A", "B", "C")) %>%
  ggplot(aes(x=grade, y=rating, color=grade)) + 
    geom_boxplot()
## Warning in `==.default`(grade, c("A", "B", "C")): longer object length is
## not a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length

Interesting, Yelp ratings have almost identical averages and spreads among all grades.

Yelp rating: critical VS not critical conditions

restaurant %>% 
  filter(grade == c("A", "B", "C"), critical_flag == c("Critical", "Not Critical")) %>% 
  ggplot(aes(x=grade, y=rating, color=grade)) + 
  geom_jitter() + 
  facet_grid(.~critical_flag)
## Warning in `==.default`(grade, c("A", "B", "C")): longer object length is
## not a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in critical_flag == c("Critical", "Not Critical"): longer object
## length is not a multiple of shorter object length

Critical conditions doesn’t seem to affect the ratings of restaurants.

Price VS Inspection grade

restaurant %>% 
  filter(grade == c("A", "B", "C"), !price == "NA") %>% 
  ggplot(aes(x=grade, y=price, color=grade)) + 
  geom_jitter()
## Warning in `==.default`(grade, c("A", "B", "C")): longer object length is
## not a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length

Expensive restaurants seem to have better inspection grades.


Conclusions

  • Approximately 83% of restaurants received an inspection grade of A.

  • The average rating of a Yelp customer is 3.518 and median is 3.5 on the scale from 1 to 5.

  • Bad grades and critical conditions flagged by the Department of Health don’t seem to affect the ratings of restaurants. That being said, a Yelp customer’s review is not likely effected by the cleanliness and safety of a restaurant.