Proposal

Through this project, I intend to analyze the restaurants available in NYC to identify how they stack up in terms on quality and health safety. If time permits, I also want to check if there is any co-relation between the quality of these restaurants and their location/neighborhood real estate valuations.

Why I chose this data set

In the current circumstances where there is a public health emergency, restaurants play a keyrole in improving the public health by maintaining proper code and adhering to the standards set by the City. We will also be reviewing if these violations are in any way related to the neighborhood factors i.e., does being in a good neighborhood having high property valulations guarantees access to a better maintained restaurant.

Datasets

To identify this, I intend to use below two datasets,
1. Gives me the restaurant data and their violations
https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j
2. Gives the neighborhood property sales
https://data.cityofnewyork.us/City-Government/DOF-Summary-of-Neighborhood-Sales-by-Neighborhood-/5ebm-myj7

There are totally 402K observations recorded collecting data from 2015 to 2020, with 26 variables per observation.

restaurantRatings <- read.csv("data/Restaurant_ratings.csv",sep = ';', stringsAsFactors = F)
summary(restaurantRatings)
##     ï..CAMIS            DBA                BORO             BUILDING        
##  Min.   :30075445   Length:402061      Length:402061      Length:402061     
##  1st Qu.:41403222   Class :character   Class :character   Class :character  
##  Median :50008729   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :46211177                                                           
##  3rd Qu.:50059343                                                           
##  Max.   :50104547                                                           
##                                                                             
##     STREET            ZIPCODE             PHONE           CUISINE.DESCRIPTION
##  Length:402061      Length:402061      Length:402061      Length:402061      
##  Class :character   Class :character   Class :character   Class :character   
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character   
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##  INSPECTION.DATE       ACTION          VIOLATION.CODE     VIOLATION.DESCRIPTION
##  Length:402061      Length:402061      Length:402061      Length:402061        
##  Class :character   Class :character   Class :character   Class :character     
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character     
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##  CRITICAL.FLAG          SCORE           GRADE            GRADE.DATE       
##  Length:402061      Min.   : -1.00   Length:402061      Length:402061     
##  Class :character   1st Qu.: 11.00   Class :character   Class :character  
##  Mode  :character   Median : 15.00   Mode  :character   Mode  :character  
##                     Mean   : 20.41                                        
##                     3rd Qu.: 26.00                                        
##                     Max.   :164.00                                        
##                     NA's   :16759                                         
##  RECORD.DATE        INSPECTION.TYPE       Latitude       Longitude     
##  Length:402061      Length:402061      Min.   : 0.00   Min.   :-74.25  
##  Class :character   Class :character   1st Qu.:40.69   1st Qu.:-73.99  
##  Mode  :character   Mode  :character   Median :40.73   Median :-73.96  
##                                        Mean   :40.17   Mean   :-72.93  
##                                        3rd Qu.:40.76   3rd Qu.:-73.90  
##                                        Max.   :40.91   Max.   :  0.00  
##                                        NA's   :422     NA's   :422     
##  Community.Board Council.District  Census.Tract         BIN         
##  Min.   :101.0   Min.   : 1.00    Min.   :   100   Min.   :1000000  
##  1st Qu.:105.0   1st Qu.: 4.00    1st Qu.:  7800   1st Qu.:1042861  
##  Median :301.0   Median :20.00    Median : 16200   Median :3007498  
##  Mean   :248.4   Mean   :20.02    Mean   : 28746   Mean   :2507528  
##  3rd Qu.:401.0   3rd Qu.:34.00    3rd Qu.: 40300   3rd Qu.:4001558  
##  Max.   :595.0   Max.   :51.00    Max.   :162100   Max.   :5799501  
##  NA's   :5904    NA's   :5892     NA's   :5892     NA's   :7647     
##       BBL                NTA           
##  Min.   :1.000e+00   Length:402061     
##  1st Qu.:1.010e+09   Class :character  
##  Median :3.002e+09   Mode  :character  
##  Mean   :2.402e+09                     
##  3rd Qu.:4.002e+09                     
##  Max.   :5.270e+09                     
##  NA's   :422
#propertySales <- read.csv("data/Neighborhood_sales.csv",sep = ';', stringsAsFactors = F)
#str(propertySales)

Data clean and wrangling

This data is provided by DOHMH and contains the list of restaurants that were inspected from the year 2014. But primarily the data set supplies good quantitiy of data starting years 2015.
Since the data was acquired from different administrative systems, there were some missing data elements that required a clean up.
For example, some elements such as Boro, Grade were provided incorrectly or absent. And the data set included partial data for 2020, so we had to exclude it to get full year picture.

#Data Cleaning
restaurantRatings <- filter(restaurantRatings, (BORO != '0' & BORO != '') & (GRADE == 'A' | GRADE =='B' | GRADE =='C'))

#Data Wranging
restaurantRatings$hasViolation  = restaurantRatings$ACTION != 'No violations were recorded at the time of this inspection.'
restaurantRatings$inspection.year <- format(as.Date(restaurantRatings$INSPECTION.DATE, format="%m/%d/%Y"),"%Y")
restaurantRatings <- filter(restaurantRatings, (inspection.year > 2015 & inspection.year <2020))

restaurantRatings <- restaurantRatings %>% group_by(restaurantRatings$CUISINE.DESCRIPTION) %>% mutate(count=n())

Type of Cuisines in NYC

Every restaurant is registered with NYC Health Dept under a Cuisine. So here were are trying to understand the demograhy of restaurants in NYC. Below graph displays the top 25 cuisines in NYC

top25Cuisines <- unique(restaurantRatings[,c(8,30)])
top25Cuisines <- top25Cuisines[order(-top25Cuisines$count), ][c(1:25),]

ggplot(top25Cuisines, aes(x=reorder(CUISINE.DESCRIPTION, -count), y=count, label=count)) + geom_bar(stat="identity", position = position_dodge()) + theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

Using the variable earlier we created in data wrangling ‘inspection.year’ we are looking to find the number of observations made each year across NYC restaurants. This helps us understand how much focus has been emphasized on checking the food establishments.

Inspections by year in NYC

ggplot(data=restaurantRatings, aes(x=inspection.year)) +  geom_bar() + geom_text(stat='count', aes(label=..count..), vjust=-1)

Below chart explains the count of restaurants that had atleast one observation. This helps us understand totally how many restaurants are failing in getting a clean pass

Count of restaurants with violations by each YEAR

restaurantRatingsWithViolations <- filter(restaurantRatings, hasViolation=='TRUE')
restaurantRatingsWithViolations <- unique(restaurantRatingsWithViolations[,c(2,28)])
ggplot(restaurantRatingsWithViolations, aes(x=inspection.year)) + geom_bar() + geom_text(stat='count', aes(label=..count..), vjust=-1)

Whenever an observation is recorded against a restaurant, it’s current GRADE is also registered. So this chart explains what GRADE were the restaurants holding when they saw an observation

Violations by GRADE

violatingRest <- subset(restaurantRatings, restaurantRatings$hasViolation)
ggplot(violatingRest, aes(x=GRADE, fill=inspection.year)) + geom_bar(position = 'dodge')  

Now Let’s focus on recent data of 2019

df2019 <- filter(restaurantRatings, inspection.year=='2019')

GRADES by borough for 2019

Below chart displays the restaurants by GRADES and Boro. This helps us understand which borough has highest number of GRADED restaurants.

ggplot(df2019, aes(x=BORO, fill=GRADE)) + geom_bar(position = 'dodge') + theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

Worst restaurants in 2019 by violations

I tried to collect the worst restaurant/brand by observation recorded in 2019. And suprisingly many chain restaurants appeared in the top list. One possible reason could be due to the number of restaurants registered under same name. But this the count is quite high for a corporate managed restaurant.

df2019 <- df2019 %>% group_by(DBA) %>% mutate(violation.count=n())
worst25Cuisines <- unique(df2019[,c(2,31)])
worst25Cuisines <- worst25Cuisines[order(-worst25Cuisines$violation.count), ][c(1:25),]
worst25Cuisines <- worst25Cuisines[c(1:25),]

ggplot(worst25Cuisines, aes(x=reorder(DBA, -violation.count), y = violation.count, label=violation.count))+ geom_bar(stat="identity", position = position_dodge()) + theme(axis.text.x = element_text(angle = 45, hjust = 1)) +labs( x = "Restaurant", y = "Violation Count")

Word cloud for 2019 violation description

Since each restaurant received an observation description, I tried to collate them and find the most commonly occuring word using wordcloud package.

text <- restaurantRatings$VIOLATION.DESCRIPTION 
text=str_replace_all(text,"[^[:graph:]]", " ") 
text=str_replace_all(text,"andor", " ") 
docs <- Corpus(VectorSource(text))
docs <- docs %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
dtm <- TermDocumentMatrix(docs) 
matrix <- as.matrix(dtm) 
words <- sort(rowSums(matrix),decreasing=TRUE) 
words_df <- data.frame(word = names(words),freq=words)
set.seed(1234)
wordcloud(words = words_df$word, freq = words_df$freq, min.freq = 1, max.words=50, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))

Observations

  • Based on the analysis done so far we can arrive at below points
    • ‘American’ classified cusines are the most predominant type of Restaurants in NYC counting to about 40K, followed by Chinese
    • Restaurant inspections have been expanding widely in NYC, resulting in observations reaching a high of 72K in 2019
    • For the year 2019, Manhattan had the highest number of A graded restaurants in NYC
    • Dunkins branded restaurants are the top violators of NYC Health norms and have a total count of about 800, followed by Subway, Starbucks, McDonalds in 2019
    • The most commonly mentioned words in the violation report include flies, vermin, surface, properly etc which explains the kind of observations made at the audit