Beware Before You Take a Bite out of the Big Apple

An exploration of New York City restaurant inspection data

Jaclyn Janis

MPH 676, University of Southern Maine, Fall 2018

Purpose

Let it be known that I have never really loved New York City. I’ve never owned “I <3 NY” paraphernalia, I don’t dream of weekends out on the town, and Times Square makes me a tad nauseated. In fact, there was a period of time when every visit to NYC brought me a different physical ailment - I called it my “New York curse.” So this exploration of NYC restaurant inspection data is neither an ode to the city nor revenge for all my ailments of yesteryear. I chose this dataset because I discovered that NYC has a plethora of publicly available datasets that carry zip codes - and that is really great for assessing all kinds of information geographically. Best of all, everything seems to be downloadable as a .csv (I went down a rabbit hole trying to use CDC data that I just could not read into R - a quest for another day). NYC Open Data has a wealth of information, from baby names to rodent inspections to mortality. The datasets are well organized, and the interface is user-friendly. Maybe New York is the center of the universe.

In this assignment, I am specifically exploring restaurant inspections that resulted in critical violations. The dictionary for this dataset described critical violations as those that would “most likely contribute to foodborne illness.” One description of a critical violation reads, “Filth flies or food/refuse/sewage-associated (FRSA) flies present in facility’s food and/or non-food areas.” Gross. Critical violations seem like a significant measure for a Department of Health to monitor.

Preparing the Data

First, I downloaded the NYC zip code data, also from NYC Open Data, to make sure I could read in the spatial dataset without issue. I am suppressing this code to keep my API key private. In this step I loaded nearly all of my packages and then read in the spatial data with readOGR().

I checked that my zip code variable was, in fact, named “ZIPCODE.”

names(nyc_zip)
##  [1] "ZIPCODE"    "BLDGZIP"    "PO_NAME"    "POPULATION" "AREA"      
##  [6] "STATE"      "COUNTY"     "ST_FIPS"    "CTY_FIPS"   "URL"       
## [11] "SHAPE_AREA" "SHAPE_LEN"

I used the same code as in Unit 8 to ensure I had working spatial data for my zip codes.

nyc_zip <- spTransform(nyc_zip, CRS("+proj=longlat +datum=WGS84"))

leaflet() %>% 
  addProviderTiles("CartoDB.Positron") %>% 
  addPolygons(data = nyc_zip,
              popup = ~ZIPCODE)

I downloaded the DOHMH New York City Restaurant Inspection Results and got started creating my desired variable, the rate of restaurant inspection critical violations by zip code. I used fread() from data.table to read in the .csv because my other attempts read in only 22 of the 381,257 observations. The original dataset gave me a critical flag of “Critical,” “Not Critical,” or “Not Applicable” (based on the type of inspection performed).

library(data.table)
nyc_insp <- fread("nyc_inspection.csv")
head(nyc_insp)
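
A quick sanity check after any read is to compare the number of rows that came in against the number of data lines in the file. The sketch below does this in base R on a tiny made-up file (the rows and the temporary path are stand-ins, not the NYC data), and also shows one way to keep and then rename a column whose name contains a space. It is only an illustration; a file with line breaks inside quoted fields would need a more careful count.

```r
# Write a tiny stand-in CSV to a temporary file (made-up rows, not NYC data)
path <- tempfile(fileext = ".csv")
writeLines(c("ZIPCODE,CRITICAL FLAG",
             "10001,Critical",
             "10002,Not Critical",
             "11201,Critical"), path)

# check.names = FALSE keeps the header "CRITICAL FLAG" exactly as written
toy_insp <- read.csv(path, check.names = FALSE, stringsAsFactors = FALSE)

# One header line plus one line per record, so rows should equal lines - 1
expected_rows <- length(readLines(path)) - 1
rows_match <- nrow(toy_insp) == expected_rows

# Rename the column with a space so later code does not need backticks
names(toy_insp)[names(toy_insp) == "CRITICAL FLAG"] <- "critical_flag"
```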

I converted the information in “CRITICAL FLAG” to 1s and 0s for easy accounting. I created three summary variables by zip code: critical_rate (the number of critical flags divided by the number of inspection records, multiplied by 100 to be able to discuss percents, and truncated to a whole number by as.integer()), n (the number of inspection records), and n_critical (the number of records flagged as critical). I ran into several issues dealing with the original variable names that had spaces in them (flashback to the beginning of the semester). After some guidance from Prof. Suleiman, I renamed those variables and was then able to manipulate the data without issue.

# Keep the columns of interest, renaming those with spaces in their names
nyc_inspection <- nyc_insp %>%
  select(ZIPCODE,
         cuisine_desc = `CUISINE DESCRIPTION`,
         violation_desc = `VIOLATION DESCRIPTION`,
         critical_flag = `CRITICAL FLAG`) %>%
  mutate(critical = ifelse(critical_flag == "Critical", 1, 0)) %>%
  group_by(ZIPCODE) %>%
  summarize(critical_rate = as.integer(sum(critical) / n() * 100),
            n = n(),
            n_critical = sum(critical))
nyc_inspection
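
To make the arithmetic concrete, here is a small base-R sketch of the same per-zip-code calculation on made-up records (the zip codes and flags are invented for illustration). It uses tapply() rather than the dplyr pipeline above, and it shows two things worth keeping in mind: “Not Applicable” rows still count in the denominator, and as.integer() drops the decimal rather than rounding.

```r
# Made-up inspection records, not taken from the NYC dataset
toy <- data.frame(
  ZIPCODE = c("10001", "10001", "10001", "11201", "11201", "11201", "11201"),
  critical_flag = c("Critical", "Critical", "Not Critical",
                    "Critical", "Not Applicable", "Not Critical", "Critical"),
  stringsAsFactors = FALSE
)

# 1 for a critical flag, 0 for everything else (including "Not Applicable")
toy$critical <- ifelse(toy$critical_flag == "Critical", 1, 0)

# Per-zip-code totals: critical records and all records
n_critical <- tapply(toy$critical, toy$ZIPCODE, sum)
n <- tapply(toy$critical, toy$ZIPCODE, length)

toy_rates <- data.frame(
  ZIPCODE = names(n),
  n = as.vector(n),
  n_critical = as.vector(n_critical),
  # as.integer() truncates toward zero; round() is shown for comparison
  rate_truncated = as.integer(as.vector(n_critical) / as.vector(n) * 100),
  rate_rounded = round(as.vector(n_critical) / as.vector(n) * 100, 1),
  stringsAsFactors = FALSE
)
toy_rates
```

In this toy data, two critical records out of three come out as 66 after truncation, where rounding to one decimal would report 66.7.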

I merged my new data frame, nyc_inspection, with my spatial dataset to create insp_zip, and checked names() to make sure everything was there.

insp_zip <- merge(nyc_zip, nyc_inspection, by.x = 'ZIPCODE', by.y = 'ZIPCODE')
names(insp_zip)
##  [1] "ZIPCODE"       "BLDGZIP"       "PO_NAME"       "POPULATION"   
##  [5] "AREA"          "STATE"         "COUNTY"        "ST_FIPS"      
##  [9] "CTY_FIPS"      "URL"           "SHAPE_AREA"    "SHAPE_LEN"    
## [13] "critical_rate" "n"             "n_critical"

I took a look at the summary statistics so I knew what to expect on my choropleth. At a glance, roughly half of NYC restaurant inspections result in critical violations: the median rate across zip codes is 54% and the mean is about 52%.

summary(insp_zip$critical_rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   52.00   54.00   52.23   55.00   66.00      32
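
The 32 NA's in this summary are most likely zip code polygons that found no matching rows in the inspection summary, since merging onto spatial data typically keeps every polygon. A minimal sketch of that behavior with plain data frames (the zip codes, names, and rates are made up, and base merge() with all.x = TRUE stands in for the spatial merge):

```r
# Stand-in for the polygon attribute table (made-up values)
zips <- data.frame(
  ZIPCODE = c("10001", "10002", "10003"),
  PO_NAME = c("A", "B", "C"),
  stringsAsFactors = FALSE
)

# Stand-in for the per-zip-code inspection summary; "10002" has no inspections
rates <- data.frame(
  ZIPCODE = c("10001", "10003"),
  critical_rate = c(54L, 48L),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every row of zips; unmatched rows get NA
joined <- merge(zips, rates, by = "ZIPCODE", all.x = TRUE)

# Zip codes that came through the merge without a rate
unmatched <- joined$ZIPCODE[is.na(joined$critical_rate)]
unmatched
```

Listing the unmatched zip codes this way would help confirm whether they are areas with no restaurants or a sign of a mismatch in how the two files store zip codes.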

Exploring the Data

My first choropleth displays the percent of restaurant inspections with critical violations by zip code in New York City. As you explore the popups, notice that the difference in percents from the lightest to darkest red is not that large, as most of the data fall between 40% and 60%. The color palette is picking up pretty small differences, but you might consider how many inspections a large percentage corresponds to in the most densely populated parts of the city. I added the number of inspections and the number with critical violations to the popup so you can get a sense.

zip_popup <- paste("<strong>Zip code: </strong>",insp_zip$ZIPCODE,
                   "<br><strong>Neighborhood: </strong>", insp_zip$PO_NAME,
                   "<br><strong>Percent of Inspections with Critical Violations: </strong>", insp_zip$critical_rate,
                   "<br><strong>Number of Inspections with Critical Violations: </strong>", insp_zip$n_critical,
                   "<br><strong>Number of Inspections: </strong>", insp_zip$n)
pal <- colorQuantile("Reds", NULL, n = 4)

leaflet() %>%
  addProviderTiles("CartoDB.Positron") %>%
  addPolygons(data = insp_zip,
              fillColor = ~pal(critical_rate),
              color = "#BDBDC3", 
              weight = 2,
              fillOpacity = 1,
              popup=~zip_popup)
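
Because colorQuantile() with n = 4 assigns colors by quartile rather than by equal-width intervals, tightly clustered rates can end up in different colors. The base-R sketch below approximates that binning with quantile() and cut() on made-up rates; leaflet's own handling of bin edges may differ slightly, so treat it as an illustration of the idea only.

```r
# Made-up zip-code rates, clustered in the low-to-mid 50s like the real summary
rates <- c(0, 40, 50, 52, 53, 54, 54, 55, 56, 58, 60, 66)

# Quartile break points: 0%, 25%, 50%, 75%, 100%
breaks <- quantile(rates, probs = seq(0, 1, 0.25), na.rm = TRUE)

# Assign each rate to one of four bins (one bin per color)
bins <- cut(rates, breaks = breaks, include.lowest = TRUE)

# How many zip codes land in each color, and how wide each bin is
bin_counts <- table(bins)
bin_widths <- diff(unname(breaks))
bin_counts
bin_widths
```

With these toy values the two middle bins are each only 2.5 percentage points wide, while the outer bins stretch much further, which is why neighboring shades of red can represent nearly identical rates.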

Just out of curiosity, I went ahead and explored the raw number of critical violations by zip code. Sure enough, there seems to be a more apparent geographical pattern here, which probably corresponds to population and/or restaurant density with the highest numbers toward Manhattan (labeled “New York” on popup by default from the dataset) and into Brooklyn. I also just wanted to try out the color orange.

zip_popup2 <- paste("<strong>Zip code: </strong>",insp_zip$ZIPCODE,
                   "<br><strong>Neighborhood: </strong>", insp_zip$PO_NAME,
                   "<br><strong>Number of Inspections with Critical Violations: </strong>", insp_zip$n_critical,
                   "<br><strong>Percent of Inspections with Critical Violations: </strong>", insp_zip$critical_rate,
                   "<br><strong>Number of Inspections: </strong>", insp_zip$n)
pal2 <- colorQuantile("Oranges", NULL, n = 4)

leaflet() %>% 
  addProviderTiles("CartoDB.Positron") %>% 
  addPolygons(data = insp_zip,
              fillColor = ~pal2(n_critical),
              color = "#BDBDC3", 
              weight = 2,
              fillOpacity = 1,
              popup=~zip_popup2)

Discussion

These two choropleths demonstrate why it is important to consider rates of critical violations over raw numbers of critical violations, as many more inspections occur in the densest parts of the city. This assignment has given me even more appreciation for how much information can be communicated with one visualization, particularly via what appears on the popup in order to capture nuance in bigger geographic trends.
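
A toy comparison makes the point about counts versus rates. The two areas and their numbers below are invented for illustration: the busier area has far more critical violations in total, yet the quieter area has the higher rate.

```r
# Two hypothetical areas (made-up numbers, not from the NYC data)
toy <- data.frame(
  area = c("dense", "sparse"),
  n = c(5000, 200),           # inspection records
  n_critical = c(2600, 110),  # records with a critical violation
  stringsAsFactors = FALSE
)

# Percent of records with a critical violation
toy$rate <- round(toy$n_critical / toy$n * 100, 1)

# The ranking flips depending on which measure is mapped
most_violations <- toy$area[which.max(toy$n_critical)]
highest_rate <- toy$area[which.max(toy$rate)]
toy
```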

The above information may be useful to the Department of Health, as it indicates which neighborhoods require more resources to perform, handle, and follow up on restaurant inspections.

While these data don’t make me any more excited to visit NYC than I already was, I continue to be impressed by the transparency that NYC Open Data offers. It almost makes me say “I <3 NY.”
