2015-2024 Food Inspections in NYC

Author

Bryan Argueta

Introduction

The data set provides information about food inspections in restaurants located within New York City. With this data set, I plan to investigate if there is a significant linear relationship between the month of the year and the inspection score. Additionally, I want to look for any patterns in the different letter grades given to restaurant from 2015 to 2024. I specifically want to look to see if there are any noticeable changes before, during, and after the pandemic. I chose this topic and data set because it is personal to me. I enjoy going out to different restaurants alongside either family or friends to try new foods. For that reason, it is important to me that family, friends, and myself are consuming food that has been properly prepared. Exploring this data set can help me to get a better insight about food safety.

Variables I will use:

- inspectiondate: The date that the inspection occurred in month, day, year format.

- score: The numerical score that the restaurant received. The lower the score the better.

- grade: The letter grade that the restaurant received. The highest possible score is an A.

Background Research

Restaurant inspections are a common thing in the U.S. The purpose of these inspections is generally to ensure that restaurants are following food and safety guidelines. These not only benefit the customers by providing them with safe places to eat, but also the restaurant itself to avoid harming their customers. To understand how it is these inspections achieve their goal, it is important to know what inspectors look for when conducting investigations. According to the National Restaurant Association, the areas of interests for inspectors include:

Food storage to avoid cross-contamination which is the leading cause for foodborne illnesses
Cooking times & temperatures to prevent harmful bacteria growth
Staff personal hygiene to reduce the spread of pathogens
Cleaning & Sanitation to further prevent harmful bacteria growth

Load libraries

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(RColorBrewer)
library(lubridate)
library(highcharter)

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo

Load the data

setwd("/Users/bryana/Documents/Data110/Datasets")
inspections <- read_csv("nycRestaurantInspections.csv")

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 223638 Columns: 27
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (18): DBA, BORO, BUILDING, STREET, PHONE, CUISINE DESCRIPTION, INSPECTIO...
dbl  (8): CAMIS, ZIPCODE, SCORE, Latitude, Longitude, Community Board, BIN, BBL
lgl  (1): Location Point1

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(inspections)

# A tibble: 6 × 27
     CAMIS DBA         BORO  BUILDING STREET ZIPCODE PHONE `CUISINE DESCRIPTION`
     <dbl> <chr>       <chr> <chr>    <chr>    <dbl> <chr> <chr>                
1 50140083 POPEYES LO… Quee… 3174     STEIN…   11103 3478… <NA>                 
2 50148769 NEW SUKI I… Manh… 1694     2 AVE…   10128 2123… <NA>                 
3 50077411 MISTER CHI… Quee… 4464     21ST …   11101 3473… Chicken              
4 50138208 ASIA RICE … Broo… 694      MANHA…   11222 6319… <NA>                 
5 50143838 FAIRFIELD … Quee… 14818    ARCHE…   11435 9177… <NA>                 
6 50106568 MARGARITAS… Quee… 2902     FRANC…   11358 5164… <NA>                 
# ℹ 19 more variables: `INSPECTION DATE` <chr>, ACTION <chr>,
#   `VIOLATION CODE` <chr>, `VIOLATION DESCRIPTION` <chr>,
#   `CRITICAL FLAG` <chr>, SCORE <dbl>, GRADE <chr>, `GRADE DATE` <chr>,
#   `RECORD DATE` <chr>, `INSPECTION TYPE` <chr>, Latitude <dbl>,
#   Longitude <dbl>, `Community Board` <dbl>, `Council District` <chr>,
#   `Census Tract` <chr>, BIN <dbl>, BBL <dbl>, NTA <chr>,
#   `Location Point1` <lgl>

Clean the data:

Lowercase and remove the spaces in the header names

names(inspections) <- tolower(names(inspections))
names(inspections) <- gsub(" ","",names(inspections))
head(inspections)

# A tibble: 6 × 27
     camis dba            boro  building street zipcode phone cuisinedescription
     <dbl> <chr>          <chr> <chr>    <chr>    <dbl> <chr> <chr>             
1 50140083 POPEYES LOUIS… Quee… 3174     STEIN…   11103 3478… <NA>              
2 50148769 NEW SUKI ICHI… Manh… 1694     2 AVE…   10128 2123… <NA>              
3 50077411 MISTER CHICKE… Quee… 4464     21ST …   11101 3473… Chicken           
4 50138208 ASIA RICE NOO… Broo… 694      MANHA…   11222 6319… <NA>              
5 50143838 FAIRFIELD INN… Quee… 14818    ARCHE…   11435 9177… <NA>              
6 50106568 MARGARITAS CA… Quee… 2902     FRANC…   11358 5164… <NA>              
# ℹ 19 more variables: inspectiondate <chr>, action <chr>, violationcode <chr>,
#   violationdescription <chr>, criticalflag <chr>, score <dbl>, grade <chr>,
#   gradedate <chr>, recorddate <chr>, inspectiontype <chr>, latitude <dbl>,
#   longitude <dbl>, communityboard <dbl>, councildistrict <chr>,
#   censustract <chr>, bin <dbl>, bbl <dbl>, nta <chr>, locationpoint1 <lgl>

Selecting only some headers

I want to focus on the restaurants, the inspection score and grade they received, also if they violated any regulations.

inspections2 <- inspections |>
  select(dba, inspectiondate, violationdescription, criticalflag, score, grade) |>
  group_by(dba, inspectiondate, violationdescription, criticalflag, score, grade)

head(inspections2)

# A tibble: 6 × 6
# Groups:   dba, inspectiondate, violationdescription, criticalflag, score,
#   grade [6]
  dba               inspectiondate violationdescription criticalflag score grade
  <chr>             <chr>          <chr>                <chr>        <dbl> <chr>
1 POPEYES LOUISIAN… 01/01/1900     <NA>                 Not Applica…    NA <NA> 
2 NEW SUKI ICHIRO   01/01/1900     <NA>                 Not Applica…    NA <NA> 
3 MISTER CHICKEN II 02/22/2022     Hot food item not h… Critical        32 <NA> 
4 ASIA RICE NOODLES 01/01/1900     <NA>                 Not Applica…    NA <NA> 
5 FAIRFIELD INN & … 01/01/1900     <NA>                 Not Applica…    NA <NA> 
6 MARGARITAS CAFE   01/01/1900     <NA>                 Not Applica…    NA <NA>

Removing certain values

The restaurants with the inspection date being 1/1/1900 need to be removed because that date means that they have not been inspected yet.

inspections_clean <- inspections2 |>
    filter(inspectiondate != "01/01/1900")

head(inspections_clean)

# A tibble: 6 × 6
# Groups:   dba, inspectiondate, violationdescription, criticalflag, score,
#   grade [6]
  dba               inspectiondate violationdescription criticalflag score grade
  <chr>             <chr>          <chr>                <chr>        <dbl> <chr>
1 MISTER CHICKEN II 02/22/2022     Hot food item not h… Critical        32 <NA> 
2 SOMTUM DER        12/06/2023     Thawing procedure i… Not Critical     7 A    
3 FU LAI O KITCHEN  01/11/2023     Thawing procedure i… Not Critical    11 A    
4 TACOS AL SUADERO  06/12/2023     Live roaches in fac… Critical        26 <NA> 
5 KRISPY KREME DOU… 04/14/2023     <NA>                 Not Applica…     0 Z    
6 LUCKY STAR        12/26/2023     Thawing procedure i… Not Critical    25 <NA>

Remove any NA’s found in grade & score. Additionally, only focus on grades A, B, C

I decided to exclude grade Z because it means the grade is pending and I similarly excluded grade p because it means that the grade is pending issued on re-opening following an initial inspection that resulted in a closure. Both are things that won’t be necessary for my investigations

inspections_clean2 <- inspections_clean |>
  filter(!is.na(grade)) |>
  filter(!is.na(score)) |>
  filter(grade == "A" | grade == "B" | grade == "C")

Linear regression analysis:

I want to check if the inspection score increase with the months of the year increasing to see if there’s any improvement in inspections as the year goes on.

Converting the date/time format to only months in separate column

inspections_clean2$month <- month(as.POSIXct(inspections_clean2$inspectiondate, format = "%m/%d/%Y"))

Renaming months and reordering them

inspections_clean2$month[inspections_clean2$month == 1]<- "January"
inspections_clean2$month[inspections_clean2$month == 2]<- "February"
inspections_clean2$month[inspections_clean2$month == 3]<- "March"
inspections_clean2$month[inspections_clean2$month == 4]<- "April"
inspections_clean2$month[inspections_clean2$month == 5]<- "May"
inspections_clean2$month[inspections_clean2$month == 6]<- "June"
inspections_clean2$month[inspections_clean2$month == 7]<- "July"
inspections_clean2$month[inspections_clean2$month == 8]<- "August"
inspections_clean2$month[inspections_clean2$month == 9]<- "September"
inspections_clean2$month[inspections_clean2$month == 10]<- "October"
inspections_clean2$month[inspections_clean2$month == 11]<- "November"
inspections_clean2$month[inspections_clean2$month == 12]<- "December"

# Reorder the months 
inspections_clean2$month<-factor(inspections_clean2$month, levels=c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))

Linear regression model using months of the year to predict the inspection score

model <- lm(score ~ month, data = inspections_clean2)

summary(model)


Call:
lm(formula = score ~ month, data = inspections_clean2)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.895  -4.924  -2.307   0.283  89.283 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    15.30675    0.09748 157.026  < 2e-16 ***
monthFebruary  -0.38234    0.13985  -2.734 0.006260 ** 
monthMarch     -2.58965    0.13472 -19.223  < 2e-16 ***
monthApril     -2.93347    0.15386 -19.065  < 2e-16 ***
monthMay       -3.07039    0.15810 -19.421  < 2e-16 ***
monthJune      -3.13884    0.16517 -19.004  < 2e-16 ***
monthJuly      -3.01402    0.17870 -16.866  < 2e-16 ***
monthAugust    -2.17961    0.14964 -14.565  < 2e-16 ***
monthSeptember -0.27857    0.15818  -1.761 0.078219 .  
monthOctober   -0.80046    0.16246  -4.927 8.36e-07 ***
monthNovember   0.58832    0.15380   3.825 0.000131 ***
monthDecember  -0.79529    0.15628  -5.089 3.61e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.36 on 96098 degrees of freedom
Multiple R-squared:  0.01593,   Adjusted R-squared:  0.01582 
F-statistic: 141.5 on 11 and 96098 DF,  p-value: < 2.2e-16

plot(model)

The equation for my model is: score = 15.30675 + (−0.38234) × February + (−2.58965) × March + (−2.93347) × April + (−3.07039) × May + (−3.13884) × June + (−3.01402) × July + (−2.17961) × August + (−0.27857) × September + (−0.80046) × October + (0.58832) × November + (−0.79529) × December + ϵ

Based on an adjusted R^2 value of 0.0158 it suggests that the relationship between month of the year and inspection score only accounts for 1.58% of the variation in the data.

Based on a p-value of < 2.2e-16 and using a significance level of 0.05, I can conclude that there is a significant linear relationship between the month of the year and the inspection score.

Counting how many of each grade there is by year

Converting the date/time format to only years in a separate column

inspections_clean2$year <- year(as.POSIXct(inspections_clean2$inspectiondate, format = "%m/%d/%Y"))

Count how many instances of each letter grade were in each year

yearly_grades <- inspections_clean2 %>%
  group_by(year, grade) %>%
  summarize(count = n()) %>%
  spread(key = grade, value = count, fill = 0)

`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

# Add the numbers together to figure out the total inspections per year
yearly_grades$total <- rowSums(yearly_grades[, c("A", "B", "C")])

Order the years

yearly_grades$year<-factor(yearly_grades$year, levels=c("2015", "2016", "2017", "2018", "2019", "2020", "2021", "2022", "2023", "2024"))

I want to create a line graph using either plotly or high charter. I want the graph to show the months on the x-axis, the number of grade on the y-axis. I then want 3 lines for the grades A, B, and C. I want you to use RColorBrewer set2 for the colors.

Data Visualization

hchart(yearly_grades, "line", hcaes(x = year)) |>
  hc_title(text = "Restaurant Inspection Grades Throughout The Years") |>
  hc_caption(text = "Data Provided By: Department of Health and Mental Hygiene") |>
  hc_yAxis(title = list(text = "Number of Grades")) |>
  hc_add_series(name = "Grade A", data = yearly_grades$A, color = brewer.pal(3, "Set2")[1], showInLegend = TRUE) |>
  hc_add_series(name = "Grade B", data = yearly_grades$B, color = brewer.pal(3, "Set2")[2], showInLegend = TRUE) |>
  hc_add_series(name = "Grade C", data = yearly_grades$C, color = brewer.pal(3, "Set2")[3], showInLegend = TRUE) |>
  hc_add_series(name = "Combined", data = yearly_grades$total, color = brewer.pal(3, "Set2")[4], showInLegend = TRUE)|>
  hc_xAxis(categories = yearly_grades$year, title = list(text = "Years")) |>
  hc_legend(enabled = TRUE) |>
  hc_add_theme(hc_theme_darkunica())

Visualization Comments

The visualization represents the differences in amounts of letter grades (A,B,C, or Combined) given through each year. The most noticeable part of the visualization is that there is significantly more restaurants that were given grade A than grades B or C, this is true even if you combine the two. This is a positive thing because it means that a large majority of restaurants follow the food safety guidelines. Another note worthy observation is that from 2020-2021 there was a spike in food inspections as a whole. This could be as a direct result of the pandemic as cleanliness was at the forefront of everyones mind. The biggest spike was from 2021-2022 and I believe this spike was bigger than the previous years spike because the rules around quarantine and social distancing were not as strict as before.

References

Here’s what health inspectors look for in your restaurant, and why. NRA. (n.d.). https://restaurant.org/education-and-resources/resource-library/here-is-what-health-inspectors-look-for-in-your-restaurant-and-why/