Project 2: NYC Restaurants

Using this dataset provided by the Health Department of NYC here, my aim is to analyze the restaurants in New York City with the highest number of violations, and examine how different restaurant types and boroughs may affect these numbers.

Import data, select columns and filter rows

The raw file is available for download at the link above.

rest_raw <- read.csv(file = "restaurants.csv")

The columns selected for containing the most relevant information here are: the CAMIS (restaurant ID), the DBA (name), borough, cuisine (type of restaurant), and the violation code.

Next, because there are a fair number of rows without any violations listed at all, I removed any with empty strings in the column for violation code (this dataset does not appear to contain NAs).

NOTE: Having no listed violation code could apparently mean any number of things in the raw data. Some are expressly defined with “No violations were recorded at the time of this inspection.” Others are only noted (without any detail) as “Establishment re-opened by DOHMH.” But most have nothing in the ACTION or DESCRIPTION columns at all.

rest_raw <- rest_raw |>
  select(CAMIS, DBA, BORO, CUISINE.DESCRIPTION, VIOLATION.CODE) |>
  filter(VIOLATION.CODE != "")

Create a new table

To create a clean table for the count of violations, I first grouped by restaurant ID (CAMIS) to collapse each restaurant into 1 row. A column was added for the total number of violations; the VIOLATION.CODE column had to be removed here to ensure that each restaurant represented only one observation or row.

rest_violations <- rest_raw |>
  group_by(CAMIS, DBA, BORO, CUISINE.DESCRIPTION) |>
  summarize(VIOLATIONS = n())

Analysis

The data frame now includes all the restaurants in NYC with at least 1 violation listed by code, with columns for the cuisine type and borough. This allows for the data to be grouped by these variables for summary statistics.

For example, a simple analysis by borough:

  • for each borough, the total number of restaurants and the proportion of the city total
  • then the total number of violations and the proportion of the city total
  • and just for curiosity’s sake, the approx. number of violations per restaurant

As expected, the share of violations for each borough generally aligns with its share of restaurants.

# totals for the city
sum_rest <- nrow(rest_violations)
sum_viol <- sum(rest_violations$VIOLATIONS)

boro_total <- rest_violations |>
  group_by(BORO) |>
  summarize(num_rest = n(),
            pct_rest = round((num_rest / sum_rest) * 100, digits = 2),
            total_viol = sum(VIOLATIONS),
            pct_viol = round((total_viol / sum_viol) * 100, digits = 2),
            viol_per_rest = round(total_viol / num_rest))

knitr::kable(boro_total)
BORO num_rest pct_rest total_viol pct_viol viol_per_rest
Bronx 2316 8.69 22037 8.94 10
Brooklyn 6912 25.93 66684 27.05 10
Manhattan 10312 38.69 90422 36.68 9
Queens 6120 22.96 58392 23.69 10
Staten Island 995 3.73 8984 3.64 9

Here is a similar analysis by cuisine type:

  • for each cuisine type, the total number of restaurants and the proportion of the city total
  • then the total number of violations and the proportion of the city total
  • select only the top 10 by percent of violations for view here

In this case, though the percentage differences may not be significant, they can be seen as consistent: Chinese, Latin American, Mexican and Caribbean (cuisines with a perception of being more “ethnic”) received a higher share of violations than American, Italian, cafes and Japanese. These cuisines may be perceived as less “ethnic”, more expensive, etc.

cuisine_total <- rest_violations |>
  group_by(CUISINE.DESCRIPTION) |>
  summarize(num_rest = n(),
            pct_rest = round((num_rest / sum_rest) * 100, digits = 2),
            total_viol = sum(VIOLATIONS),
            pct_viol = round((total_viol / sum_viol) * 100, digits = 2),
            viol_per_rest = round(total_viol / num_rest)) |>
  slice_max(pct_viol, n = 10)

knitr::kable(cuisine_total)
CUISINE.DESCRIPTION num_rest pct_rest total_viol pct_viol viol_per_rest
American 5019 18.83 39903 16.19 8
Chinese 2202 8.26 23973 9.72 11
Coffee/Tea 2059 7.72 17251 7.00 8
Pizza 1562 5.86 15116 6.13 10
Latin American 883 3.31 10268 4.17 12
Mexican 1016 3.81 9960 4.04 10
Bakery Products/Desserts 880 3.30 9881 4.01 11
Caribbean 763 2.86 9097 3.69 12
Japanese 991 3.72 8381 3.40 8
Italian 980 3.68 7939 3.22 8

Conclusion

However, does any of this really mean that these cuisine types or boroughs are more likely to be non-compliant with health and safety? Or could there be a perception by the individual inspectors, the city health department, or even the dining public that certain types of restaurants deserve more or harsher inspections? Information for the city’s inspectors is not provided, which could have provided another dimension for analysis.