Using this dataset provided by the Health Department of NYC here, my aim is to analyze the restaurants in New York City with the highest number of violations, and examine how different restaurant types and boroughs may affect these numbers.
The raw file is available for download at the link above.
rest_raw <- read.csv(file = "restaurants.csv")
The columns selected for containing the most relevant information here are: the CAMIS (restaurant ID), the DBA (name), borough, cuisine (type of restaurant), and the violation code.
Next, because there are a fair number of rows without any violations
listed at all, I removed any with empty strings in the column for
violation code (this dataset does not appear to contain
NA
s).
NOTE: Having no listed violation code could apparently mean any
number of things in the raw data. Some are expressly defined with “No
violations were recorded at the time of this inspection.” Others are
only noted (without any detail) as “Establishment re-opened by DOHMH.”
But most have nothing in the ACTION
or
DESCRIPTION
columns at all.
rest_raw <- rest_raw |>
select(CAMIS, DBA, BORO, CUISINE.DESCRIPTION, VIOLATION.CODE) |>
filter(VIOLATION.CODE != "")
To create a clean table for the count of violations, I first grouped by restaurant ID (CAMIS) to collapse each restaurant into 1 row. A column was added for the total number of violations; the VIOLATION.CODE column had to be removed here to ensure that each restaurant represented only one observation or row.
rest_violations <- rest_raw |>
group_by(CAMIS, DBA, BORO, CUISINE.DESCRIPTION) |>
summarize(VIOLATIONS = n())
The data frame now includes all the restaurants in NYC with at least 1 violation listed by code, with columns for the cuisine type and borough. This allows for the data to be grouped by these variables for summary statistics.
For example, a simple analysis by borough:
As expected, the share of violations for each borough generally aligns with its share of restaurants.
# totals for the city
sum_rest <- nrow(rest_violations)
sum_viol <- sum(rest_violations$VIOLATIONS)
boro_total <- rest_violations |>
group_by(BORO) |>
summarize(num_rest = n(),
pct_rest = round((num_rest / sum_rest) * 100, digits = 2),
total_viol = sum(VIOLATIONS),
pct_viol = round((total_viol / sum_viol) * 100, digits = 2),
viol_per_rest = round(total_viol / num_rest))
knitr::kable(boro_total)
BORO | num_rest | pct_rest | total_viol | pct_viol | viol_per_rest |
---|---|---|---|---|---|
Bronx | 2316 | 8.69 | 22037 | 8.94 | 10 |
Brooklyn | 6912 | 25.93 | 66684 | 27.05 | 10 |
Manhattan | 10312 | 38.69 | 90422 | 36.68 | 9 |
Queens | 6120 | 22.96 | 58392 | 23.69 | 10 |
Staten Island | 995 | 3.73 | 8984 | 3.64 | 9 |
Here is a similar analysis by cuisine type:
In this case, though the percentage differences may not be significant, they can be seen as consistent: Chinese, Latin American, Mexican and Caribbean (cuisines with a perception of being more “ethnic”) received a higher share of violations than American, Italian, cafes and Japanese. These cuisines may be perceived as less “ethnic”, more expensive, etc.
cuisine_total <- rest_violations |>
group_by(CUISINE.DESCRIPTION) |>
summarize(num_rest = n(),
pct_rest = round((num_rest / sum_rest) * 100, digits = 2),
total_viol = sum(VIOLATIONS),
pct_viol = round((total_viol / sum_viol) * 100, digits = 2),
viol_per_rest = round(total_viol / num_rest)) |>
slice_max(pct_viol, n = 10)
knitr::kable(cuisine_total)
CUISINE.DESCRIPTION | num_rest | pct_rest | total_viol | pct_viol | viol_per_rest |
---|---|---|---|---|---|
American | 5019 | 18.83 | 39903 | 16.19 | 8 |
Chinese | 2202 | 8.26 | 23973 | 9.72 | 11 |
Coffee/Tea | 2059 | 7.72 | 17251 | 7.00 | 8 |
Pizza | 1562 | 5.86 | 15116 | 6.13 | 10 |
Latin American | 883 | 3.31 | 10268 | 4.17 | 12 |
Mexican | 1016 | 3.81 | 9960 | 4.04 | 10 |
Bakery Products/Desserts | 880 | 3.30 | 9881 | 4.01 | 11 |
Caribbean | 763 | 2.86 | 9097 | 3.69 | 12 |
Japanese | 991 | 3.72 | 8381 | 3.40 | 8 |
Italian | 980 | 3.68 | 7939 | 3.22 | 8 |
However, does any of this really mean that these cuisine types or boroughs are more likely to be non-compliant with health and safety? Or could there be a perception by the individual inspectors, the city health department, or even the dining public that certain types of restaurants deserve more or harsher inspections? Information for the city’s inspectors is not provided, which could have provided another dimension for analysis.