Project 2: NYC Restaurants

Using this dataset provided by the Health Department of NYC here, my aim is to analyze the restaurants in New York City with the highest number of violations, and examine how different restaurant types and boroughs may affect these numbers.

Import data, select columns and filter rows

The raw file is available for download at the link above.

rest_raw <- read.csv(file = "restaurants.csv")

The columns selected for containing the most relevant information here are: the CAMIS (restaurant ID), the DBA (name), borough, cuisine (type of restaurant), and the violation code.

Next, because there are a fair number of rows without any violations listed at all, I removed any with empty strings in the column for violation code (this dataset does not appear to contain NAs).

NOTE: Having no listed violation code could apparently mean any number of things in the raw data. Some are expressly defined with “No violations were recorded at the time of this inspection.” Others are only noted (without any detail) as “Establishment re-opened by DOHMH.” But most have nothing in the ACTION or DESCRIPTION columns at all.

rest_raw <- rest_raw |>
  select(CAMIS, DBA, BORO, CUISINE.DESCRIPTION, VIOLATION.CODE) |>
  filter(VIOLATION.CODE != "")

Create a new table

To create a clean table for the count of violations, I first grouped by restaurant ID (CAMIS) to collapse each restaurant into 1 row. A column was added for the total number of violations; the VIOLATION.CODE column had to be removed here to ensure that each restaurant represented only one observation or row.

rest_violations <- rest_raw |>
  group_by(CAMIS, DBA, BORO, CUISINE.DESCRIPTION) |>
  summarize(VIOLATIONS = n())

Analysis

The data frame now includes all the restaurants in NYC with at least 1 violation listed by code, with columns for the cuisine type and borough. This allows for the data to be grouped by these variables for summary statistics.

For example, a simple analysis by borough:

for each borough, the total number of restaurants and the proportion of the city total
then the total number of violations and the proportion of the city total
and just for curiosity’s sake, the approx. number of violations per restaurant

As expected, the share of violations for each borough generally aligns with its share of restaurants.

# totals for the city
sum_rest <- nrow(rest_violations)
sum_viol <- sum(rest_violations$VIOLATIONS)

boro_total <- rest_violations |>
  group_by(BORO) |>
  summarize(num_rest = n(),
            pct_rest = round((num_rest / sum_rest) * 100, digits = 2),
            total_viol = sum(VIOLATIONS),
            pct_viol = round((total_viol / sum_viol) * 100, digits = 2),
            viol_per_rest = round(total_viol / num_rest))

knitr::kable(boro_total)

BORO	num_rest	pct_rest	total_viol	pct_viol	viol_per_rest
Bronx	2316	8.69	22037	8.94	10
Brooklyn	6912	25.93	66684	27.05	10
Manhattan	10312	38.69	90422	36.68	9
Queens	6120	22.96	58392	23.69	10
Staten Island	995	3.73	8984	3.64	9

Here is a similar analysis by cuisine type:

for each cuisine type, the total number of restaurants and the proportion of the city total
then the total number of violations and the proportion of the city total
select only the top 10 by percent of violations for view here

In this case, though the percentage differences may not be significant, they can be seen as consistent: Chinese, Latin American, Mexican and Caribbean (cuisines with a perception of being more “ethnic”) received a higher share of violations than American, Italian, cafes and Japanese. These cuisines may be perceived as less “ethnic”, more expensive, etc.

cuisine_total <- rest_violations |>
  group_by(CUISINE.DESCRIPTION) |>
  summarize(num_rest = n(),
            pct_rest = round((num_rest / sum_rest) * 100, digits = 2),
            total_viol = sum(VIOLATIONS),
            pct_viol = round((total_viol / sum_viol) * 100, digits = 2),
            viol_per_rest = round(total_viol / num_rest)) |>
  slice_max(pct_viol, n = 10)

knitr::kable(cuisine_total)

CUISINE.DESCRIPTION	num_rest	pct_rest	total_viol	pct_viol	viol_per_rest
American	5019	18.83	39903	16.19	8
Chinese	2202	8.26	23973	9.72	11
Coffee/Tea	2059	7.72	17251	7.00	8
Pizza	1562	5.86	15116	6.13	10
Latin American	883	3.31	10268	4.17	12
Mexican	1016	3.81	9960	4.04	10
Bakery Products/Desserts	880	3.30	9881	4.01	11
Caribbean	763	2.86	9097	3.69	12
Japanese	991	3.72	8381	3.40	8
Italian	980	3.68	7939	3.22	8

Conclusion

However, does any of this really mean that these cuisine types or boroughs are more likely to be non-compliant with health and safety? Or could there be a perception by the individual inspectors, the city health department, or even the dining public that certain types of restaurants deserve more or harsher inspections? Information for the city’s inspectors is not provided, which could have provided another dimension for analysis.

Project 2: NYC Restaurants

Stephanie Chiang

2024-10-07