1 - Introduction

I have been curious about the restaurant inspections that lead to the letter grades posted on the doors of these establishments, so this analysis aims to determine whether there’s any relation or bias between the grade and the cuisine description. Since the process run by the Department of Health is a very technical one, it seems there should be no connection between the variables, but we are going to look at the data to see what it says.

I care about this because the grading system is a helpful tool for consumers, but I don’t understand how it works, and this exercise aims to change that.

I think others should care too, because this gives an idea of how the lettering system works and helps them stay informed as well.

According to the Blue Book [http://www1.nyc.gov/assets/doh/downloads/pdf/rii/blue-book.pdf], the process is very technical and every restaurant can earn an A, but certain procedures such as leavening dough, fermenting, dry-ageing beef or marinating foods, which are typical of certain cuisines, might be considered violations. Taking the Cuisine Description as the explanatory variable could help to clarify that.

2 - Data

There are 390000+ inspection results in a data set provided by the city of New York.

I divided those records into 4 datasets: those with a grade of A (equivalent to 0-13 points), those with a grade of B (14-26 points), those with a grade of C (27 points or higher) and those with no letter assigned or pending review. After that we will randomize and take N = 1000 cases from each in order to have equal size groups.
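The grade cutoffs above can be sketched in R as a small lookup. This is only an illustration: `score_to_grade` is a hypothetical helper that is not part of the original analysis, and it assumes that every score of 13 or lower (including the occasional negative score) counts as an A.

```r
# Map inspection scores to letter grades using the cutoffs described above.
# `score_to_grade` is a hypothetical helper, not part of the original analysis.
score_to_grade <- function(score) {
  # Intervals are right-closed: (-Inf, 13] -> A, (13, 26] -> B, (26, Inf] -> C
  cut(score,
      breaks = c(-Inf, 13, 26, Inf),
      labels = c("A", "B", "C"))
}

# Made-up scores on both sides of each cutoff; NA stays NA.
example_scores <- c(0, 13, 14, 26, 27, 45, NA)
as.character(score_to_grade(example_scores))
```

Records with no letter assigned or pending review cannot be recovered from the score alone, so they would still need to be split off using the GRADE column.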

Data collection: The data was collected by Department of Health and Mental Hygiene (DOHMH) inspectors from 2013 to the present, and it was sourced from NYC Open Data.

Cases: Every case represents an entry for a restaurant made by an inspector. This can be a violation, a note that indicates no violation, or an establishment closure or re-opening. A case can also carry a grade, and a restaurant can have multiple cases per day.

Variables: The grade of the inspection, shown as GRADE, is a categorical variable and will act as the response variable.

The Cuisine Description will be the explanatory variable; it is a categorical variable as well.

Type of study: This is an observational study, as the data was already collected and published. It is not captured randomly, as every restaurant owner knows they will face an inspection every year.

Scope of inference: The randomization after breaking up the groups, plus the group size of 1000 cases, which is less than 10% of the population of restaurants in NYC, will allow us to do inference on the population of restaurants.

3 - Exploratory data analysis

a_5000 <- inspections_a[sample(nrow(inspections_a), 5000),]
b_5000 <- inspections_b[sample(nrow(inspections_b), 5000),]
c_5000 <- inspections_c[sample(nrow(inspections_c), 5000),]
other_5000 <- inspections_other[sample(nrow(inspections_other), 5000),]

hist(all_inspections$SCORE, breaks=100)

hist(all_inspections[all_inspections$SCORE < 30,]$SCORE, breaks=100, xlab="score", main="Frequency of score > all entries")

> The bulk of the scores falls between 0 and 20, and the grades of the majority of restaurants are A or B, which speaks to the effort made by inspectors and restaurants to deliver a system that pushes for better quality in the New York restaurant industry. Note that there’s a sharp drop after 13, which is the cutoff between an A and a B grade.

summary(all_inspections$SCORE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   -2.00   11.00   14.00   18.92   24.00  151.00   22175

In order to reduce outliers I will filter out scores of 30 or higher, because for practical purposes such scores are already well into C territory, and in that case the restaurant owner can opt to post a ‘Pending’ sign instead of the letter.

summary(all_inspections[all_inspections$SCORE < 30,]$SCORE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   -2.00   10.00   13.00   14.53   20.00   29.00   22175
# One subset per cuisine of interest
all_american <- subset(all_inspections, CUISINE.DESCRIPTION == 'American')
all_chinese <- subset(all_inspections, CUISINE.DESCRIPTION == 'Chinese')
all_italian <- subset(all_inspections, CUISINE.DESCRIPTION == 'Italian')
all_mexican <- subset(all_inspections, CUISINE.DESCRIPTION == 'Mexican')

par(mfrow=c(2,2))
plot(all_american$GRADE, main = "American")
plot(all_chinese$GRADE, main = "Chinese")
plot(all_italian$GRADE, main = "Italian")
plot(all_mexican$GRADE, main = "Mexican")

Apart from the large number of B grades relative to A grades for Chinese restaurants, there’s nothing remarkable about these plots, which are all very similar. The big bar on the left is for blank grades, which occur when multiple violations are issued in one inspection, so only one of the violation entries holds the grade.

The most common violations are the same across different cuisines:
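Because the four cuisines have very different numbers of records, comparing raw bar heights can be misleading. A minimal sketch of one way to put them on the same scale is to convert counts into within-cuisine shares. The data frame `toy` below is made up for illustration and only borrows the column names from the inspection data.

```r
# Toy data standing in for the real inspection records (values are made up).
toy <- data.frame(
  CUISINE.DESCRIPTION = c("American", "American", "American", "American",
                          "Chinese", "Chinese", "Chinese", "Chinese"),
  GRADE = c("A", "A", "A", "B", "A", "B", "B", "C")
)

# Counts of each grade within each cuisine.
grade_counts <- table(toy$CUISINE.DESCRIPTION, toy$GRADE)

# margin = 1 divides each row by its total, giving the share of each grade
# within a cuisine, so cuisines of different sizes become comparable.
grade_shares <- prop.table(grade_counts, margin = 1)
round(grade_shares, 2)
```

Each row of `grade_shares` sums to 1, so a cuisine with a larger share of B grades stands out regardless of how many inspections it has.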

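library(plyr)  # provides count(), which returns a `freq` column

# Frequency of each violation description within each cuisine
counts_american <- count(all_american, 'VIOLATION.DESCRIPTION')
counts_chinese <- count(all_chinese, 'VIOLATION.DESCRIPTION')
counts_italian <- count(all_italian, 'VIOLATION.DESCRIPTION')
counts_mexican <- count(all_mexican, 'VIOLATION.DESCRIPTION')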

head(counts_american[order(-counts_american$freq),])
##                                                                                                                                                                                                                                                              VIOLATION.DESCRIPTION
## 56 Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
## 22                                                                                                                                              Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 30                                                                                                                                  Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 14                                                                                                                                              Cold food item held above 41º F (smoked fish and reduced oxygen packaged foods above 38 ºF) except during necessary preparation.
## 35                                                                                                                                                      Food not protected from potential source of contamination during storage, preparation, transportation, display or service.
## 20                                                                                                                                                                                                 Evidence of mice or live mice present in facility's food and/or non-food areas.
##     freq
## 56 12884
## 22  8669
## 30  7461
## 14  6039
## 35  5586
## 20  5477
head(counts_chinese[order(-counts_chinese$freq),])
##                                                                                                                                                                                                                                                              VIOLATION.DESCRIPTION
## 52 Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
## 21                                                                                                                                              Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 32                                                                                                                                                      Food not protected from potential source of contamination during storage, preparation, transportation, display or service.
## 19                                                                                                                                                                                                 Evidence of mice or live mice present in facility's food and/or non-food areas.
## 14                                                                                                                                              Cold food item held above 41º F (smoked fish and reduced oxygen packaged foods above 38 ºF) except during necessary preparation.
## 43                                                                                                                                                                                                                                     Hot food item not held at or above 140º F.
##    freq
## 52 5820
## 21 4415
## 32 3480
## 19 3237
## 14 3167
## 43 3099
head(counts_italian[order(-counts_italian$freq),])
##                                                                                                                                                                                                                                                              VIOLATION.DESCRIPTION
## 52 Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
## 19                                                                                                                                              Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 26                                                                                                                                  Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 31                                                                                                                                                      Food not protected from potential source of contamination during storage, preparation, transportation, display or service.
## 12                                                                                                                                              Cold food item held above 41º F (smoked fish and reduced oxygen packaged foods above 38 ºF) except during necessary preparation.
## 60                                              Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly.
##    freq
## 52 2254
## 19 1608
## 26 1508
## 31 1300
## 12 1246
## 60 1216
head(counts_mexican[order(-counts_mexican$freq),])
##                                                                                                                                                                                                                                                                                   VIOLATION.DESCRIPTION
## 50                      Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
## 19                                                                                                                                                                   Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 12                                                                                                                                                                   Cold food item held above 41º F (smoked fish and reduced oxygen packaged foods above 38 ºF) except during necessary preparation.
## 17                                                                                                                                                                                                                      Evidence of mice or live mice present in facility's food and/or non-food areas.
## 30                                                                                                                                                                           Food not protected from potential source of contamination during storage, preparation, transportation, display or service.
## 21    Filth flies or food/refuse/sewage-associated (FRSA) flies present in facility's food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies.
##    freq
## 50 1891
## 19 1626
## 12 1048
## 17  979
## 30  977
## 21  954
# ggplot(a_5000, aes(x = CUISINE.DESCRIPTION)) + geom_bar()
t1 <- table(a_5000$CUISINE.DESCRIPTION)
t2 <- table(b_5000$CUISINE.DESCRIPTION)
t3 <- table(c_5000$CUISINE.DESCRIPTION)
t4 <- table(other_5000$CUISINE.DESCRIPTION)
# to check the most popular cuisines
par(mfrow=c(4,1))
plot(sort(t1, decreasing = TRUE)[1:10], type = 'h', cex.axis=0.8 )
plot(sort(t2, decreasing = TRUE)[1:10], type = 'h', cex.axis=0.8 )
plot(sort(t3, decreasing = TRUE)[1:10], type = 'h', cex.axis=0.8 )
plot(sort(t4, decreasing = TRUE)[1:10], type = 'h', cex.axis=0.8 )

The biggest categories are American, Chinese, Cafe/Tea, Italian, Pizza and Mexican across all grades. Some of them change position from grade to grade, but the top 2, American and Chinese, keep their order.

filtered_inspections <- subset(all_inspections, all_inspections$SCORE < 30)
inspections_1000 <- filtered_inspections[sample(nrow(filtered_inspections), 1000),]
hist(inspections_1000$SCORE, breaks=100, xlab="score", main="Frequency of score > random sample of 1000")

summary(inspections_1000$SCORE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -1.00   10.00   13.00   14.62   20.00   29.00
sd(inspections_1000$SCORE)
## [1] 6.629819

The sample taken is random and independent, and represents less than 10% of the population (there are approximately 24000 restaurants in NYC).
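The sampling code above does not fix a random seed, so each run draws different rows and the summary values shift slightly from run to run. The sketch below shows how a seed makes a draw repeatable and how the 10% condition can be checked. The seed value and the `toy_inspections` data frame are assumptions made for illustration only.

```r
set.seed(2024)  # arbitrary seed (an assumption); the original code sets none

# Hypothetical stand-in for the filtered inspection records.
toy_inspections <- data.frame(SCORE = 0:29)

# sample() draws row indices without replacement by default.
draw_1 <- toy_inspections[sample(nrow(toy_inspections), 10), , drop = FALSE]

# Resetting the same seed reproduces exactly the same rows.
set.seed(2024)
draw_2 <- toy_inspections[sample(nrow(toy_inspections), 10), , drop = FALSE]
identical(draw_1, draw_2)

# 10% condition, using the approximate restaurant count quoted above.
population_size <- 24000
sample_size <- 1000
sample_size / population_size < 0.10
```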

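We calculate the standard error as \(SE = \sigma / \sqrt{n}\), using the sample standard deviation as the estimate of \(\sigma\): \(SE = 6.7146 / \sqrt{1000} = 0.21233\)

We can be 95% confident that the mean score lies in the range \(14.61 \pm 1.96 \times 0.21233\), that is (14.1938, 15.0262).

With 99% confidence: \(14.61 \pm 2.58 \times 0.21233\), that is (14.0622, 15.1578).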
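The arithmetic can be reproduced in R. This is a minimal sketch that takes the mean and standard deviation quoted in the text as given; `mean_ci` is a hypothetical helper introduced here, and it uses the exact normal quantile from `qnorm()` rather than the rounded 1.96 and 2.58, so the 99% bounds differ from the hand calculation in the third decimal place.

```r
# Summary statistics quoted in the text (from the author's sample of 1000).
sample_mean <- 14.61
sample_sd   <- 6.7146
n           <- 1000

# Standard error of the mean.
se <- sample_sd / sqrt(n)

# Normal-approximation confidence interval for the mean at a given level.
# `mean_ci` is a hypothetical helper introduced for this illustration.
mean_ci <- function(mean, se, level) {
  z <- qnorm(1 - (1 - level) / 2)  # about 1.96 for 95%, about 2.58 for 99%
  c(lower = mean - z * se, upper = mean + z * se)
}

round(se, 5)
round(mean_ci(sample_mean, se, 0.95), 4)
round(mean_ci(sample_mean, se, 0.99), 4)
```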

Many cases sit close to the A/B cutoff of 13 points, so a lot of restaurants struggle to get into the A grade.

4 - Inference

We use linear regression to estimate how much each cuisine contributes to the overall score.

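# Combine the four cuisines of interest; each subset is included exactly once.
# Note: the code that produced the summary printed below bound the American
# rows in twice, which leaves the coefficient estimates unchanged but affects
# the standard errors and degrees of freedom shown there.
all4 <- rbind(all_american, all_chinese, all_mexican, all_italian)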
res <- lm(SCORE ~ CUISINE.DESCRIPTION, all4 )
summary(res)
## 
## Call:
## lm(formula = SCORE ~ CUISINE.DESCRIPTION, data = all4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.435  -8.058  -5.058   4.942 100.400 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                18.05807    0.03041 593.884  < 2e-16 ***
## CUISINE.DESCRIPTIONChinese  2.37705    0.06906  34.420  < 2e-16 ***
## CUISINE.DESCRIPTIONItalian  0.54238    0.10179   5.328 9.92e-08 ***
## CUISINE.DESCRIPTIONMexican  1.88687    0.10852  17.388  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.47 on 239415 degrees of freedom
##   (13918 observations deleted due to missingness)
## Multiple R-squared:  0.005649,   Adjusted R-squared:  0.005636 
## F-statistic: 453.4 on 3 and 239415 DF,  p-value: < 2.2e-16

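Using the coefficients for the four cuisines selected, we get our regression formula: \(\widehat{SCORE} = 18.0581 + 2.3770 \times Chinese + 0.5424 \times Italian + 1.8869 \times Mexican\), where each cuisine term is 1 for that cuisine and 0 otherwise.

The way to interpret this is that American is the baseline cuisine, so the intercept is its average score and each coefficient is how many more points, on average, the other cuisine scored. Some cuisines have accumulated more violation points in the past, hence the positive coefficients. With this many observations every difference is statistically significant, but the R-squared of 0.0056 means cuisine explains less than 1% of the variation in score, so none of these values is decisive for predicting the score.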
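To make the link between the coefficients and group averages concrete, here is a small sketch on made-up data. It assumes R's default treatment coding, where the first factor level acts as the baseline; the `toy` data frame is hypothetical and only reuses the column names.

```r
# Toy scores for two cuisines (made-up values, not from the inspection data).
toy <- data.frame(
  SCORE = c(10, 12, 14, 16, 18, 20),
  CUISINE.DESCRIPTION = factor(c("American", "American", "American",
                                 "Chinese", "Chinese", "Chinese"))
)

fit <- lm(SCORE ~ CUISINE.DESCRIPTION, data = toy)

# The first factor level (American) is the baseline, so the intercept is its
# mean score and the Chinese coefficient is the difference in group means.
coef(fit)
group_means <- tapply(toy$SCORE, toy$CUISINE.DESCRIPTION, mean)
group_means

# R-squared is the share of score variance explained by cuisine.
summary(fit)$r.squared
```

In this toy case the two groups are well separated, so R-squared is high; in the inspection data the group means differ by only a couple of points against a residual standard error of about 12, which is why R-squared is so small.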

5 - Conclusion

The Cuisine Description is not a determinant of the score: the process used to calculate the score is oriented towards food handling, hygiene and pest control, and different food preparations or stored goods do not explicitly alter this value.