I would like to start the blog post with a visualization I created in CartoDB, using lat/lon coordinates calculated with the Bing geocoding API. The points represent grades handed out by the Department of Health and Mental Hygiene over the past five years. All of the code can be found about halfway down the blog post. It should be noted that this visualization is not meant to convey a specific message; it serves more as a pretty proof of concept.
require(data.table)
require(dplyr)
require(lattice)
require(gdata)
require(ggplot2)
require(taRifx.geo)
require(tidyr)
require(arules)
require(arulesViz)
require(colorspace)
require(foreach)
For my first data science project, I worked with five years of data from the NYC Restaurant Inspection Results data set, which is produced by the NYC Department of Health and Mental Hygiene. Within this data you’ll find all of the inspections, violations, and grades for roughly 26,000 restaurants in New York City.
If you’re not familiar with the NYC Restaurant Inspection Grading system, the basic idea is that inspectors periodically visit restaurants and count up the violations. Each violation carries a certain number of points; the points are tallied and the restaurant receives a grade. Ultimately, the restaurant is required to display the grade at its entryway.
Here is the score breakdown:
## Points:   Grade:
## <14       A
## 14-27     B
## >27       C
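As a quick aside, here is a minimal sketch of how a raw score maps to a letter grade using base R’s cut() function, with the same breaks I use later in the post to estimate hypothetical grades (the example scores are made up):
scores <- c(5, 14, 30)   # a few made-up example scores
cut(scores, breaks = c(0, 13, 27, 200), labels = c("A", "B", "C"), include.lowest = TRUE)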
Let’s get started…
Step 1: Read in the data.
inspectionDF <- fread("DOHMH_New_York_City_Restaurant_Inspection_Results.csv", header = T,
sep=',', na.strings=c("NA","N/A",""), colClasses = "character")
## Read 511830 rows and 18 (of 18) columns from 0.178 GB file in 00:00:04
I then converted the data into a dplyr tbl for easier manipulation with the dplyr verbs.
inspecttbl<-tbl_df(inspectionDF)
It is always a good idea to know what your data looks like. glimpse() gives us a sneak peek at the entries in each column.
glimpse(inspecttbl)
## Observations: 511830
## Variables:
## $ CAMIS (chr) "30075445", "30075445", "30075445", "300...
## $ DBA (chr) "MORRIS PARK BAKE SHOP", "MORRIS PARK BA...
## $ BORO (chr) "BRONX", "BRONX", "BRONX", "BRONX", "BRO...
## $ BUILDING (chr) "1007", "1007", "1007", "1007", "1007", ...
## $ STREET (chr) "MORRIS PARK AVE", "MORRIS PARK AVE", "M...
## $ ZIPCODE (chr) "10462", "10462", "10462", "10462", "104...
## $ PHONE (chr) "7188924968", "7188924968", "7188924968"...
## $ CUISINE DESCRIPTION (chr) "Bakery", "Bakery", "Bakery", "Bakery", ...
## $ INSPECTION DATE (chr) "02/09/2015", "03/03/2014", "10/10/2013"...
## $ ACTION (chr) "Violations were cited in the following ...
## $ VIOLATION CODE (chr) "06C", "10F", NA, "04L", "04N", "04C", "...
## $ VIOLATION DESCRIPTION (chr) "Food not protected from potential sourc...
## $ CRITICAL FLAG (chr) "Critical", "Not Critical", "Not Applica...
## $ SCORE (chr) "6", "2", NA, "6", "6", "32", "32", "32"...
## $ GRADE (chr) "A", "A", NA, "A", "A", NA, NA, NA, NA, ...
## $ GRADE DATE (chr) "02/09/2015", "03/03/2014", NA, "09/11/2...
## $ RECORD DATE (chr) "02/12/2015", "02/12/2015", "02/12/2015"...
## $ INSPECTION TYPE (chr) "Cycle Inspection / Initial Inspection",...
Now it is time to start tidying up the data. One of the first things I did was convert the date columns into Date class objects that R can work with, using the as.Date function.
inspecttbl$"INSPECTION DATE"<-(as.Date(inspecttbl$'INSPECTION DATE', "%m/%d/%Y"))
inspecttbl$"GRADE DATE"<-(as.Date(inspecttbl$'GRADE DATE', "%m/%d/%Y"))
inspecttbl$"RECORD DATE"<-(as.Date(inspecttbl$'RECORD DATE', "%m/%d/%Y"))
class(inspecttbl$'RECORD DATE')
## [1] "Date"
While we are at it, let’s set the Grades as ordered factors
inspecttbl$GRADE <- factor(inspecttbl$GRADE, levels = c("A", "B", "C", "Z", "Not Yet Graded"), ordered = TRUE)
(levels(inspecttbl$GRADE))
## [1] "A" "B" "C" "Z"
## [5] "Not Yet Graded"
As well as the Cuisine Descriptions and Violation Codes and Descriptions
inspecttbl$`CUISINE DESCRIPTION` <- factor(inspecttbl$`CUISINE DESCRIPTION`)
#inspecttbl$`VIOLATION CODE` <- factor(inspecttbl$`VIOLATION CODE`)
#inspecttbl$`VIOLATION DESCRIPTION` <- factor(inspecttbl$`VIOLATION DESCRIPTION`)
head(levels(inspecttbl$`CUISINE DESCRIPTION`))
## [1] "Afghan" "African" "American " "Armenian" "Asian"
## [6] "Australian"
#levels(inspecttbl$`VIOLATION CODE`)
#levels(inspecttbl$`VIOLATION DESCRIPTION`)
Let’s set the Critical flag as an ordered factor
inspecttbl$`CRITICAL FLAG` <- factor(inspecttbl$`CRITICAL FLAG`, levels = c("Critical", "Not Critical", "Not Applicable"), ordered = T)
levels(inspecttbl$`CRITICAL FLAG`)
## [1] "Critical" "Not Critical" "Not Applicable"
Let’s now set the Identifier and scores as numeric vectors
inspecttbl$CAMIS <- as.numeric(inspecttbl$CAMIS)
inspecttbl$SCORE <- as.numeric(inspecttbl$SCORE)
class(inspecttbl$CAMIS)
## [1] "numeric"
class(inspecttbl$SCORE)
## [1] "numeric"
The number of distinct restaurants
n_distinct(inspecttbl$CAMIS)
## [1] 25396
Next, let’s take a look at how far back these inspections go
range(inspecttbl$`INSPECTION DATE`)
## [1] "1900-01-01" "2015-02-11"
That’s strange, which brings me to my next point: documentation. The readme file can sometimes be your best friend. I wouldn’t have been able to work with this data at all had there been no documentation.
When you look at the documentation, you’ll see that the January 1st, 1900 date is used as a placeholder for restaurants that have not been inspected yet.
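Before dropping them, a quick sanity check (just a sketch; the exact count depends on the snapshot of the data you download) shows how many of these placeholder rows are present:
sum(inspecttbl$`INSPECTION DATE` == as.Date("1900-01-01"), na.rm = TRUE)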
So, let’s clean those rows in which the date is listed as “1900-01-01”
inspecttbl<-filter(inspecttbl, `INSPECTION DATE` != "1900-01-01")
Great, this cuts out 797 entries. Let’s see what our range looks like now
range(inspecttbl$`INSPECTION DATE`)
## [1] "2008-09-18" "2015-02-11"
That’s more like it.
If we take another look at the documentation we’ll see that some of these entries do not provide us with the information we are looking for. For instance, some of the inspection types inappropriately color our data. Trans Fat and Calorie type inspections shouldn’t count because they are not gradable and they are often performed before a scheduled inspection. Thus the establishments were not prepared specifically for food safety inspections. Compliance, Special Program, and Administrative violations similarly color the data, and thus need to be excluded.
At the end of the day, there are only four gradable inspection types. Since what we are interested in is the report card data, we can keep only the rows with gradable types. Those types are:
Cycle Inspection / Initial Inspection
Cycle Inspection / Re-inspection
Pre-permit (Operational) / Initial Inspection
Pre-permit (Operational) / Re-inspection
Now, before running a filter for these four types, let’s make sure there are no subtle variations in their spelling; otherwise our filter would unnecessarily exclude the variants. We can use the levels and as.factor functions to facilitate this inquiry.
levels(as.factor(inspecttbl$`INSPECTION TYPE`))
## [1] "Administrative Miscellaneous / Compliance Inspection"
## [2] "Administrative Miscellaneous / Initial Inspection"
## [3] "Administrative Miscellaneous / Re-inspection"
## [4] "Administrative Miscellaneous / Reopening Inspection"
## [5] "Administrative Miscellaneous / Second Compliance Inspection"
## [6] "Calorie Posting / Compliance Inspection"
## [7] "Calorie Posting / Initial Inspection"
## [8] "Calorie Posting / Re-inspection"
## [9] "Calorie Posting / Second Compliance Inspection"
## [10] "Cycle Inspection / Compliance Inspection"
## [11] "Cycle Inspection / Initial Inspection"
## [12] "Cycle Inspection / Re-inspection"
## [13] "Cycle Inspection / Reopening Inspection"
## [14] "Cycle Inspection / Second Compliance Inspection"
## [15] "Inter-Agency Task Force / Initial Inspection"
## [16] "Inter-Agency Task Force / Re-inspection"
## [17] "Pre-permit (Non-operational) / Compliance Inspection"
## [18] "Pre-permit (Non-operational) / Initial Inspection"
## [19] "Pre-permit (Non-operational) / Re-inspection"
## [20] "Pre-permit (Non-operational) / Second Compliance Inspection"
## [21] "Pre-permit (Operational) / Compliance Inspection"
## [22] "Pre-permit (Operational) / Initial Inspection"
## [23] "Pre-permit (Operational) / Re-inspection"
## [24] "Pre-permit (Operational) / Reopening Inspection"
## [25] "Pre-permit (Operational) / Second Compliance Inspection"
## [26] "Smoke-Free Air Act / Complaint (Initial Inspection)"
## [27] "Smoke-Free Air Act / Compliance Inspection"
## [28] "Smoke-Free Air Act / Initial Inspection"
## [29] "Smoke-Free Air Act / Limited Inspection"
## [30] "Smoke-Free Air Act / Re-inspection"
## [31] "Smoke-Free Air Act / Second Compliance Inspection"
## [32] "Trans Fat / Compliance Inspection"
## [33] "Trans Fat / Initial Inspection"
## [34] "Trans Fat / Re-inspection"
## [35] "Trans Fat / Second Compliance Inspection"
It looks like these inspection types are consistently labeled, especially the ones we want, so let’s focus our data on the gradable inspection types.
gradables <- c("Cycle Inspection / Initial Inspection" , "Cycle Inspection / Re-inspection",
"Pre-permit (Operational) / Initial Inspection", "Pre-permit (Operational) / Re-inspection")
inspecttbl <- (filter(inspecttbl, `INSPECTION TYPE` %in% gradables))
Let’s confirm that the filter worked as anticipated
levels(as.factor(inspecttbl$`INSPECTION TYPE`))
## [1] "Cycle Inspection / Initial Inspection"
## [2] "Cycle Inspection / Re-inspection"
## [3] "Pre-permit (Operational) / Initial Inspection"
## [4] "Pre-permit (Operational) / Re-inspection"
Great - we can see that what we are left with are only the rows that contain gradable inspection types.
We still have approximately 461,000 observations. This is good.
However, if we look into these observations, we see that not every entry records a grade. Of these 461,000 observations, only a fraction have usable grades, since each violation was given its own observation row. When a restaurant gets an inspection, only one of the rows associated with that inspection date might have a grade. Let’s take a look at what happens when we filter our results for the rows that contain actual grades.
grades <- c("A", "B", "C")
nrow(filter(inspecttbl, GRADE %in% grades))
## [1] 216868
We see that we have about 217,000 entries with actual grades in them. Not bad
When we take a look at our data table, it can be difficult to get a sense of the big picture on a per-restaurant basis since there is almost too much information. Each violation is labeled as its own observation, making the data difficult to read. So, let’s create a new data frame, one that compounds the violations into one observation per restaurant per inspection day. To do that, I used the select function to remove superfluous columns and the distinct function to aggregate rows with identical information.
shrunkenData <- (distinct(select(inspecttbl, CAMIS, DBA, BORO, BUILDING, STREET, ZIPCODE, `CUISINE DESCRIPTION`, `INSPECTION DATE`, ACTION, SCORE, GRADE, `INSPECTION TYPE`)))
We can verify that each inspection date has one observation by peeking into the data.
shrunkenData$`INSPECTION DATE`[1:20]
## [1] "2015-02-09" "2014-03-03" "2013-09-11" "2013-08-14" "2013-01-24"
## [6] "2012-12-31" "2014-12-30" "2014-11-13" "2014-07-01" "2014-06-05"
## [11] "2013-04-30" "2012-05-08" "2014-09-06" "2013-07-22" "2012-07-31"
## [16] "2011-12-29" "2011-12-15" "2014-06-10" "2013-06-05" "2012-04-13"
A quick nrow() shows that we now have about 149,000 unique inspections in the data.
nrow(shrunkenData)
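A more direct check (a sketch; a handful of restaurants could legitimately have more than one gradable row on the same day) is to count how many CAMIS and inspection-date combinations still appear more than once:
shrunkenData %>%
  group_by(CAMIS, `INSPECTION DATE`) %>%
  summarize(n = n()) %>%
  filter(n > 1) %>%
  nrow()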
Now would be a good time to take a look at our data and ask which restaurants posted the absolute worst scores but were somehow allowed to stay open to receive a grade:
print(worstOfList <- arrange(filter(shrunkenData, GRADE %in% grades), desc(SCORE))[1:20,c(2,10)])
## Source: local data frame [20 x 2]
##
## DBA SCORE
## 1 MURALS ON 54/RANDOLPHS'S 131
## 2 BELLA NAPOLI 98
## 3 BALUCHI'S INDIAN FOOD 98
## 4 GANDHI 92
## 5 CONCRETE RESTAURANT 90
## 6 WEST 79TH STREET BOAT BASIN CAFE 89
## 7 D & Y RESTAURANT 86
## 8 SPICY SHALLOT 84
## 9 BISTRO CATERERS 84
## 10 CAFE R 82
## 11 LA POTENCIA RESTAURANT 82
## 12 ANELLA 81
## 13 B.B. KINGS 80
## 14 LA TRATTORIA 79
## 15 LEXINGTON RESTAURANT 79
## 16 MARS CAFE 78
## 17 GAL BI MA EUL 78
## 18 CHEIKH UMAR FUTIYU RESTAURANT 78
## 19 FORTUNATO BROS CAFE & BAKERY 77
## 20 MIDTOWN BUFFET 77
The next plot shows our Worst of the Worst restaurants in graphical form, with the minimum thresholds for B’s and C’s labeled as yellow and red lines, respectively.
worstOfList$DBA <- factor(worstOfList$DBA, levels = rev(worstOfList$DBA))
ggplot(worstOfList, aes(x = `DBA`, y = `SCORE`)) +
geom_bar(stat = 'identity', aes(fill = SCORE)) +
scale_fill_gradient(
low = "lightgreen",
high = "green3", guide = 'none') +
geom_hline(yintercept = 14, color = "yellow") +
geom_hline(yintercept = 27, color = "red") +
theme(text = element_text(size=10),
axis.text.x = element_text(angle=45, vjust=1, color = "black")) +
labs(title="Worst of the Worst")
In order to explore the data a bit more, I thought it would be interesting to look at the average grades for restaurants when grouped together by cuisine.
groupCuisine <- group_by(shrunkenData, `CUISINE DESCRIPTION`)
print(cuisineSummary <- summarize(groupCuisine,
"A's" = sum(GRADE == "A", na.rm = T)/sum(GRADE == "A" | GRADE == "B" | GRADE == "C", na.rm = T),
"B's" = sum(GRADE == "B", na.rm = T)/sum(GRADE == "A" | GRADE == "B" | GRADE == "C", na.rm = T),
"C's" = sum(GRADE == "C", na.rm = T)/sum(GRADE == "A" | GRADE == "B" | GRADE == "C", na.rm = T),
"Totals" = sum(GRADE == "A" | GRADE == "B" | GRADE == "C", na.rm = T)
))
## Source: local data frame [84 x 5]
##
## CUISINE DESCRIPTION A's B's C's Totals
## 1 Afghan 0.8636364 0.11363636 0.022727273 44
## 2 African 0.7208333 0.21666667 0.062500000 240
## 3 American 0.8615154 0.11603599 0.022448620 22674
## 4 Armenian 0.9230769 0.06593407 0.010989011 91
## 5 Asian 0.7355534 0.23016650 0.034280118 1021
## 6 Australian 0.9767442 0.02325581 0.000000000 43
## 7 Bagels/Pretzels 0.8871951 0.09756098 0.015243902 656
## 8 Bakery 0.8330888 0.14636831 0.020542920 2726
## 9 Bangladeshi 0.7000000 0.29090909 0.009090909 110
## 10 Barbecue 0.9069767 0.06976744 0.023255814 172
## .. ... ... ... ... ...
Here are the top 20 Cuisines with the highest proportion of A’s
arrange(cuisineSummary, desc(`A's`))[1:20,]
## Source: local data frame [20 x 5]
##
## CUISINE DESCRIPTION A's B's C's Totals
## 1 Californian 1.0000000 0.00000000 0.000000000 2
## 2 Chilean 1.0000000 0.00000000 0.000000000 1
## 3 Creole/Cajun 1.0000000 0.00000000 0.000000000 3
## 4 Hawaiian 1.0000000 0.00000000 0.000000000 6
## 5 Australian 0.9767442 0.02325581 0.000000000 43
## 6 Not Listed/Not Applicable 0.9714286 0.02857143 0.000000000 35
## 7 Other 0.9594272 0.03818616 0.002386635 419
## 8 Donuts 0.9532863 0.04073873 0.005975014 1841
## 9 Ice Cream, Gelato, Yogurt, Ices 0.9521436 0.04087737 0.006979063 1003
## 10 Salads 0.9508197 0.04918033 0.000000000 122
## 11 Fruits/Vegetables 0.9473684 0.05263158 0.000000000 19
## 12 Juice, Smoothies, Fruit Salads 0.9449794 0.04951857 0.005502063 727
## 13 Sandwiches 0.9416717 0.05171377 0.006614552 1663
## 14 Soups & Sandwiches 0.9411765 0.04705882 0.011764706 170
## 15 Hotdogs/Pretzels 0.9400000 0.06000000 0.000000000 50
## 16 Café/Coffee/Tea 0.9373446 0.05469915 0.007956241 4022
## 17 Ethiopian 0.9322034 0.06779661 0.000000000 59
## 18 Sandwiches/Salads/Mixed Buffet 0.9273183 0.06015038 0.012531328 798
## 19 Hamburgers 0.9255014 0.06074499 0.013753582 1745
## 20 Armenian 0.9230769 0.06593407 0.010989011 91
Here are the top 20 Cuisines with the highest proportion of C’s
arrange(cuisineSummary, desc(`C's`))[c(1:20),]
## Source: local data frame [20 x 5]
##
## CUISINE DESCRIPTION
## 1 Iranian
## 2 Chinese/Japanese
## 3 Chinese/Cuban
## 4 African
## 5 Korean
## 6 Eastern European
## 7 Pakistani
## 8 Indian
## 9 Latin (Cuban, Dominican, Puerto Rican, South & Central American)
## 10 English
## 11 Peruvian
## 12 Thai
## 13 Brazilian
## 14 Delicatessen
## 15 Scandinavian
## 16 Russian
## 17 Spanish
## 18 Indonesian
## 19 Mexican
## 20 Japanese
## Variables not shown: A's (dbl), B's (dbl), C's (dbl), Totals (int)
It turns out there is another wrinkle in the grading system. As a sort of mercy rule, the city’s policy is that restaurants do not receive a grade lower than an A upon their initial inspection. Here is how it works: if an establishment scores fewer than 14 points on its initial inspection, it gets its A and everyone is happy.
However, if the restaurant scores in the B or C range on their initial inspection, this score is not reported and they are given a grace period of one week to fix their violations before they are given their reinspection score. Whatever score they receive upon their re-inspection then becomes the score that sticks.
When you think about it, this second chance system actually seems like a well thought out policy meant to give restaurants a second opportunity to earn an A. Rather than manipulating the grading system as a means to generate revenue, this type of regulation seems well suited toward actually raising the standards for health and food safety for the benefit of New Yorkers.
What this means for people looking at the data is that any proportional analysis of the grades must be done without initial inspection scores since the only grades that are recorded for the initial inspection are either A’s or NA’s.
However, if we only look at re-inspection scores, then we are effectively discarding all of the A’s that were awarded during initial inspections. So, caveats are necessary when looking at summary statistics.
Let’s look at a summary of all final grades:
summary(shrunkenData$GRADE)
## A B C Z Not Yet Graded
## 74815 12276 2289 1354 191
## NA's
## 58035
Summary of all reinspections that resulted in a final score:
summary(filter(shrunkenData, GRADE %in% grades, `INSPECTION TYPE` != "Cycle Inspection / Initial Inspection", `INSPECTION TYPE` != "Pre-permit (Operational) / Initial Inspection" )$GRADE)
## A B C Z Not Yet Graded
## 39530 12276 2289 0 0
This means that only about 35,000 of the A’s (74,815 minus 39,530, or 35,285) were earned on the initial inspection.
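We can also compute that number directly. This is just a sketch; grepl() is simply a convenient way to match the two initial-inspection labels at once.
shrunkenData %>%
  filter(GRADE == "A", grepl("Initial Inspection", `INSPECTION TYPE`)) %>%
  nrow()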
To see the ramifications of this rule, let’s first look at a bar chart of all of the grades of restaurants allowed to stay open by the DOHMH
filter1 <- filter(shrunkenData, GRADE %in% grades)
Let’s reset the factors and plot the results
filter(shrunkenData, GRADE %in% grades)$GRADE%>%
factor(levels = c("A", "B", "C"), ordered= T) %>%
table()%>%
barchart(horizontal = F, main = "Frequency of Grades for Establishments Allowed to Stay Open")
So, judging by the overall proportions of grades among open restaurants, chances are that you’ll find an A-rated restaurant without too much work.
If we want to see what this chart would look like without the city’s two-strikes policy, we can estimate grades from the unpublished results of the initial inspections and recompute the chart using those theoretical grades.
First, we filter for the initial inspections and pull out the identifier, score, and date data.
firstStrike <- select(filter(shrunkenData, `INSPECTION TYPE`=="Cycle Inspection / Initial Inspection" | `INSPECTION TYPE` == "Pre-permit (Operational) / Initial Inspection"), CAMIS, SCORE, `INSPECTION DATE`)
Then we bin the scores into grade categories and mutate the hypothetical grades onto the data as a new column.
GRADEHYP <- cut(firstStrike$SCORE, breaks = c(0, 13, 27, 200),
labels = c("A", "B", "C"), include.lowest = T)
firstGrades <- mutate(firstStrike, GRADEHYP)
shrunkenData <- left_join(shrunkenData, firstGrades, by = c("CAMIS", "SCORE", "INSPECTION DATE"))
Now we plot using the lattice barchart and get VERY different results
firstGrades$GRADEHYP %>%
factor(levels = c("A", "B", "C"), ordered= T) %>%
table()%>%
barchart(horizontal = F, main = "Frequency of Estimated Grades upon Initial Inspection")
So if they actually gave out grades on their initial inspections, this is what it would look like.
It really lets you know how much an establishment can shape up once it is given a warning.
This led me to my next question: how far are these restaurants falling each year, from the date of their A inspection to their next initial inspection?
In a sense, this is the information that gets to the heart of the regulation. If restaurants consistently score worse than A level during the year but magically turn things around for their second try (the re-inspection), then that ‘A’ rating really isn’t a very good metric for the kind of conditions you can expect at that establishment. And whatever practices they use to score an A on their re-inspection are not what they practice throughout the year.
So, in order to find the restaurants that have figured out how to game the system, let’s first pull all of the A’s that were handed out.
justAs <- filter(shrunkenData, GRADE =="A")
table(justAs$`INSPECTION TYPE`)
##
## Cycle Inspection / Initial Inspection
## 32392
## Cycle Inspection / Re-inspection
## 36180
## Pre-permit (Operational) / Initial Inspection
## 2893
## Pre-permit (Operational) / Re-inspection
## 3350
We can see that there are more re-inspections than initial inspections. The establishments that earned an A on their initial inspection are not the focus of this inquiry. What I am interested in is the average amount that restaurants fall after receiving their A rating.
By looking at the average difference between a restaurant’s A score and their next score, we can get a metric for the overall efficacy of the law itself.
system.time({
  diff <- vector()
  pair1 <- vector()
  pair2 <- vector()
  for (i in 1:(nrow(shrunkenData) - 1)){
    # only consider rows that received an A
    if (!is.na(shrunkenData[i, ]$GRADE) && shrunkenData[i, ]$GRADE == 'A'){
      # only pair with the row below if it belongs to the same restaurant
      if (shrunkenData[i, ]$CAMIS == shrunkenData[i + 1, ]$CAMIS){
        diff <- append(diff, shrunkenData[i, ]$SCORE - shrunkenData[i + 1, ]$SCORE)
        pair1 <- append(pair1, shrunkenData[i, ]$SCORE)
        pair2 <- append(pair2, shrunkenData[i + 1, ]$SCORE)
      }
    }
  }
})
save(diff, file = "diff.bin")
save(pair1, file = "pair1.bin")
save(pair2, file = "pair2.bin")
# the loop takes about an hour to run
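As an aside, the same pairing can be expressed with dplyr’s lead() function, which should reproduce the loop’s pairs in seconds rather than an hour. This is only a sketch, and it assumes, as the inspection dates above suggest, that shrunkenData keeps each restaurant’s inspections in reverse chronological order.
# For every row graded "A", pair its score with the score on the row below it,
# which is the same restaurant's previous inspection in this ordering.
pairsDF <- shrunkenData %>%
  mutate(nextCAMIS = lead(CAMIS),
         nextSCORE = lead(SCORE)) %>%
  filter(!is.na(GRADE), GRADE == "A", CAMIS == nextCAMIS) %>%
  transmute(pair1 = SCORE, pair2 = nextSCORE, diff = SCORE - nextSCORE)
summary(pairsDF$diff)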
If we take the mean of the differences, we get a result of about -8.36.
Unfortunately this result is not quite as conclusive as I had hoped it would be.
summary(diff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -146.000 -14.000 -6.000 -8.359 0.000 14.000
boxplot(diff, horizontal = T)
hist(x = diff, col = "green3", main = "Histogram of Score Drops", xlab = "Score Drop")
abline(v = mean(diff), lwd = 4)
abline(v = 0, lwd= 4)
sd(diff)
## [1] 11.62465
t.test(pair1, pair2, alternative = "two.sided", paired = TRUE)
##
## Paired t-test
##
## data: pair1 and pair2
## t = -183.4119, df = 65055, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.448512 -8.269854
## sample estimates:
## mean of the differences
## -8.359183
The paired t-test tells us that restaurants clearly lower their standards after receiving an A. However, the mean difference of -8 is really not all that terrible. It is entirely possible to score an A on one review, drop 8 points and still receive an A on your next review. If the mean had been significantly lower, it would have been easier to make a conclusion about the overall efficacy of the regulations. As such, I would not suggest that the regulations, as crafted, require revision.
I wanted to represent the locations of the restaurants overlaid on a map of the city. There are a number of questions you can answer by having a visual representation of the data, but the first step was to take the condensed address data and get a latitude and longitude for each establishment. That meant folding the data into a data frame containing only the unique addresses that needed coordinates, along with their CAMIS IDs.
When we look at the data, we see that a number of addresses have leading and trailing whitespace, so we have to use the trim() function in the gdata package to clean up the data and make it presentable for our geocoding API. Next, we paste the relevant fields together into a single address string to feed to the API, and then drop the duplicates, since we want as few API calls as possible.
allAddresses <- unique(cbind(shrunkenData$CAMIS, paste(paste(paste(paste(trim(shrunkenData$BUILDING),
trim(shrunkenData$STREET), sep = " "), trim(shrunkenData$BORO), sep = ", "),
"NY", sep = " "), trim(shrunkenData$ZIPCODE), sep = " ")))
This leaves us with 24,311 entries. Google Maps has a geocoding API, but its daily limit is capped at 2,500 requests, so I chose Bing, which has a higher limit.
At the time of this writing, CRAN did not have a version of taRifx.geo that was compatible with Bing searches. In order to get it to work, I had to install gdal from Homebrew (brew install gdal), then install the rgeos and rgdal packages from source, and then install the development version of taRifx.geo from GitHub, using these commands:
install.packages('rgeos', type='source')
install.packages('rgdal', type='source')
devtools::install_github("gsk3/taRifx.geo")
Note: this is on OS X Yosemite.
Let’s start geocoding the addresses. I split them up into manageable chunks to make it easier to keep track of them if I hit the daily limits.
Split the rows
first <- allAddresses[1:2000,2]
second <- allAddresses[2001:10000,2]
third <- allAddresses[10001:20000,2]
fourth <- allAddresses[20001:24311,2]
# geocode the vectors
#options(BingMapsKey="Aul-REDACTED")
#coord1 = lapply(first, function(x) geocode(x, service = 'bing', returntype = 'coordinates'))
#coord2 = lapply(second, function(x) geocode(x, service = 'bing', returntype = 'coordinates'))
#system.time({coord3 = lapply(third, function(x) geocode(x, service = 'bing', returntype = 'coordinates'))})
#coord4 = lapply(fourth, function(x) geocode(x, service = 'bing', returntype = 'coordinates'))
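One practical note: with thousands of calls, a single malformed address or transient network error can abort an entire lapply() batch. A small wrapper (a sketch that reuses the same geocode() call shown above) returns NAs for the failures instead:
safeGeocode <- function(addr) {
  tryCatch(geocode(addr, service = 'bing', returntype = 'coordinates'),
           error = function(e) c(NA_real_, NA_real_))   # keep going, fill failures with NAs
}
#coord1 = lapply(first, safeGeocode)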
Since the data returned from the geocode function comes back as two rows (latitude and longitude) spread across one column per address, we need to reshape it into a two-column data frame and then stitch the chunks back together in order.
temp1 <-(gather(data.frame(coord1)))
odds <- seq(1, by=2, len = 2000)
temp1 <- (data.frame(cbind(temp1[odds,2], temp1[odds+1, 2])))
temp2 <-(gather(data.frame(coord2)))
odds <- seq(1, by=2, len = 8000)
temp2 <- (data.frame(cbind(temp2[odds,2], temp2[odds+1, 2])))
temp3 <-(gather(data.frame(coord3)))
odds <- seq(1, by=2, len = 10000)
temp3 <- (data.frame(cbind(temp3[odds,2], temp3[odds+1, 2])))
temp4 <-(gather(data.frame(coord4)))
odds <- seq(1, by=2, len = 4311)
temp4 <- (data.frame(cbind(temp4[odds,2], temp4[odds+1, 2])))
coords <- bind_rows(temp1, temp2, temp3, temp4)
colnames(coords)<- (c("lat", "lon"))
write.csv(file = "icoords.csv", coords)
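The four temp blocks above are nearly identical, so they could be collapsed into a small helper. This is only a sketch, and it assumes each coordN object is a list of two-element c(lat, lon) vectors, which is what the reshaping above implies:
coordsToDF <- function(coordList) {
  m <- do.call(rbind, coordList)            # one row per address: lat in column 1, lon in column 2
  data.frame(lat = m[, 1], lon = m[, 2])
}
#coords <- bind_rows(lapply(list(coord1, coord2, coord3, coord4), coordsToDF))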
Let’s take a look at the final product
head(coords)
Now we have to combine the coordinates with their respective CAMIS IDs, and then join the coordinate data onto the rest of the grade data. I used dplyr’s left_join function to join on the CAMIS ID.
allAddresses <- bind_cols(data.frame(allAddresses[,1]), coords)
colnames(allAddresses)<- c("CAMIS", "LAT", "LON")
allAddresses$CAMIS<- as.numeric(as.character(allAddresses$CAMIS))
View(left_join(shrunkenData, allAddresses, by = "CAMIS"))
shrunkenData <- left_join(shrunkenData, allAddresses, by = "CAMIS")
#write.csv(file = "shrunkenData.csv", shrunkenData)
At this point I thought it would be fun to visualize the geocodes of the restaurants.
I loaded the data into a CartoDB map and plotted the grades as they were awarded over time. In the end, the results are not very instructive; they are mainly there as a proof of concept.
CartoDB Link - http://cdb.io/1D4DACe
[CartoDB embedded map]
I was curious to see whether some of the violations commonly occur in tandem with other violations. For instance, if an inspector spots evidence of mice, do they usually also flag the establishment for evidence of rats?
Or maybe if a food contact surface is not properly maintained, that surface might also be improperly constructed.
Since we have a machine learning algorithm that can do this for us, it probably makes more sense to get the computer to crunch the numbers rather than take guesses.
Here is a peek at the violation descriptions and codes available to the inspectors
head(levels(as.factor(inspecttbl$`VIOLATION DESCRIPTION`)))
## [1] "''''Wash hands\032 sign not posted at hand wash facility."
## [2] "Accurate thermometer not provided in refrigerated or hot holding equipment."
## [3] "Appropriately scaled metal stem-type thermometer or thermocouple not provided or used to evaluate temperatures of potentially hazardous foods during cooking, cooling, reheating and holding."
## [4] "Bulb not shielded or shatterproof, in areas where there is extreme heat, temperature changes, or where accidental contact may occur."
## [5] "Canned food product observed dented and not segregated from other consumable food items."
## [6] "Canned food product observed swollen, leaking or rusted, and not segregated from other consumable food items ."
head(levels(as.factor(inspecttbl$`VIOLATION CODE`)))
## [1] "02A" "02B" "02C" "02D" "02E" "02F"
There is a discrepancy between the number of codes and the number of descriptions. If you look closely at the descriptions, you will see that a few of them are nearly identical. My theory is that the descriptions were modified at some point to make them more accurate, while the same codes were kept for the similar descriptions.
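We can quantify that discrepancy with a quick sketch (exact counts depend on the data snapshot), listing the codes that carry more than one wording of their description:
n_distinct(inspecttbl$`VIOLATION CODE`)
n_distinct(inspecttbl$`VIOLATION DESCRIPTION`)
inspecttbl %>%
  filter(!is.na(`VIOLATION CODE`)) %>%
  group_by(`VIOLATION CODE`) %>%
  summarize(nDescriptions = n_distinct(`VIOLATION DESCRIPTION`)) %>%
  filter(nDescriptions > 1)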
In order to create the groupings of violations by inspection from the raw observations, I had to group them together by restaurant first, then by inspection date. I then concatenated the violations into a vector and placed all of the inspection violation vectors into a list to be processed by the association rule algorithm in the arules package.
codeVectorizer <- function(dTab){
masterList <- list()
for (i in 1:length(unique(dTab$CAMIS))){ # Cycle through inspection groups for individual restaurants
subset <- filter(dTab, CAMIS == unique(dTab$CAMIS)[i]) # Save a variable for easier reference
for (j in 1:length(unique(subset$`INSPECTION DATE`))){ # Cycle through unique inspection dates and
# append vectors of codes to master list
masterList[[length(masterList) + 1]] <-
(unique(filter(subset, `INSPECTION DATE` == unique(subset$`INSPECTION DATE`)[j])$`VIOLATION CODE`))
}
}
return(masterList)
}
Let’s run the code to create our set of “transactions”. It takes about 11 minutes. If it took substantially longer, I would have worked on parallelizing the code, but I didn’t find it necessary.
system.time({
aList <- codeVectorizer(inspecttbl)
})
#save(aList, file = "aList2.bin")
names(aList) <- paste0('INSP', 1:length(aList))
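As an aside, essentially the same list can be built in seconds with base R’s split(), grouping the violation codes by restaurant and inspection date directly. This is just a sketch; it drops NA codes, and the order of the inspections will differ from the loop above, so treat it as an alternative rather than a drop-in replacement.
violSplit <- split(inspecttbl$`VIOLATION CODE`,
                   list(inspecttbl$CAMIS, inspecttbl$`INSPECTION DATE`),
                   drop = TRUE)                      # one element per restaurant-inspection
violSplit <- lapply(violSplit, function(x) unique(x[!is.na(x)]))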
Before we begin, we have to convert our list of vectors into a transactions class object
aList <- as(aList, 'transactions')
Let’s look at the most common violations
itemFrequencyPlot((aList), cex = .8, topN = 40, type = 'absolute', col = 'darkgreen')
The summary function built into arules gives us the most frequent items found in the transaction data.
summary(aList)
## transactions as itemMatrix in sparse format with
## 148958 rows (elements/itemsets/transactions) and
## 79 columns (items) and a density of 0.03904176
##
## most frequent items:
## 10F 08A 02G 04L 06D (Other)
## 62250 46737 42202 35455 30011 242776
##
## element (itemset/transaction) length distribution:
## sizes
## 0 1 2 3 4 5 6 7 8 9 10 11
## 1379 17970 45509 36667 22139 12541 6328 3223 1625 807 398 189
## 12 13 14 15 16 17
## 108 39 24 6 1 5
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 3.084 4.000 17.000
##
## includes extended item information - examples:
## labels
## 1 02A
## 2 02B
## 3 02C
##
## includes extended transaction information - examples:
## transactionID
## 1 INSP1
## 2 INSP2
## 3 INSP3
I wanted to access that vector of most frequent items without having to pull out the information manually. This introduced me to the concept of slots in an S4 object. The output of summary is an S4 object, and the information is saved in a slot called “itemSummary”. I discovered this by running str on the summary itself.
str(summary(aList))
## Formal class 'summary.transactions' [package "arules"] with 7 slots
## ..@ transactionInfo:'data.frame': 3 obs. of 1 variable:
## .. ..$ transactionID: Factor w/ 148958 levels "INSP1","INSP10",..: 1 60071 71182
## ..@ Dim : int [1:2] 148958 79
## ..@ density : num 0.039
## ..@ itemSummary : Named int [1:6] 62250 46737 42202 35455 30011 242776
## .. ..- attr(*, "names")= chr [1:6] "10F" "08A" "02G" "04L" ...
## ..@ lengths : 'table' int [1:18(1d)] 1379 17970 45509 36667 22139 12541 6328 3223 1625 807 ...
## .. ..- attr(*, "dimnames")=List of 1
## .. .. ..$ sizes: chr [1:18] "0" "1" "2" "3" ...
## ..@ lengthSummary :Classes 'summaryDefault', 'table' Named num [1:6] 0 2 3 3.08 4 ...
## .. .. ..- attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...
## ..@ itemInfo :'data.frame': 3 obs. of 1 variable:
## .. ..$ labels:Class 'AsIs' chr [1:3] "02A" "02B" "02C"
Another way to do this is with slotNames
slotNames(summary(aList))
## [1] "transactionInfo" "Dim" "density" "itemSummary"
## [5] "lengths" "lengthSummary" "itemInfo"
In order to access slots, you can use the slot function like so…
slot(summary(aList), "itemSummary")
## 10F 08A 02G 04L 06D (Other)
## 62250 46737 42202 35455 30011 242776
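Equivalently, S4 slots can be read directly with the @ operator:
summary(aList)@itemSummary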
Great. Now to extract the info I wanted (the top 5 most frequent violations)
names(slot(summary(aList), "itemSummary"))[1:5]
## [1] "10F" "08A" "02G" "04L" "06D"
topViols <- names(slot(summary(aList), "itemSummary"))[1:5]
The goal here is to map those top five codes back to their full descriptions
codeDesc <- distinct(inspecttbl[,c(11,12)], `VIOLATION CODE`)   # one row per violation code
topDesc <- codeDesc[codeDesc$`VIOLATION CODE` %in% topViols,]
topDesc$`VIOLATION CODE` <- factor(topDesc$`VIOLATION CODE`, levels = topViols, ordered = F)
topDesc <- topDesc[order(topDesc$`VIOLATION CODE`),][,2]
print.AsIs(topDesc)
## VIOLATION DESCRIPTION
## 1 Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
## 2 Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 3 Cold food item held above 41º F (smoked fish and reduced oxygen packaged foods above 38 ºF) except during necessary preparation.
## 4 Evidence of mice or live mice present in facility's food and/or non-food areas.
## 5 Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
Association rule learning can be an immense project if you don’t constrain your search. For example, a retailer with only 100 items could have on the order of 2^100, or about 10^30, possible itemsets that a learner would have to evaluate.
The apriori algorithm in the arules package constrains the search with minsup (minimum support) and minconf (minimum confidence) parameters, which speeds up the process tremendously by only keeping itemsets and rules that clear those thresholds.
firstRules <- apriori(aList)
##
## Parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## 0.8 0.1 1 none FALSE TRUE 0.1 1 10
## target ext
## rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[79 item(s), 148958 transaction(s)] done [0.03s].
## sorting and recoding items ... [9 item(s)] done [0.01s].
## creating transaction tree ... done [0.06s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [1 rule(s)] done [0.00s].
## creating S4 object ... done [0.02s].
inspect(firstRules[1, ])
## lhs rhs support confidence lift
## 1 {04L} => {08A} 0.1916849 0.8053307 2.566713
#inspect(sort(firstRules, by="lift")[1, ])
This returned only one rule, likely because the default support and confidence parameters are a bit too high for our data set.
The default behavior is to mine rules with support 0.1, confidence 0.8, and maxlen 10. Let’s lower them a bit and see what we get.
secondRules <- apriori(aList, parameter=list(support=0.006,
confidence=0.25,
minlen=2))
##
## Parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## 0.25 0.1 1 none FALSE TRUE 0.006 2 10
## target ext
## rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[79 item(s), 148958 transaction(s)] done [0.04s].
## sorting and recoding items ... [32 item(s)] done [0.01s].
## creating transaction tree ... done [0.05s].
## checking subsets of size 1 2 3 4 done [0.01s].
## writing ... [416 rule(s)] done [0.00s].
## creating S4 object ... done [0.02s].
#inspect(secondRules[1:10, ])
inspect(sort(secondRules, by="lift")[1:10, ])
## lhs rhs support confidence lift
## 1 {02G,
## 08A,
## 10F} => {04M} 0.007243653 0.2885027 3.231187
## 2 {06C,
## 08A} => {04M} 0.014608145 0.2809555 3.146659
## 3 {02B,
## 08A} => {04M} 0.014151640 0.2695308 3.018704
## 4 {08A,
## 08C} => {04L} 0.012057090 0.7034861 2.955574
## 5 {08A,
## 10F} => {04M} 0.025154742 0.2630581 2.946211
## 6 {04H,
## 08A} => {04N} 0.014594718 0.4144900 2.941337
## 7 {04L,
## 04N,
## 06C} => {08A} 0.006686449 0.9146006 2.914972
## 8 {04L,
## 04N,
## 06D} => {08A} 0.006518616 0.9134525 2.911313
## 9 {04L,
## 04M,
## 06C} => {08A} 0.006344070 0.9104046 2.901599
## 10 {04L,
## 06C,
## 06D} => {08A} 0.008311068 0.9082905 2.894861
Let’s parse through this a bit:
Support is the fraction of transactions that contain both sides of the rule. So, for the first rule, the share of transactions that contain 02G, 08A, 10F, and 04M together is very small. We have roughly 149,000 transactions, and a quick product (0.00724 × 148,958) shows that about 1,080 transactions contained this combination of violations.
Confidence is a measure of how often the items on the right-hand side (rhs) appear in transactions that contain the items on the left-hand side (lhs). For our first rule, we know that 04M appears in transactions with 02G, 08A, and 10F about 29% of the time.
Lift is the support of the full itemset divided by the product of the supports of the left- and right-hand sides (equivalently, the rule’s confidence divided by the support of the right-hand side). It measures how much more often the two sides occur together than a random, independent pairing would predict.
For our first rule, the lift of about 3.2 means that 04M shows up alongside 02G, 08A, and 10F roughly three times more often than it would if the two sides were independent.
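To make these definitions concrete, here is a small sketch that recomputes the lift of the first rule from its confidence and the standalone frequency of 04M, and converts its support back into an absolute transaction count; the numbers should match the table above up to rounding.
suppRHS <- itemFrequency(aList)["04M"]      # support of the right-hand side on its own
ruleQ <- quality(sort(secondRules, by = "lift")[1, ])
ruleQ$confidence / suppRHS                  # lift = confidence / support(rhs)
ruleQ$support * length(aList)               # number of transactions containing all four items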
We can create a tighter version of our list by raising the minimum support threshold to 0.08. This leaves us with a list of 11 rules.
thirdRules <- apriori(aList, parameter=list(support=0.08,
confidence=0.25,
minlen=2))
##
## Parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## 0.25 0.1 1 none FALSE TRUE 0.08 2 10
## target ext
## rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[79 item(s), 148958 transaction(s)] done [0.03s].
## sorting and recoding items ... [12 item(s)] done [0.00s].
## creating transaction tree ... done [0.05s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [11 rule(s)] done [0.00s].
## creating S4 object ... done [0.02s].
#inspect(thirdRules[1:10, ])
inspect(sort(thirdRules, by="lift")[1:11, ])
## lhs rhs support confidence lift
## 1 {04L} => {08A} 0.19168490 0.8053307 2.5667127
## 2 {08A} => {04L} 0.19168490 0.6109292 2.5667127
## 3 {08A} => {04N} 0.10550625 0.3362646 2.3862278
## 4 {04N} => {08A} 0.10550625 0.7487018 2.3862278
## 5 {10B} => {10F} 0.08123095 0.4458200 1.0668024
## 6 {06D} => {10F} 0.08715880 0.4326080 1.0351876
## 7 {02G} => {08A} 0.08307040 0.2932089 0.9345017
## 8 {08A} => {02G} 0.08307040 0.2647581 0.9345017
## 9 {02G} => {10F} 0.10520415 0.3713331 0.8885629
## 10 {10F} => {02G} 0.10520415 0.2517430 0.8885629
## 11 {08A} => {10F} 0.09562427 0.3047692 0.7292822
Eight times out of ten, if you’ve got a 04L, you’re going to get an 08A
This could be useful information for investigators and restaurateurs alike.
If you know you’ve got improperly constructed food contact surfaces, you might also want to self-regulate and fix the other commonly occurring violations that usually accompany them.
Some of the most impactful rules included:
1: 04L -> 08A: “Evidence of mice or live mice present in facility’s food and/or non-food areas.”
> "Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist"
2: 08A -> 04N: “Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist”
> "Filth flies or food/refuse/sewage-associated (FRSA) flies in facility’s food and/or non-food areas."
3: 10B -> 10F: “Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly.”
> "Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit."
4: 02G -> 08A: “Cold food item held above 41°F (smoked fish and Reduced Oxygen Packaged food above 38°F), except during necessary preparation.”
> "Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist"
5: 02G -> 10F: “Cold food item held above 41°F (smoked fish and Reduced Oxygen Packaged food above 38°F), except during necessary preparation.”
> "Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit."
6: 08A -> 10F: “Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist”
> "Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit."
There are a few great visualizations for association rules in the arulesViz package.
This two-key plot allows us to visualize the three parameters of interest: lift, support, and confidence, all at once. It is great for spotting outliers and identifying clusters of activity.
plot(secondRules, shading="lift", control=list(main = "Two-key plot of 416 Rules", col=sequential_hcl(20)))
The code for this next plot displays a great interactive plot, but unfortunately it only works within the R environment. It allows you to visually inspect the above plot and manually select the rules you wish to examine. It then provides the summary statistics for those rules.
sel <- plot(secondRules, measure=c("support", "lift"), shading="confidence", interactive=TRUE)
If we plot a tighter set of rules, we can represent it as an association rules matrix to get a sense of the distribution of the rules.
fourthRules <- apriori(aList, parameter=list(support=0.02,
confidence=0.25,
minlen=2))
##
## Parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## 0.25 0.1 1 none FALSE TRUE 0.02 2 10
## target ext
## rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[79 item(s), 148958 transaction(s)] done [0.03s].
## sorting and recoding items ... [29 item(s)] done [0.01s].
## creating transaction tree ... done [0.06s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [81 rule(s)] done [0.00s].
## creating S4 object ... done [0.02s].
#inspect(fourthRules[1:10, ])
inspect(sort(fourthRules, by="lift")[1:11, ])
## lhs rhs support confidence lift
## 1 {08A,
## 10F} => {04M} 0.02515474 0.2630581 2.946211
## 2 {02G,
## 08A} => {04M} 0.02126774 0.2560207 2.867393
## 3 {04L,
## 04N} => {08A} 0.03098860 0.8699585 2.772692
## 4 {04L,
## 04M} => {08A} 0.02436257 0.8632255 2.751232
## 5 {06C,
## 08A} => {04L} 0.03345238 0.6433828 2.703061
## 6 {04L,
## 06C} => {08A} 0.03345238 0.8464413 2.697739
## 7 {04L,
## 10B} => {08A} 0.03507029 0.8448973 2.692818
## 8 {04L,
## 10F} => {08A} 0.05610306 0.8444826 2.691496
## 9 {06D,
## 08A} => {04L} 0.03517099 0.6375806 2.678684
## 10 {04L,
## 06D} => {08A} 0.03517099 0.8350335 2.661380
## 11 {02B,
## 08A} => {04L} 0.03282133 0.6251119 2.626299
plot(fourthRules, method = "grouped", control = list(k=80, col=sequential_hcl(5)))
We can visually see that 04L and 04N tend to gravitate toward 08A, which is what our analysis above confirms.