As people and organizations start to recognize the benefits of open data, we find ourselves able to answer some of the easy questions that may have been obscured in the past. However, I think the true value of open data lies not in our ability to wrangle data and spit out answers to the easy questions. The real value comes when we begin to ask second-order questions: the ones you didn’t know you could ask until you started to understand the data. There are often many ways to interpret data, and I view this as a creative endeavor. My goal for this project was to find some truths within my data that were not readily apparent at first glance. I would like to show how a little data mining can have real-world implications beyond the easy questions.

I would like to start the blog post with a visualization I created in CartoDB, using lat/lon coordinates calculated with the Bing geocoding API. The points represent grades handed out by the Department of Health and Mental Hygiene over the past five years. All of the code can be found about halfway down the post. It should be noted that this visualization is not meant to convey a specific message; it serves as more of a pretty proof of concept.

If you would like to see my video presentation on this topic it can be viewed here.
require(data.table)
require(dplyr)
require(lattice)
require(gdata)
require(ggplot2)
require(taRifx.geo)
require(tidyr)
require(arules)
require(arulesViz)
require(colorspace)
## Warning: package 'colorspace' was built under R version 3.1.3
require(foreach)

For my first data science project, I worked with five years of data from the NYC Restaurant Inspection Results data set. This data is created by the NYC Department of Health and Mental Hygiene. Within it you’ll find all of the inspections, violations, and grades for roughly 26,000 restaurants in New York City.

If you’re not familiar with the NYC restaurant inspection grading system, the basic idea is that inspectors periodically visit restaurants and count up the violations. Each violation carries a certain number of points; the points are tallied, and the restaurant receives a grade, which it is ultimately required to display at its entrance.

Here is the score breakdown:

##  Points: Grade:
##  <14     A     
##  14-27   B     
##  >27     C
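
As a quick illustration of that mapping, here is a minimal sketch using base R’s cut(), the same binning approach I use later when estimating hypothetical grades:

scores <- c(5, 13, 14, 27, 28, 40)   # a few example point totals
cut(scores, breaks = c(0, 13, 27, 200), labels = c("A", "B", "C"), include.lowest = TRUE)
# -> A A B B C C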

Let’s get started…

Step 1: Read in the data.

inspectionDF <- fread("DOHMH_New_York_City_Restaurant_Inspection_Results.csv", header = T, 
                      sep=',', na.strings=c("NA","N/A",""), colClasses = "character")
## 
Read 0.0% of 511830 rows
Read 511830 rows and 18 (of 18) columns from 0.178 GB file in 00:00:04

I then converted the data frame into a dplyr tbl for easier manipulation using the dplyr package.

inspecttbl<-tbl_df(inspectionDF)

It is always a good idea to know what your data looks like. glimpse gives us a sneak peek at the entries in each column.

glimpse(inspecttbl)
## Observations: 511830
## Variables:
## $ CAMIS                 (chr) "30075445", "30075445", "30075445", "300...
## $ DBA                   (chr) "MORRIS PARK BAKE SHOP", "MORRIS PARK BA...
## $ BORO                  (chr) "BRONX", "BRONX", "BRONX", "BRONX", "BRO...
## $ BUILDING              (chr) "1007", "1007", "1007", "1007", "1007", ...
## $ STREET                (chr) "MORRIS PARK AVE", "MORRIS PARK AVE", "M...
## $ ZIPCODE               (chr) "10462", "10462", "10462", "10462", "104...
## $ PHONE                 (chr) "7188924968", "7188924968", "7188924968"...
## $ CUISINE DESCRIPTION   (chr) "Bakery", "Bakery", "Bakery", "Bakery", ...
## $ INSPECTION DATE       (chr) "02/09/2015", "03/03/2014", "10/10/2013"...
## $ ACTION                (chr) "Violations were cited in the following ...
## $ VIOLATION CODE        (chr) "06C", "10F", NA, "04L", "04N", "04C", "...
## $ VIOLATION DESCRIPTION (chr) "Food not protected from potential sourc...
## $ CRITICAL FLAG         (chr) "Critical", "Not Critical", "Not Applica...
## $ SCORE                 (chr) "6", "2", NA, "6", "6", "32", "32", "32"...
## $ GRADE                 (chr) "A", "A", NA, "A", "A", NA, NA, NA, NA, ...
## $ GRADE DATE            (chr) "02/09/2015", "03/03/2014", NA, "09/11/2...
## $ RECORD DATE           (chr) "02/12/2015", "02/12/2015", "02/12/2015"...
## $ INSPECTION TYPE       (chr) "Cycle Inspection / Initial Inspection",...

Now it is time to start tidying up my data. One of the first things I did was convert the date information into Date class objects that R can work with, using the as.Date function.

inspecttbl$"INSPECTION DATE"<-(as.Date(inspecttbl$'INSPECTION DATE', "%m/%d/%Y"))
inspecttbl$"GRADE DATE"<-(as.Date(inspecttbl$'GRADE DATE', "%m/%d/%Y"))
inspecttbl$"RECORD DATE"<-(as.Date(inspecttbl$'RECORD DATE', "%m/%d/%Y"))
class(inspecttbl$'RECORD DATE')
## [1] "Date"

While we are at it, let’s set the Grades as ordered factors

inspecttbl$GRADE <- factor(inspecttbl$GRADE, levels = c("A", "B", "C", "Z", "Not Yet Graded"), ordered = TRUE)
(levels(inspecttbl$GRADE))
## [1] "A"              "B"              "C"              "Z"             
## [5] "Not Yet Graded"

As well as the Cuisine Descriptions and Violation Codes and Descriptions

inspecttbl$`CUISINE DESCRIPTION` <- factor(inspecttbl$`CUISINE DESCRIPTION`)
#inspecttbl$`VIOLATION CODE` <- factor(inspecttbl$`VIOLATION CODE`)
#inspecttbl$`VIOLATION DESCRIPTION` <- factor(inspecttbl$`VIOLATION DESCRIPTION`)

head(levels(inspecttbl$`CUISINE DESCRIPTION`))
## [1] "Afghan"     "African"    "American "  "Armenian"   "Asian"     
## [6] "Australian"
#levels(inspecttbl$`VIOLATION CODE`)
#levels(inspecttbl$`VIOLATION DESCRIPTION`)

Let’s set the Critical flag as an ordered factor

inspecttbl$`CRITICAL FLAG` <-   factor(inspecttbl$`CRITICAL FLAG`, levels = c("Critical", "Not Critical", "Not Applicable"), ordered = T)
levels(inspecttbl$`CRITICAL FLAG`)
## [1] "Critical"       "Not Critical"   "Not Applicable"

Let’s now set the Identifier and scores as numeric vectors

inspecttbl$CAMIS <- as.numeric(inspecttbl$CAMIS)
inspecttbl$SCORE <- as.numeric(inspecttbl$SCORE)
class(inspecttbl$CAMIS)
## [1] "numeric"
class(inspecttbl$SCORE)
## [1] "numeric"
The next step is to explore the data and clean it up when possible.

The number of distinct restaurants

n_distinct(inspecttbl$CAMIS)
## [1] 25396

Next, let’s take a look at how far back these inspections go.

range(inspecttbl$`INSPECTION DATE`)
## [1] "1900-01-01" "2015-02-11"

That’s strange. This brings me to my next point: documentation. The readme file can sometimes be your best friend; I wouldn’t have been able to work with this data at all had there been no documentation.

When you look at the documentation, you’ll see that the January 1st, 1900 date is used as a marker for restaurants that haven’t been inspected yet.

So, let’s remove the rows in which the date is listed as “1900-01-01”.

inspecttbl<-filter(inspecttbl, `INSPECTION DATE` != "1900-01-01")

Great, this cuts out 797 entries. Let’s see what our range looks like now

range(inspecttbl$`INSPECTION DATE`)
## [1] "2008-09-18" "2015-02-11"

That’s more like it.

If we take another look at the documentation, we’ll see that some of these entries do not provide the information we are looking for. For instance, some of the inspection types inappropriately color our data. Trans Fat and Calorie Posting inspections shouldn’t count because they are not gradable and are often performed before a scheduled inspection, so the establishments were not specifically prepared for a food safety inspection. Compliance, Special Program, and Administrative violations similarly color the data and need to be excluded.

At the end of the day, there are only four gradable inspection types. Since we are interested in the report card data, we can keep only the rows that contain gradable types. Those types are:

Cycle Inspection / Initial Inspection,
Cycle Inspection / Re-inspection,
Pre-permit (Operational) / Initial Inspection,
Pre-permit (Operational) / Re-inspection

Now, before filtering for these four types, let’s make sure there are no subtle variations in their spelling; otherwise our filter would unnecessarily exclude those variations. We can use the levels and as.factor functions to facilitate this inquiry.

levels(as.factor(inspecttbl$`INSPECTION TYPE`)) 
##  [1] "Administrative Miscellaneous / Compliance Inspection"       
##  [2] "Administrative Miscellaneous / Initial Inspection"          
##  [3] "Administrative Miscellaneous / Re-inspection"               
##  [4] "Administrative Miscellaneous / Reopening Inspection"        
##  [5] "Administrative Miscellaneous / Second Compliance Inspection"
##  [6] "Calorie Posting / Compliance Inspection"                    
##  [7] "Calorie Posting / Initial Inspection"                       
##  [8] "Calorie Posting / Re-inspection"                            
##  [9] "Calorie Posting / Second Compliance Inspection"             
## [10] "Cycle Inspection / Compliance Inspection"                   
## [11] "Cycle Inspection / Initial Inspection"                      
## [12] "Cycle Inspection / Re-inspection"                           
## [13] "Cycle Inspection / Reopening Inspection"                    
## [14] "Cycle Inspection / Second Compliance Inspection"            
## [15] "Inter-Agency Task Force / Initial Inspection"               
## [16] "Inter-Agency Task Force / Re-inspection"                    
## [17] "Pre-permit (Non-operational) / Compliance Inspection"       
## [18] "Pre-permit (Non-operational) / Initial Inspection"          
## [19] "Pre-permit (Non-operational) / Re-inspection"               
## [20] "Pre-permit (Non-operational) / Second Compliance Inspection"
## [21] "Pre-permit (Operational) / Compliance Inspection"           
## [22] "Pre-permit (Operational) / Initial Inspection"              
## [23] "Pre-permit (Operational) / Re-inspection"                   
## [24] "Pre-permit (Operational) / Reopening Inspection"            
## [25] "Pre-permit (Operational) / Second Compliance Inspection"    
## [26] "Smoke-Free Air Act / Complaint (Initial Inspection)"        
## [27] "Smoke-Free Air Act / Compliance Inspection"                 
## [28] "Smoke-Free Air Act / Initial Inspection"                    
## [29] "Smoke-Free Air Act / Limited Inspection"                    
## [30] "Smoke-Free Air Act / Re-inspection"                         
## [31] "Smoke-Free Air Act / Second Compliance Inspection"          
## [32] "Trans Fat / Compliance Inspection"                          
## [33] "Trans Fat / Initial Inspection"                             
## [34] "Trans Fat / Re-inspection"                                  
## [35] "Trans Fat / Second Compliance Inspection"

It looks like these factors are consistently labeled, especially the ones we want, so let’s focus our data on the gradable inspection types.

gradables <- c("Cycle Inspection / Initial Inspection" , "Cycle Inspection / Re-inspection", 
               "Pre-permit (Operational) / Initial Inspection", "Pre-permit (Operational) / Re-inspection")

inspecttbl <- (filter(inspecttbl, `INSPECTION TYPE` %in% gradables))

Let’s confirm that the filter worked as anticipated

levels(as.factor(inspecttbl$`INSPECTION TYPE`))
## [1] "Cycle Inspection / Initial Inspection"        
## [2] "Cycle Inspection / Re-inspection"             
## [3] "Pre-permit (Operational) / Initial Inspection"
## [4] "Pre-permit (Operational) / Re-inspection"

Great - we can see that what we are left with are only the rows that contain gradable inspection types.

We still have approximately 461,000 observations. This is good.

However, if we look into these observations, we see that not every entry records a grade. Only a fraction of these observations have usable grades, since each violation was given its own row; when a restaurant gets an inspection, only some of the rows associated with that inspection date might carry a grade. Let’s take a look at what happens when we filter our results for the rows that contain actual grades.

grades <- c("A", "B", "C")
nrow(filter(inspecttbl, GRADE %in% grades))
## [1] 216868

We see that we have about 217,000 entries with actual grades in them. Not bad.

When we take a look at our data table, it can be difficult to get a sense of the big picture on a per-restaurant basis, since there is almost too much information: each violation is labeled as its own observation, making the data difficult to read. So let’s create a new data frame, one that compounds the violations into one observation per restaurant per inspection day. To do that, I used the select function to remove superfluous columns and the distinct function to aggregate rows with identical information.

shrunkenData <- (distinct(select(inspecttbl, CAMIS, DBA, BORO, BUILDING, STREET, ZIPCODE, `CUISINE DESCRIPTION`, `INSPECTION DATE`, ACTION, SCORE, GRADE, `INSPECTION TYPE`)))

We can verify that each inspection date has one observation by peeking into the data.

shrunkenData$`INSPECTION DATE`[1:20]
##  [1] "2015-02-09" "2014-03-03" "2013-09-11" "2013-08-14" "2013-01-24"
##  [6] "2012-12-31" "2014-12-30" "2014-11-13" "2014-07-01" "2014-06-05"
## [11] "2013-04-30" "2012-05-08" "2014-09-06" "2013-07-22" "2012-07-31"
## [16] "2011-12-29" "2011-12-15" "2014-06-10" "2013-06-05" "2012-04-13"

A quick nrow shows us that I have about 149,000 unique inspections in my data now.

nrow(shrunkenData)

Now would be a good time to take a look at our data and ask which restaurants posted the absolute worst scores but were somehow allowed to stay open to receive a grade:

print(worstOfList <- arrange(filter(shrunkenData, GRADE %in% grades), desc(SCORE))[1:20,c(2,10)])
## Source: local data frame [20 x 2]
## 
##                                 DBA SCORE
## 1          MURALS ON 54/RANDOLPHS'S   131
## 2                      BELLA NAPOLI    98
## 3             BALUCHI'S INDIAN FOOD    98
## 4                            GANDHI    92
## 5               CONCRETE RESTAURANT    90
## 6  WEST 79TH STREET BOAT BASIN CAFE    89
## 7                  D & Y RESTAURANT    86
## 8                     SPICY SHALLOT    84
## 9                   BISTRO CATERERS    84
## 10                           CAFE R    82
## 11           LA POTENCIA RESTAURANT    82
## 12                           ANELLA    81
## 13                       B.B. KINGS    80
## 14                     LA TRATTORIA    79
## 15             LEXINGTON RESTAURANT    79
## 16                        MARS CAFE    78
## 17                    GAL BI MA EUL    78
## 18    CHEIKH UMAR FUTIYU RESTAURANT    78
## 19     FORTUNATO BROS CAFE & BAKERY    77
## 20                   MIDTOWN BUFFET    77

The next plot shows our Worst of the Worst restaurants in graphical form, with the minimum thresholds for B’s and C’s labeled as yellow and red lines, respectively.

worstOfList$DBA <- factor(worstOfList$DBA, levels = rev(worstOfList$DBA))
ggplot(worstOfList, aes(x = `DBA`, y = `SCORE`)) +
  geom_bar(stat = 'identity', aes(fill = SCORE)) +
  scale_fill_gradient(
    low = "lightgreen",
    high = "green3", guide = 'none') +
  geom_hline(yintercept = 14, color = "yellow") +
  geom_hline(yintercept = 27, color = "red") +
  theme(text = element_text(size=10),
       axis.text.x = element_text(angle=45, vjust=1, color = "black")) +
  labs(title="Worst of the Worst")

In order to explore the data a bit more, I thought it would be interesting to look at the distribution of grades when restaurants are grouped together by cuisine.

groupCuisine <- group_by(shrunkenData, `CUISINE DESCRIPTION`)
print(cuisineSummary <- summarize(groupCuisine,
    "A's" = sum(GRADE == "A", na.rm = T)/sum(GRADE == "A" | GRADE == "B" | GRADE == "C", na.rm = T),
    "B's" = sum(GRADE == "B", na.rm = T)/sum(GRADE == "A" | GRADE == "B" | GRADE == "C", na.rm = T),
    "C's" = sum(GRADE == "C", na.rm = T)/sum(GRADE == "A" | GRADE == "B" | GRADE == "C", na.rm = T),
    "Totals" = sum(GRADE == "A" | GRADE == "B" | GRADE == "C", na.rm = T)
    ))
## Source: local data frame [84 x 5]
## 
##    CUISINE DESCRIPTION       A's        B's         C's Totals
## 1               Afghan 0.8636364 0.11363636 0.022727273     44
## 2              African 0.7208333 0.21666667 0.062500000    240
## 3            American  0.8615154 0.11603599 0.022448620  22674
## 4             Armenian 0.9230769 0.06593407 0.010989011     91
## 5                Asian 0.7355534 0.23016650 0.034280118   1021
## 6           Australian 0.9767442 0.02325581 0.000000000     43
## 7      Bagels/Pretzels 0.8871951 0.09756098 0.015243902    656
## 8               Bakery 0.8330888 0.14636831 0.020542920   2726
## 9          Bangladeshi 0.7000000 0.29090909 0.009090909    110
## 10            Barbecue 0.9069767 0.06976744 0.023255814    172
## ..                 ...       ...        ...         ...    ...

Here are the top 20 Cuisines with the highest proportion of A’s

arrange(cuisineSummary, desc(`A's`))[1:20,]
## Source: local data frame [20 x 5]
## 
##                CUISINE DESCRIPTION       A's        B's         C's Totals
## 1                      Californian 1.0000000 0.00000000 0.000000000      2
## 2                          Chilean 1.0000000 0.00000000 0.000000000      1
## 3                     Creole/Cajun 1.0000000 0.00000000 0.000000000      3
## 4                         Hawaiian 1.0000000 0.00000000 0.000000000      6
## 5                       Australian 0.9767442 0.02325581 0.000000000     43
## 6        Not Listed/Not Applicable 0.9714286 0.02857143 0.000000000     35
## 7                            Other 0.9594272 0.03818616 0.002386635    419
## 8                           Donuts 0.9532863 0.04073873 0.005975014   1841
## 9  Ice Cream, Gelato, Yogurt, Ices 0.9521436 0.04087737 0.006979063   1003
## 10                          Salads 0.9508197 0.04918033 0.000000000    122
## 11               Fruits/Vegetables 0.9473684 0.05263158 0.000000000     19
## 12  Juice, Smoothies, Fruit Salads 0.9449794 0.04951857 0.005502063    727
## 13                      Sandwiches 0.9416717 0.05171377 0.006614552   1663
## 14              Soups & Sandwiches 0.9411765 0.04705882 0.011764706    170
## 15                Hotdogs/Pretzels 0.9400000 0.06000000 0.000000000     50
## 16                 Café/Coffee/Tea 0.9373446 0.05469915 0.007956241   4022
## 17                       Ethiopian 0.9322034 0.06779661 0.000000000     59
## 18  Sandwiches/Salads/Mixed Buffet 0.9273183 0.06015038 0.012531328    798
## 19                      Hamburgers 0.9255014 0.06074499 0.013753582   1745
## 20                        Armenian 0.9230769 0.06593407 0.010989011     91

Here are the top 20 Cuisines with the highest proportion of C’s

arrange(cuisineSummary, desc(`C's`))[c(1:20),]
## Source: local data frame [20 x 5]
## 
##                                                 CUISINE DESCRIPTION
## 1                                                           Iranian
## 2                                                  Chinese/Japanese
## 3                                                     Chinese/Cuban
## 4                                                           African
## 5                                                            Korean
## 6                                                  Eastern European
## 7                                                         Pakistani
## 8                                                            Indian
## 9  Latin (Cuban, Dominican, Puerto Rican, South & Central American)
## 10                                                          English
## 11                                                         Peruvian
## 12                                                             Thai
## 13                                                        Brazilian
## 14                                                     Delicatessen
## 15                                                     Scandinavian
## 16                                                          Russian
## 17                                                          Spanish
## 18                                                       Indonesian
## 19                                                          Mexican
## 20                                                         Japanese
## Variables not shown: A's (dbl), B's (dbl), C's (dbl), Totals (int)

It turns out there is another wrinkle in the grading system. As a sort of mercy rule, the city’s policy on periodic initial inspections is that restaurants do not receive a grade lower than an A upon their initial inspection. Here is how it works: if an establishment scores fewer than 14 points on its initial inspection, it gets its A and everyone is happy.

However, if the restaurant scores in the B or C range on its initial inspection, that score is not reported, and it is given a grace period of one week to fix its violations before the re-inspection. Whatever score it receives upon re-inspection then becomes the score that sticks.

When you think about it, this second chance system actually seems like a well thought out policy meant to give restaurants a second opportunity to earn an A. Rather than manipulating the grading system as a means to generate revenue, this type of regulation seems well suited toward actually raising the standards for health and food safety for the benefit of New Yorkers.

What this means for people looking at the data is that any proportional analysis of the grades must be done without initial inspection scores since the only grades that are recorded for the initial inspection are either A’s or NA’s.
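
Here is a quick sanity check of that claim, as a sketch; it simply tabulates the grades recorded on initial inspections in shrunkenData:

initials <- filter(shrunkenData,
                   `INSPECTION TYPE` %in% c("Cycle Inspection / Initial Inspection",
                                            "Pre-permit (Operational) / Initial Inspection"))
table(initials$GRADE, useNA = "ifany")   # should show only A's and NA's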

However, if we only look at re-inspection scores, then we are effectively discarding all of the A’s that were awarded during initial inspections. So, caveats are necessary when looking at summary statistics.

Let’s look at a summary of all final grades:

summary(shrunkenData$GRADE)
##              A              B              C              Z Not Yet Graded 
##          74815          12276           2289           1354            191 
##           NA's 
##          58035

Summary of all reinspections that resulted in a final score:

summary(filter(shrunkenData, GRADE %in% grades, `INSPECTION TYPE` != "Cycle Inspection / Initial Inspection", `INSPECTION TYPE` != "Pre-permit (Operational) / Initial Inspection" )$GRADE)
##              A              B              C              Z Not Yet Graded 
##          39530          12276           2289              0              0

This means that only about 35,000 A’s were earned on the first try.
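
That figure is just the difference between the two summaries above:

74815 - 39530   # = 35285, the A's awarded at the initial inspection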

To see the ramifications of this rule, let’s first look at a bar chart of all of the grades of restaurants allowed to stay open by the DOHMH

filter1 <- filter(shrunkenData, GRADE %in% grades)

Let’s reset the factors and plot the results

filter(shrunkenData, GRADE %in% grades)$GRADE%>%
  factor(levels = c("A", "B", "C"), ordered= T) %>%
    table()%>%
      barchart(horizontal = F, main = "Frequency of Grades for Establishments Allowed to Stay Open")

So, if you want to know the overall distribution of grades among open restaurants: chances are that you’ll find an A-rated restaurant without too much work.
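
For reference, the proportions behind that chart can be computed directly from the grade counts in the earlier summary (a quick sketch):

gradeCounts <- c(A = 74815, B = 12276, C = 2289)
round(prop.table(gradeCounts), 3)   # proportion of each final grade among A/B/C inspections

This works out to roughly 84% A’s, 14% B’s, and 2.6% C’s.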

Now this is where things get interesting.

If we want to see what this chart would look like without the city’s two-strikes policy, we can estimate the grades based upon the unpublished results of the initial inspections and recompute this chart using the new theoretical grades.

First we filter on the initial inspections, then pull out the score data.

firstStrike <- select(filter(shrunkenData, `INSPECTION TYPE`=="Cycle Inspection / Initial Inspection" | `INSPECTION TYPE` == "Pre-permit (Operational) / Initial Inspection"), CAMIS, SCORE, `INSPECTION DATE`)

Then we bin the results into grade types and mutate the hypothetical grades onto the table of scores.

GRADEHYP <- cut(firstStrike$SCORE, breaks = c(0, 13, 27, 200), 
                   labels = c("A", "B", "C"), include.lowest = T)

firstGrades <- (mutate (firstStrike,GRADEHYP))
shrunkenData <- left_join(shrunkenData, firstGrades, by = c("CAMIS", "SCORE", "INSPECTION DATE"))

Now we plot using the lattice barchart and get VERY different results.

firstGrades$GRADEHYP %>%
  factor(levels = c("A", "B", "C"), ordered= T) %>%
    table()%>%
      barchart(horizontal = F, main = "Frequency of Estimated Grades upon Initial Inspection")

So if they actually gave out grades on their initial inspections, this is what it would look like.

It really lets you know how much an establishment can shape up once it has been given a warning.

This led me to my next question: how far are these restaurants falling each year, from the date of their A inspection to their next initial inspection?

In a sense, this is the information that gets to the heart of the regulation. If restaurants consistently score below their A rating but magically turn things around for their second try (their re-inspection), then that A rating really isn’t a very good metric for the kind of conditions you can expect from that establishment, and whatever practices they use to score an A on their re-inspection are not what they practice throughout the year.

So, in order to find the restaurants that have figured out how to game the system, let’s first pull all of the A’s that were handed out.

justAs <- filter(shrunkenData, GRADE =="A")

table(justAs$`INSPECTION TYPE`)
## 
##         Cycle Inspection / Initial Inspection 
##                                         32392 
##              Cycle Inspection / Re-inspection 
##                                         36180 
## Pre-permit (Operational) / Initial Inspection 
##                                          2893 
##      Pre-permit (Operational) / Re-inspection 
##                                          3350

We can see that there are more re-inspections than initial inspections among the A’s. The establishments that earned an A on their initial inspection are not the focus of the current inquiry. What I am interested in is the average amount that restaurants fall after receiving their A rating.

By looking at the average difference between a restaurant’s A score and their next score, we can get a metric for the overall efficacy of the law itself.

system.time({
diff = vector()
pair1 = vector()
pair2 = vector()
for (i in 1:(nrow(shrunkenData)-1)){
  if (!is.na(shrunkenData[i,]$GRADE) & shrunkenData[i,]$GRADE == 'A'){  # only rows that received an A
    if (shrunkenData[i,]$CAMIS == shrunkenData[i+1,]$CAMIS){
      diff <- append(diff, shrunkenData[i,]$SCORE - shrunkenData[i+1,]$SCORE)
      pair1 <- append(pair1, shrunkenData[i,]$SCORE)
      pair2 <- append(pair2, shrunkenData[i+1,]$SCORE)
    }
  }
}
})
save(diff, file = "diff.bin")
save(pair1, file = "pair1.bin")
save(pair2, file = "pair2.bin")
#should take an hour
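
For what it’s worth, a vectorized sketch using dplyr’s lead() should give essentially the same numbers in seconds rather than an hour. It assumes, as the loop does, that rows for a given restaurant are adjacent and ordered with the most recent inspection first.

# Vectorized version (sketch) of the loop above: for every A-graded row, compare its
# score with the score on the following row for the same restaurant.
vecPairs <- shrunkenData %>%
  group_by(CAMIS) %>%
  mutate(nextScore = lead(SCORE)) %>%
  ungroup() %>%
  filter(!is.na(GRADE), GRADE == "A", !is.na(nextScore))
mean(vecPairs$SCORE - vecPairs$nextScore)   # should land near the loop's mean of about -8.36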

If we take the mean of the differences, we get a result of -8.35.

Unfortunately this result is not quite as conclusive as I had hoped it would be.

summary(diff)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -146.000  -14.000   -6.000   -8.359    0.000   14.000
boxplot(diff, horizontal = T)

hist(x = diff, col = "green3", main = "Histogram of Score Drops", xlab = "Score Drop")
abline(v = mean(diff), lwd = 4)
abline(v = 0, lwd= 4)

sd(diff) 
## [1] 11.62465
t.test(pair1, pair2, alternative = "two.sided", paired = TRUE)
## 
##  Paired t-test
## 
## data:  pair1 and pair2
## t = -183.4119, df = 65055, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.448512 -8.269854
## sample estimates:
## mean of the differences 
##               -8.359183

The paired t-test tells us that restaurants clearly lower their standards after receiving an A. However, the mean difference of -8 is really not all that terrible. It is entirely possible to score an A on one review, drop 8 points and still receive an A on your next review. If the mean had been significantly lower, it would have been easier to make a conclusion about the overall efficacy of the regulations. As such, I would not suggest that the regulations, as crafted, require revision.

Visualizing the data

I wanted to represent the locations of the restaurants overlaid onto a map of the city. There are a number of questions you can answer by having a visual representation of the data, but the first step was to take the condensed address data and get a latitude and longitude for each establishment. Step one was to fold the data into a data frame containing only the unique addresses for which we needed coordinate information, along with their CAMIS IDs.

When we look at the data, we see that a number of addresses have leading and trailing whitespace, so we use the trim function from the gdata package to clean them up before handing them to our geocoding API. Next, we paste the relevant fields together into one string to feed to the API, then drop the duplicates, since we want as few calls to the API as possible.

allAddresses <- unique(cbind(shrunkenData$CAMIS, paste(paste(paste(paste(trim(shrunkenData$BUILDING), 
                      trim(shrunkenData$STREET), sep = " "), trim(shrunkenData$BORO), sep = ", "),
                      "NY", sep = " "), trim(shrunkenData$ZIPCODE), sep = " ")))

This leaves us with 24,311 entries. Google Maps has a geocoding API, but its daily limit is capped at 2,500 requests, so I chose to go with Bing for my geocoding needs; it has a higher limit.

At the time of this writing, CRAN did not contain a version of taRifx.geo that was compatible with Bing searches. In order to get it to work, I had to install gdal from Homebrew (brew install gdal), then install the rgeos and rgdal packages from source, and then install the new taRifx.geo package from GitHub using these commands:

install.packages('rgeos', type='source')
install.packages('rgdal', type='source')
devtools::install_github("gsk3/taRifx.geo")

Note: this is on OS X Yosemite.

Let’s start geocoding the addresses. I split them up into manageable chunks to make it easier to keep track of them if I hit the daily limits.

Split the rows

first <- allAddresses[1:2000,2]
second <- allAddresses[2001:10000,2]
third <- allAddresses[10001:20000,2]
fourth <- allAddresses[20001:24311,2]
# geocode the vectors
#options(BingMapsKey="Aul-REDACTED")
#coord1 = lapply(first, function(x) geocode(x, service = 'bing', returntype = 'coordinates'))
#coord2 = lapply(second, function(x) geocode(x, service = 'bing', returntype = 'coordinates'))
#system.time({coord3 = lapply(third, function(x) geocode(x, service = 'bing', returntype = 'coordinates'))})
#coord4 = lapply(fourth, function(x) geocode(x, service = 'bing', returntype = 'coordinates'))

Since the data returned from the geocode function is represented as two rows of multiple columns, we need to reshape it into a two-column data frame, then put it back together in order.

temp1 <-(gather(data.frame(coord1)))
odds <- seq(1, by=2, len = 2000)
temp1 <- (data.frame(cbind(temp1[odds,2], temp1[odds+1, 2])))

temp2 <-(gather(data.frame(coord2)))
odds <- seq(1, by=2, len = 8000)
temp2 <- (data.frame(cbind(temp2[odds,2], temp2[odds+1, 2])))

temp3 <-(gather(data.frame(coord3)))
odds <- seq(1, by=2, len = 10000)
temp3 <- (data.frame(cbind(temp3[odds,2], temp3[odds+1, 2])))

temp4 <-(gather(data.frame(coord4)))
odds <- seq(1, by=2, len = 4311)
temp4 <- (data.frame(cbind(temp4[odds,2], temp4[odds+1, 2])))

coords <- bind_rows(temp1, temp2, temp3, temp4)
colnames(coords)<- (c("lat", "lon"))

write.csv(file = "icoords.csv", coords)
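
As an aside, the four nearly identical reshaping blocks above could be collapsed into a small helper function. Here is a sketch; coordsToDF is a name I made up, and the commented call assumes the four coord objects from the geocoding step.

# Reshape one chunk of geocode results into a two-column data frame.
# Mirrors the gather()/odd-row trick used above.
coordsToDF <- function(coordChunk) {
  long <- gather(data.frame(coordChunk))                   # stacks lat, lon, lat, lon, ...
  odds <- seq(1, by = 2, length.out = length(coordChunk))  # positions of the latitudes
  data.frame(lat = long[odds, 2], lon = long[odds + 1, 2])
}
# coords <- bind_rows(lapply(list(coord1, coord2, coord3, coord4), coordsToDF))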

Let’s take a look at the final product

head(coords)

Now we have to combine the coordinates with their respective CAMIS IDs, and then combine the coordinate data with the rest of my grade data. I used dplyr’s left_join function to join on CAMIS ID.

allAddresses <- bind_cols(data.frame(allAddresses[,1]), coords)
colnames(allAddresses)<- c("CAMIS", "LAT", "LON")
allAddresses$CAMIS<- as.numeric(as.character(allAddresses$CAMIS))
View(left_join(shrunkenData, allAddresses, by = "CAMIS"))
shrunkenData <- left_join(shrunkenData, allAddresses, by = "CAMIS")
#write.csv(file = "shrunkenData.csv", shrunkenData)

At this point I thought it would be fun to visualize the geocodes of the restaurants.

I input the data into a CartoDB map and plotted the grades as they were awarded over time. In the end, the results are not very instructive; they are mainly there as a proof of concept.

CartoDB Link - http://cdb.io/1D4DACe

CartoDB Embedded html

Association Rule Learning:

Using violation types

I was curious to see whether some of the violations commonly occur in tandem with other violations. For instance, if an inspector spots evidence of mice, do they usually also flag the restaurant for evidence of rats?

Or maybe if a food contact surface is not properly maintained, that surface might also be improperly constructed.

Since we have a machine learning algorithm that can do this for us, it probably makes more sense to get the computer to crunch the numbers rather than take guesses.

On to the machine learning…

Here are a few of the violation descriptions available to the inspectors, and a few of their codes:

head(levels(as.factor(inspecttbl$`VIOLATION DESCRIPTION`)))
## [1] "''''Wash hands\032 sign not posted at hand wash facility."                                                                                                                                    
## [2] "Accurate thermometer not provided in refrigerated or hot holding equipment."                                                                                                                  
## [3] "Appropriately scaled metal stem-type thermometer or thermocouple not provided or used to evaluate temperatures of potentially hazardous foods during cooking, cooling, reheating and holding."
## [4] "Bulb not shielded or shatterproof, in areas where there is extreme heat, temperature changes, or where accidental contact may occur."                                                         
## [5] "Canned food product observed dented and not segregated from other consumable food items."                                                                                                     
## [6] "Canned food product observed swollen, leaking or rusted, and not segregated from other consumable food items ."
head(levels(as.factor(inspecttbl$`VIOLATION CODE`)))
## [1] "02A" "02B" "02C" "02D" "02E" "02F"

There is a discrepancy between the number of codes and the number of descriptions. If you look closely, you’ll see that a few of the descriptions are nearly identical. My theory is that the descriptions were modified at some point to make them more accurate, while the same codes were kept for the similar descriptions.
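
You can check the discrepancy directly by counting the distinct values in each column (a quick sketch; output not shown):

n_distinct(inspecttbl$`VIOLATION CODE`)
n_distinct(inspecttbl$`VIOLATION DESCRIPTION`)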

In order to create the groupings of violations by inspection from the raw observations, I had to group them by restaurant first, then by inspection date. I then collected the violation codes for each inspection into a vector and placed all of the inspection violation vectors into a list to be processed by the association rule algorithm in the arules package.

codeVectorizer <- function(dTab){
  masterList <- list()
  for (i in 1:length(unique(dTab$CAMIS))){ # Cycle through inspection groups for individual restaurants 
    subset <- filter(dTab, CAMIS == unique(dTab$CAMIS)[i]) # Save a variable for easier reference
    for (j in 1:length(unique(subset$`INSPECTION DATE`))){ # Cycle through unique inspection dates and 
                                                           # append vectors of codes to master list
      masterList[[length(masterList) + 1]] <- 
        (unique(filter(subset, `INSPECTION DATE` == unique(subset$`INSPECTION DATE`)[j])$`VIOLATION CODE`))
    }      
  }
  return(masterList)
}
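
As an aside, a vectorized sketch using base R’s split() should produce the same itemsets much faster than the nested filter() calls; the ordering of the list differs, which does not matter to arules.

# Split violation codes by restaurant and inspection date in one pass,
# then keep the unique codes for each inspection.
codeVectorizerFast <- function(dTab) {
  lapply(split(dTab$`VIOLATION CODE`,
               list(dTab$CAMIS, dTab$`INSPECTION DATE`), drop = TRUE),
         unique)
}
# aList <- codeVectorizerFast(inspecttbl)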

Let’s run the code to create our set of “transactions”. It takes about 11 minutes. If it took substantially longer, I would have worked on parallelizing the code, but I didn’t find it necessary.

system.time({
  aList <- codeVectorizer(inspecttbl)
})
#save(aList, file = "aList2.bin")

names(aList) <- paste0('INSP', 1:length(aList))

Before we begin, we have to convert our list of vectors into a transactions class object.

aList <- as(aList, 'transactions')

Let’s look at the most common violations

itemFrequencyPlot((aList), cex = .8, topN = 40, type = 'absolute', col = 'darkgreen')

An exercise in slots using the summary function:

The summary function built into arules gives us the most frequent items found in the transactional data.

summary(aList)
## transactions as itemMatrix in sparse format with
##  148958 rows (elements/itemsets/transactions) and
##  79 columns (items) and a density of 0.03904176 
## 
## most frequent items:
##     10F     08A     02G     04L     06D (Other) 
##   62250   46737   42202   35455   30011  242776 
## 
## element (itemset/transaction) length distribution:
## sizes
##     0     1     2     3     4     5     6     7     8     9    10    11 
##  1379 17970 45509 36667 22139 12541  6328  3223  1625   807   398   189 
##    12    13    14    15    16    17 
##   108    39    24     6     1     5 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   3.084   4.000  17.000 
## 
## includes extended item information - examples:
##   labels
## 1    02A
## 2    02B
## 3    02C
## 
## includes extended transaction information - examples:
##   transactionID
## 1         INSP1
## 2         INSP2
## 3         INSP3

I wanted to access that vector without having to pull out the information manually. This introduced me to the concept of slots in an S4 object. The output of summary is an S4 object, and the information is saved in a slot called “itemSummary”. I discovered this by running str on the summary itself.

str(summary(aList))
## Formal class 'summary.transactions' [package "arules"] with 7 slots
##   ..@ transactionInfo:'data.frame':  3 obs. of  1 variable:
##   .. ..$ transactionID: Factor w/ 148958 levels "INSP1","INSP10",..: 1 60071 71182
##   ..@ Dim            : int [1:2] 148958 79
##   ..@ density        : num 0.039
##   ..@ itemSummary    : Named int [1:6] 62250 46737 42202 35455 30011 242776
##   .. ..- attr(*, "names")= chr [1:6] "10F" "08A" "02G" "04L" ...
##   ..@ lengths        : 'table' int [1:18(1d)] 1379 17970 45509 36667 22139 12541 6328 3223 1625 807 ...
##   .. ..- attr(*, "dimnames")=List of 1
##   .. .. ..$ sizes: chr [1:18] "0" "1" "2" "3" ...
##   ..@ lengthSummary  :Classes 'summaryDefault', 'table'  Named num [1:6] 0 2 3 3.08 4 ...
##   .. .. ..- attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...
##   ..@ itemInfo       :'data.frame':  3 obs. of  1 variable:
##   .. ..$ labels:Class 'AsIs'  chr [1:3] "02A" "02B" "02C"

Another way to do this is with slotNames

slotNames(summary(aList))
## [1] "transactionInfo" "Dim"             "density"         "itemSummary"    
## [5] "lengths"         "lengthSummary"   "itemInfo"

In order to access slots, you can use the slot function like so…

slot(summary(aList), "itemSummary")
##     10F     08A     02G     04L     06D (Other) 
##   62250   46737   42202   35455   30011  242776
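
Equivalently, S4 slots can be accessed directly with the @ operator:

summary(aList)@itemSummary   # same result as slot(summary(aList), "itemSummary")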

Great. Now to extract the info I wanted (the top 5 most frequent violations).

names(slot(summary(aList), "itemSummary"))[1:5]
## [1] "10F" "08A" "02G" "04L" "06D"
topViols <- names(slot(summary(aList), "itemSummary"))[1:5]

The goal of this is to see the most common violation descriptions

codeTable <- distinct(inspecttbl[,c(11,12)], `VIOLATION CODE`)
topDesc <- codeTable[codeTable$`VIOLATION CODE` %in% topViols,]
topDesc$`VIOLATION CODE` <- factor(topDesc$`VIOLATION CODE`, levels = topViols, ordered = F)
topDesc <- topDesc[order(topDesc$`VIOLATION CODE`),][,2]
print.AsIs(topDesc)
##                                                                                                                                                                                                                                                             VIOLATION DESCRIPTION
## 1 Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
## 2                                                                                                                                              Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 3                                                                                                                                                Cold food item held above 41º F (smoked fish and reduced oxygen packaged foods above 38 ºF) except during necessary preparation.
## 4                                                                                                                                                                                                 Evidence of mice or live mice present in facility's food and/or non-food areas.
## 5                                                                                                                                  Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.

Let’s get on to the machine learning.

Association rule learning can be an immense project if you don’t constrain your search. For example, a retailer with only 100 items has roughly 2^100 (about 10^30) possible itemsets that a learner would have to evaluate.

The apriori algorithm in the arules package simplifies the search with minsup (minimum support) and minconf (minimum confidence) parameters, which speeds up the process tremendously by only searching for rules above those support and confidence thresholds.
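
That 10^30 figure is simply the count of all possible itemsets over 100 items:

2^100   # every subset of the 100 items is a candidate itemset, ~1.27e30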

firstRules <- apriori(aList)
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen
##         0.8    0.1    1 none FALSE            TRUE     0.1      1     10
##  target   ext
##   rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[79 item(s), 148958 transaction(s)] done [0.03s].
## sorting and recoding items ... [9 item(s)] done [0.01s].
## creating transaction tree ... done [0.06s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [1 rule(s)] done [0.00s].
## creating S4 object  ... done [0.02s].
inspect(firstRules[1, ])
##   lhs      rhs     support confidence     lift
## 1 {04L} => {08A} 0.1916849  0.8053307 2.566713
#inspect(sort(firstRules, by="lift")[1, ])

This returned only one rule, likely because the default support and confidence parameters are a bit too high for our data set.

The default behavior is to mine rules with support 0.1, confidence 0.8, and maxlen 10. Let’s lower them a bit and see what we get.

secondRules <- apriori(aList, parameter=list(support=0.006,
                                                 confidence=0.25, 
                                                 minlen=2))
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen
##        0.25    0.1    1 none FALSE            TRUE   0.006      2     10
##  target   ext
##   rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[79 item(s), 148958 transaction(s)] done [0.04s].
## sorting and recoding items ... [32 item(s)] done [0.01s].
## creating transaction tree ... done [0.05s].
## checking subsets of size 1 2 3 4 done [0.01s].
## writing ... [416 rule(s)] done [0.00s].
## creating S4 object  ... done [0.02s].
#inspect(secondRules[1:10, ])
inspect(sort(secondRules, by="lift")[1:10, ])
##    lhs      rhs       support confidence     lift
## 1  {02G,                                         
##     08A,                                         
##     10F} => {04M} 0.007243653  0.2885027 3.231187
## 2  {06C,                                         
##     08A} => {04M} 0.014608145  0.2809555 3.146659
## 3  {02B,                                         
##     08A} => {04M} 0.014151640  0.2695308 3.018704
## 4  {08A,                                         
##     08C} => {04L} 0.012057090  0.7034861 2.955574
## 5  {08A,                                         
##     10F} => {04M} 0.025154742  0.2630581 2.946211
## 6  {04H,                                         
##     08A} => {04N} 0.014594718  0.4144900 2.941337
## 7  {04L,                                         
##     04N,                                         
##     06C} => {08A} 0.006686449  0.9146006 2.914972
## 8  {04L,                                         
##     04N,                                         
##     06D} => {08A} 0.006518616  0.9134525 2.911313
## 9  {04L,                                         
##     04M,                                         
##     06C} => {08A} 0.006344070  0.9104046 2.901599
## 10 {04L,                                         
##     06C,                                         
##     06D} => {08A} 0.008311068  0.9082905 2.894861

Let’s parse through this a bit:

Support is the fraction of transactions that contain both sides of the rule. So, for the first rule, the number of transactions that contain 02G, 08A, 10F, and 04M is very small: I have roughly 149,000 transactions, and multiplying that by the support shows there were only about 1,080 transactions that contained this combination of violations.

Confidence is the measure of how often items on the right hand side (rhs) appear in transactions that contain the items on the left hand side (lhs). For our first rule, we know that 04M appears in transactions containing 02G, 08A, and 10F about 29% of the time.

Lift is the support of the whole rule divided by the product of the supports of the left hand side and the right hand side. It is a measure of how much more often the two sides occur together than we would expect if they were independent, so it gauges how well the rule performs relative to a random-choice targeting model.

For our first rule, the lift of about 3.2 tells us that this grouping of violations occurs together roughly three times more often than it would if the left and right hand sides were unrelated, even though the rule’s overall support is small.
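
To make the arithmetic concrete, here is a quick check (a sketch) of these measures for the simpler rule {04L} => {08A} from firstRules, using the item counts reported by summary(aList) above:

n        <- 148958               # total inspections (transactions)
suppLHS  <- 35455 / n            # support of 04L
suppRHS  <- 46737 / n            # support of 08A
suppRule <- 0.1916849            # support of {04L, 08A} reported by apriori
suppRule / suppLHS               # confidence, ~0.805
(suppRule / suppLHS) / suppRHS   # lift, ~2.57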

We can create a tighter version of our rule set by raising the minimum support threshold to 0.08. This leaves us with a list of 11 rules.

thirdRules <- apriori(aList, parameter=list(support=0.08,
                                             confidence=0.25, 
                                             minlen=2))
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen
##        0.25    0.1    1 none FALSE            TRUE    0.08      2     10
##  target   ext
##   rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[79 item(s), 148958 transaction(s)] done [0.03s].
## sorting and recoding items ... [12 item(s)] done [0.00s].
## creating transaction tree ... done [0.05s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [11 rule(s)] done [0.00s].
## creating S4 object  ... done [0.02s].
#inspect(thirdRules[1:10, ])
inspect(sort(thirdRules, by="lift")[1:11, ])
##    lhs      rhs      support confidence      lift
## 1  {04L} => {08A} 0.19168490  0.8053307 2.5667127
## 2  {08A} => {04L} 0.19168490  0.6109292 2.5667127
## 3  {08A} => {04N} 0.10550625  0.3362646 2.3862278
## 4  {04N} => {08A} 0.10550625  0.7487018 2.3862278
## 5  {10B} => {10F} 0.08123095  0.4458200 1.0668024
## 6  {06D} => {10F} 0.08715880  0.4326080 1.0351876
## 7  {02G} => {08A} 0.08307040  0.2932089 0.9345017
## 8  {08A} => {02G} 0.08307040  0.2647581 0.9345017
## 9  {02G} => {10F} 0.10520415  0.3713331 0.8885629
## 10 {10F} => {02G} 0.10520415  0.2517430 0.8885629
## 11 {08A} => {10F} 0.09562427  0.3047692 0.7292822
What this means:

Eight times out of ten, if you’ve got a 04L, you’re going to get an 08A.

This could be useful information for investigators and restaurateurs alike.

If you know you’ve got improperly constructed food contact surfaces, you might also want to self-regulate and fix the other commonly occurring violations that usually accompany them.

Some of the most impactful rules included:

1: 04L -> 08A: “Evidence of mice or live mice present in facility’s food and/or non-food areas.” -> “Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.”

2: 08A -> 04N: “Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.” -> “Filth flies or food/refuse/sewage-associated (FRSA) flies in facility’s food and/or non-food areas.”

3: 10B -> 10F: “Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly.” -> “Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.”

4: 02G -> 08A: “Cold food item held above 41°F (smoked fish and reduced oxygen packaged food above 38°F), except during necessary preparation.” -> “Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.”

5: 02G -> 10F: “Cold food item held above 41°F (smoked fish and reduced oxygen packaged food above 38°F), except during necessary preparation.” -> “Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.”

6: 08A -> 10F: “Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.” -> “Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.”

There are a few great visualizations for association rules in the arulesViz package.

This two-key plot allows us to visualize the three parameters of interest: lift, support, and confidence, all at once. It is great for spotting outliers and identifying clusters of activity.

plot(secondRules, shading="lift", control=list(main = "Two-key plot of 416 Rules", col=sequential_hcl(20))) 

The code for this next plot displays a great interactive plot, but unfortunately it only works within the R environment. It allows you to visually inspect the plot above, manually select the rules you wish to examine, and then see the summary statistics for those rules.

sel <- plot(secondRules, measure=c("support", "lift"), shading="confidence", interactive=TRUE) 

If we plot a tighter set of rules, we can represent it as an association rules matrix to get a sense of the distribution of the rules.

fourthRules <- apriori(aList, parameter=list(support=0.02,
                                             confidence=0.25,
                                             minlen=2))
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen
##        0.25    0.1    1 none FALSE            TRUE    0.02      2     10
##  target   ext
##   rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[79 item(s), 148958 transaction(s)] done [0.03s].
## sorting and recoding items ... [29 item(s)] done [0.01s].
## creating transaction tree ... done [0.06s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [81 rule(s)] done [0.00s].
## creating S4 object  ... done [0.02s].
#inspect(fourthRules[1:10, ])
inspect(sort(fourthRules, by="lift")[1:11, ])
##    lhs      rhs      support confidence     lift
## 1  {08A,                                        
##     10F} => {04M} 0.02515474  0.2630581 2.946211
## 2  {02G,                                        
##     08A} => {04M} 0.02126774  0.2560207 2.867393
## 3  {04L,                                        
##     04N} => {08A} 0.03098860  0.8699585 2.772692
## 4  {04L,                                        
##     04M} => {08A} 0.02436257  0.8632255 2.751232
## 5  {06C,                                        
##     08A} => {04L} 0.03345238  0.6433828 2.703061
## 6  {04L,                                        
##     06C} => {08A} 0.03345238  0.8464413 2.697739
## 7  {04L,                                        
##     10B} => {08A} 0.03507029  0.8448973 2.692818
## 8  {04L,                                        
##     10F} => {08A} 0.05610306  0.8444826 2.691496
## 9  {06D,                                        
##     08A} => {04L} 0.03517099  0.6375806 2.678684
## 10 {04L,                                        
##     06D} => {08A} 0.03517099  0.8350335 2.661380
## 11 {02B,                                        
##     08A} => {04L} 0.03282133  0.6251119 2.626299
plot(fourthRules, method = "grouped", control = list(k=80, col=sequential_hcl(5))) # grouped matrix plot of the rules

We can see visually that 04L and 04N tend to gravitate toward 08A, which is what our analysis above confirms.

CONCLUSION

Though it’s not quite as impactful as Target being able to tell you when your teenage daughter is pregnant, we can still glean some relationships between the violations that may not have been obvious at first blush.