LA Restaurant Health Inspections

Tidyverse Recipe

Often a classification schema is implicitly used and displayed in a large dataset, but the classification schema itself is unavailable in an easy-to-use form. This tidyverse recipe illustrates how to extract the classification data from a large dataset, identify all unique observed values and returned those unique values in a dataframe. In effect, we will use the dplyr package to perform these basic transformations.

An Example: LA Restaurant Health Inspections

A basic example is the set of restaurant health inspections collected for all restaurants in the City of Los Angeles by the Los Angeles Environmental Health Department. The department publishes this dataset as a csv file to Kaggle. The file is large and contains 58,872 rows and 20 columns. Moreover, file is approximately 14MB in size. This file will represent our large dataset for this recipe.

Moreover, certain columns are repetitive and clearly important for underlying the scope and complexity of the inspection assessment. These include pe_description (which means program element description) and program_element_pe (a 4 digit numeric code).

Examples of the pe_description field values are:

pe_description: RESTAURANT (0-30) SEATS MODERATE RISK
pe_description: RESTAURANT (61-150) SEATS HIGH RISK
pe_description: FOOD MKT RETAIL (25-1,999 SF) LOW RISK

Examples of the program_element_pe values are:

1631
1630
1636

We decide to select two columns of the data file: pe_description and program_element_pe to define our classification schema.

The tidyverse recipe will then seek to report all unique 2-tuples of (pe_description, program_element_pe) in the data set.

We will also repeat the recipe to seek all unique 1-tuples of distinct (pe_description) and (program_element_pe)

We can therefore answer three questions with this recipe:

How many distinct pe_description values exist in the LA County Environmental Health classification schema?

For a given pe_description, do all restaurants get assigned the same program_element_pe during their inspection?

What are the distinct values for program_element_pe?

Data Source

inspections = read_csv("https://raw.githubusercontent.com/completegraph/DataStore/master/Tidyverse/restaurant-and-market-health-inspections.csv")

str(inspections)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 58872 obs. of  20 variables:
##  $ serial_number      : chr  "DAJ00E07B" "DAQOKRFZB" "DASJI4LUR" "DAWVA0CY3" ...
##  $ activity_date      : POSIXct, format: "2017-12-29" "2017-12-29" ...
##  $ facility_name      : chr  "HABITAT COFFEE SHOP" "REILLY'S" "STREET CHURROS" "RIO GENTLEMANS CLUB" ...
##  $ score              : num  95 92 93 93 93 94 96 94 93 95 ...
##  $ grade              : chr  "A" "A" "A" "A" ...
##  $ service_code       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ service_description: chr  "ROUTINE INSPECTION" "ROUTINE INSPECTION" "ROUTINE INSPECTION" "ROUTINE INSPECTION" ...
##  $ employee_id        : chr  "EE0000923" "EE0000633" "EE0000835" "EE0000958" ...
##  $ facility_address   : chr  "3708 N EAGLE ROCK BLVD" "100 WORLD WAY # 120" "6801 HOLLYWOOD BLVD # 253" "13124 S FIGUEROA ST" ...
##  $ facility_city      : chr  "LOS ANGELES" "LOS ANGELES" "LOS ANGELES" "LOS ANGELES" ...
##  $ facility_id        : chr  "FA0170465" "FA0244690" "FA0224109" "FA0046462" ...
##  $ facility_state     : chr  "CA" "CA" "CA" "CA" ...
##  $ facility_zip       : chr  "90065" "90045" "90028" "90064" ...
##  $ owner_id           : chr  "OW0178123" "OW0208441" "OW0228670" "OW0011830" ...
##  $ owner_name         : chr  "GLASSELL COFFEE SHOP LLC" "AREAS SKYVIEW LAX JV, LLC" "STREETCHURROS, INC" "FIGUEROA GROUP INC" ...
##  $ pe_description     : chr  "RESTAURANT (0-30) SEATS MODERATE RISK" "RESTAURANT (0-30) SEATS MODERATE RISK" "RESTAURANT (0-30) SEATS LOW RISK" "RESTAURANT (61-150) SEATS LOW RISK" ...
##  $ program_element_pe : num  1631 1631 1630 1636 1638 ...
##  $ program_name       : chr  "HABITAT COFFEE SHOP" "REILLY'S" "STREET CHURROS" "RIO GENTLEMANS CLUB" ...
##  $ program_status     : chr  "ACTIVE" "ACTIVE" "ACTIVE" "ACTIVE" ...
##  $ record_id          : chr  "PR0160774" "PR0193026" "PR0179282" "PR0044776" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   serial_number = col_character(),
##   ..   activity_date = col_datetime(format = ""),
##   ..   facility_name = col_character(),
##   ..   score = col_double(),
##   ..   grade = col_character(),
##   ..   service_code = col_double(),
##   ..   service_description = col_character(),
##   ..   employee_id = col_character(),
##   ..   facility_address = col_character(),
##   ..   facility_city = col_character(),
##   ..   facility_id = col_character(),
##   ..   facility_state = col_character(),
##   ..   facility_zip = col_character(),
##   ..   owner_id = col_character(),
##   ..   owner_name = col_character(),
##   ..   pe_description = col_character(),
##   ..   program_element_pe = col_double(),
##   ..   program_name = col_character(),
##   ..   program_status = col_character(),
##   ..   record_id = col_character()
##   .. )

Using the distinct function in dplyr

The function distinct() accomplishes this goal. There are two ways to apply the function in this example. In the first way, we may choose to extract the subset of columns used to obtain the unique k-tuples. In this case, k=2 or k=1. The second way, a feature of distinct, allows use to obtain the unique k-tuples without extracting the subset of columns of interest. We start by illustrating the first approach.

Approach 1: Taking a subset of columns to create a helper dataframe

We illustrate this approach by extracting an intermediate dataframe to which the distinct function is applied. In the code below, we use select to choose the columns of interest and then distinct to get all distinct k-tuples. Lastly, we sort the unique values by some ordering.

inspections %>% select( pe_description, program_element_pe) -> trimmed_dataframe

trimmed_dataframe %>% distinct() %>% arrange(pe_description, program_element_pe) -> distinct_two_tuples

str(distinct_two_tuples)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 18 obs. of  2 variables:
##  $ pe_description    : chr  "FOOD MKT RETAIL (2,000+ SF) HIGH RISK" "FOOD MKT RETAIL (2,000+ SF) LOW RISK" "FOOD MKT RETAIL (2,000+ SF) MODERATE RISK" "FOOD MKT RETAIL (25-1,999 SF) HIGH RISK" ...
##  $ program_element_pe: num  1615 1613 1614 1612 1610 ...

distinct_two_tuples %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover")) %>%
   scroll_box(width="85%",  height="200px")

pe_description	program_element_pe
FOOD MKT RETAIL (2,000+ SF) HIGH RISK	1615
FOOD MKT RETAIL (2,000+ SF) LOW RISK	1613
FOOD MKT RETAIL (2,000+ SF) MODERATE RISK	1614
FOOD MKT RETAIL (25-1,999 SF) HIGH RISK	1612
FOOD MKT RETAIL (25-1,999 SF) LOW RISK	1610
FOOD MKT RETAIL (25-1,999 SF) MODERATE RISK	1611
RESTAURANT (0-30) SEATS HIGH RISK	1632
RESTAURANT (0-30) SEATS LOW RISK	1630
RESTAURANT (0-30) SEATS MODERATE RISK	1631
RESTAURANT (151 + ) SEATS HIGH RISK	1641
RESTAURANT (151 + ) SEATS LOW RISK	1639
RESTAURANT (151 + ) SEATS MODERATE RISK	1640
RESTAURANT (31-60) SEATS HIGH RISK	1635
RESTAURANT (31-60) SEATS LOW RISK	1633
RESTAURANT (31-60) SEATS MODERATE RISK	1634
RESTAURANT (61-150) SEATS HIGH RISK	1638
RESTAURANT (61-150) SEATS LOW RISK	1636
RESTAURANT (61-150) SEATS MODERATE RISK	1637

Let’s test whether the two values are assigned in a 1-to-1 correspondence to each other.

inspections %>% select(pe_description) %>% distinct() %>% arrange(pe_description) -> distinct_pe_description
inspections %>% select(program_element_pe ) %>% distinct() %>% arrange(program_element_pe) -> distinct_program_element_pe

nrow(distinct_pe_description)

## [1] 18

nrow(distinct_program_element_pe)

## [1] 18

nrow(distinct_two_tuples)

## [1] 18

We see that both the 2-tuples and 1-tuples are all the same size. This implies that the mapping between the two columns is 1-to-1.

Approach 2: Finding distinct tuples without a helper dataframe

The second approach is feasible because of the clever syntax of the distinct function. distinct allows the identification of unique k-tuples by allowing the user to specify the k column names to use.

We illustrate the avoidance of the helper dataframe below.

inspections %>% distinct( pe_description, program_element_pe) %>% 
    arrange(pe_description) %>% 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped", "hover")) %>%
    scroll_box(width="90%", height="200px")

pe_description	program_element_pe
FOOD MKT RETAIL (2,000+ SF) HIGH RISK	1615
FOOD MKT RETAIL (2,000+ SF) LOW RISK	1613
FOOD MKT RETAIL (2,000+ SF) MODERATE RISK	1614
FOOD MKT RETAIL (25-1,999 SF) HIGH RISK	1612
FOOD MKT RETAIL (25-1,999 SF) LOW RISK	1610
FOOD MKT RETAIL (25-1,999 SF) MODERATE RISK	1611
RESTAURANT (0-30) SEATS HIGH RISK	1632
RESTAURANT (0-30) SEATS LOW RISK	1630
RESTAURANT (0-30) SEATS MODERATE RISK	1631
RESTAURANT (151 + ) SEATS HIGH RISK	1641
RESTAURANT (151 + ) SEATS LOW RISK	1639
RESTAURANT (151 + ) SEATS MODERATE RISK	1640
RESTAURANT (31-60) SEATS HIGH RISK	1635
RESTAURANT (31-60) SEATS LOW RISK	1633
RESTAURANT (31-60) SEATS MODERATE RISK	1634
RESTAURANT (61-150) SEATS HIGH RISK	1638
RESTAURANT (61-150) SEATS LOW RISK	1636
RESTAURANT (61-150) SEATS MODERATE RISK	1637

In addition, if the .keep_all option is set to TRUE, the user can even obtain the first matching entire row of the dataset associated with each unique k-tuple. We illustrate the ability to keep a representative row of the complete set of columns below.

inspections %>% distinct( pe_description, .keep_all=TRUE) %>% 
    arrange(pe_description) %>% 
    select(pe_description, program_element_pe, everything()) %>%   # place the pe_description column first
    kable() %>% 
    kable_styling(bootstrap_options = c("striped", "hover")) %>%
    scroll_box(width="85%", height="200px" )

pe_description	program_element_pe	serial_number	activity_date	facility_name	score	grade	service_code	service_description	employee_id	facility_address	facility_city	facility_id	facility_state	facility_zip	owner_id	owner_name	program_name	program_status	record_id
FOOD MKT RETAIL (2,000+ SF) HIGH RISK	1615	DAP9QQTMX	2017-12-27	G AND J MARKET	92	A	1	ROUTINE INSPECTION	EE0000798	2045 W MLK BLVD	LOS ANGELES	FA0024244	CA	90062	OW0019978	KPL PARTNERS, INC.	G AND J MARKET	ACTIVE	PR0024262
FOOD MKT RETAIL (2,000+ SF) LOW RISK	1613	DAKSJB0AB	2017-12-29	7 ELEVEN #37215A	95	A	1	ROUTINE INSPECTION	EE0000839	3330 W FLORENCE AVE	LOS ANGELES	FA0243779	CA	90043-4706	OW0239028	S & U INVESTMENTS, INC.	7 ELEVEN #37215A	ACTIVE	PR0192029
FOOD MKT RETAIL (2,000+ SF) MODERATE RISK	1614	DAI2IWYYV	2017-12-26	CIRCLE K	93	A	1	ROUTINE INSPECTION	EE0000366	3360 E OLYMPIC BLVD	LOS ANGELES	FA0243565	CA	90023-3724	OW0238848	MARKET AVENUE INC.	CIRCLE K	ACTIVE	PR0191776
FOOD MKT RETAIL (25-1,999 SF) HIGH RISK	1612	DA0OD6RH9	2017-12-27	ALEX FISH MARKET	98	A	1	ROUTINE INSPECTION	EE0000798	1451 W MARTIN LUTHER KING JR BLVD	LOS ANGELES	FA0003285	CA	90062	OW0020175	KWON,UNG	ALEX FISH MARKET	ACTIVE	PR0004662
FOOD MKT RETAIL (25-1,999 SF) LOW RISK	1610	DAEWQMQV2	2017-12-29	MARCE’S MINI MARKET	96	A	1	ROUTINE INSPECTION	EE0000707	4160 S AVALON BLVD	LOS ANGELES	FA0066361	CA	90011	OW0015364	HERNANDEZ, ERNESTINA	MARCE’S MINI MARKET	ACTIVE	PR0010572
FOOD MKT RETAIL (25-1,999 SF) MODERATE RISK	1611	DA1LIO8WC	2017-12-28	LA SIRENA FISH MARKET	98	A	1	ROUTINE INSPECTION	EE0000366	4226 E OLYMPIC BLVD	LOS ANGELES	FA0006810	CA	90023	OW0005654	MARIA CABRALES	LA SIRENA FISH MARKET	ACTIVE	PR0045798
RESTAURANT (0-30) SEATS HIGH RISK	1632	DAMV56BMJ	2017-12-29	THE SPOT GRILL	93	A	1	ROUTINE INSPECTION	EE0000727	10004 NATIONAL BLVD	LOS ANGELES	FA0245224	CA	90034	OW0240313	LIQUOR CLUB INC	THE SPOT GRILL	INACTIVE	PR0193589
RESTAURANT (0-30) SEATS LOW RISK	1630	DASJI4LUR	2017-12-29	STREET CHURROS	93	A	1	ROUTINE INSPECTION	EE0000835	6801 HOLLYWOOD BLVD # 253	LOS ANGELES	FA0224109	CA	90028	OW0228670	STREETCHURROS, INC	STREET CHURROS	ACTIVE	PR0179282
RESTAURANT (0-30) SEATS MODERATE RISK	1631	DAJ00E07B	2017-12-29	HABITAT COFFEE SHOP	95	A	1	ROUTINE INSPECTION	EE0000923	3708 N EAGLE ROCK BLVD	LOS ANGELES	FA0170465	CA	90065	OW0178123	GLASSELL COFFEE SHOP LLC	HABITAT COFFEE SHOP	ACTIVE	PR0160774
RESTAURANT (151 + ) SEATS HIGH RISK	1641	DA85ZLP32	2017-12-28	THE FOUNDATION CENTER	97	A	1	ROUTINE INSPECTION	EE0000224	11633 S WESTERN AVE	LOS ANGELES	FA0165629	CA	90047	OW0128872	NOAH’S DWELLING, LLC	THE FOUNDATION CENTER	ACTIVE	PR0153961
RESTAURANT (151 + ) SEATS LOW RISK	1639	DA1021441	2017-12-26	WESTIN LAKEVIEW BISTRO	94	A	1	ROUTINE INSPECTION	EE0000721	404 S FIGUEROA ST LBBY LOBBY	LOS ANGELES	FA0073897	CA	90071	OW0016394	HOTEL BONAVENTURE LTD PARTNE	WESTIN LOBBY COURT	ACTIVE	PR0010971
RESTAURANT (151 + ) SEATS MODERATE RISK	1640	DADCYUBZ0	2017-12-28	SPECIALTY’S CAFE & BAKERY	94	A	1	ROUTINE INSPECTION	EE0000924	2121 AVE OF STARS STE #100	LOS ANGELES	FA0033907	CA	90067	OW0033774	SPECIALTY’S CAFE & BAKERY, INC.	SPECIALTY’S CAFE & BAKERY	ACTIVE	PR0018808
RESTAURANT (31-60) SEATS HIGH RISK	1635	DAHKJFBMR	2017-12-28	UPSTAIRS	95	A	1	ROUTINE INSPECTION	EE0000581	3707 N CAHUENGA BLVD	LOS ANGELES	FA0137987	CA	91604	OW0101400	CLARITY MANAGEMENT, INC.	UPSTAIRS	ACTIVE	PR0122211
RESTAURANT (31-60) SEATS LOW RISK	1633	DAQMQMDOR	2017-12-26	ADMIRALS CLUB	99	A	1	ROUTINE INSPECTION	EE0000122	400 WORLD WAY	LOS ANGELES	FA0004930	CA	90045	OW0125020	AMERICAN AIRLINES, INC.	ADMIRALS CLUB CHAMPAGNE BAR	ACTIVE	PR0201945
RESTAURANT (31-60) SEATS MODERATE RISK	1634	DAQKIE2OT	2017-12-27	SUBWAY -52105	95	A	1	ROUTINE INSPECTION	EE0000721	255 S GRAND AVE STE 101	LOS ANGELES	FA0034746	CA	90012	OW0005591	C G INVESTMENTS INC	SUBWAY -52105	ACTIVE	PR0017785
RESTAURANT (61-150) SEATS HIGH RISK	1638	DAKFCHD0L	2017-12-29	LE PAIN QUOTIDIEN	93	A	1	ROUTINE INSPECTION	EE0000629	13050 SAN VICENTE BLVD STE 114	LOS ANGELES	FA0034788	CA	90049	OW0028928	PQ SAN VICENTE INC.	LE PAIN QUOTIDIEN	ACTIVE	PR0017456
RESTAURANT (61-150) SEATS LOW RISK	1636	DAWVA0CY3	2017-12-29	RIO GENTLEMANS CLUB	93	A	1	ROUTINE INSPECTION	EE0000958	13124 S FIGUEROA ST	LOS ANGELES	FA0046462	CA	90064	OW0011830	FIGUEROA GROUP INC	RIO GENTLEMANS CLUB	ACTIVE	PR0044776
RESTAURANT (61-150) SEATS MODERATE RISK	1637	DAF0GFYS4	2017-12-28	DELTA SKY CLUB T3	93	A	1	ROUTINE INSPECTION	EE0000633	300 WORLD WAY	LOS ANGELES	FA0248072	CA	90045	OW0242509	DELTA AIR LINES, INC.	DELTA SKY CLUB T3	ACTIVE	PR0196711

Conclusion

The ability to extract classification schema from a larger data set is a frequently required recipe. Outside of data science, one can easily do this process using Excel and a pivot table to get distinct values for a single column. It is even possible though non-trivial to obtain the unique 2-tuples. However, it is more difficult to get a representative row of an entire dataset in Excel using the pivot table approach for getting distinct values. Thus, the tidyverse recipe approach clearly shows more flexibility and power rather quickly.

LA Restaurant Health Inspections - Tidyverse Recipe

Alexander Ng