library(tidyverse)
library(knitr)
library(kableExtra)

Tidyverse Recipe

Often a classification schema is implicitly used and displayed in a large dataset, but the classification schema itself is unavailable in an easy-to-use form. This tidyverse recipe illustrates how to extract the classification data from a large dataset, identify all unique observed values and returned those unique values in a dataframe. In effect, we will use the dplyr package to perform these basic transformations.

An Example: LA Restaurant Health Inspections

A basic example is the set of restaurant health inspections collected for all restaurants in the City of Los Angeles by the Los Angeles Environmental Health Department. The department publishes this dataset as a csv file to Kaggle. The file is large and contains 58,872 rows and 20 columns. Moreover, file is approximately 14MB in size. This file will represent our large dataset for this recipe.

Moreover, certain columns are repetitive and clearly important for underlying the scope and complexity of the inspection assessment. These include pe_description (which means program element description) and program_element_pe (a 4 digit numeric code).

Examples of the pe_description field values are:

Examples of the program_element_pe values are:

We decide to select two columns of the data file: pe_description and program_element_pe to define our classification schema.

The tidyverse recipe will then seek to report all unique 2-tuples of (pe_description, program_element_pe) in the data set.

We will also repeat the recipe to seek all unique 1-tuples of distinct (pe_description) and (program_element_pe)

We can therefore answer three questions with this recipe:

How many distinct pe_description values exist in the LA County Environmental Health classification schema?

For a given pe_description, do all restaurants get assigned the same program_element_pe during their inspection?

What are the distinct values for program_element_pe?

Data Source

inspections = read_csv("https://raw.githubusercontent.com/completegraph/DataStore/master/Tidyverse/restaurant-and-market-health-inspections.csv")

str(inspections)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 58872 obs. of  20 variables:
##  $ serial_number      : chr  "DAJ00E07B" "DAQOKRFZB" "DASJI4LUR" "DAWVA0CY3" ...
##  $ activity_date      : POSIXct, format: "2017-12-29" "2017-12-29" ...
##  $ facility_name      : chr  "HABITAT COFFEE SHOP" "REILLY'S" "STREET CHURROS" "RIO GENTLEMANS CLUB" ...
##  $ score              : num  95 92 93 93 93 94 96 94 93 95 ...
##  $ grade              : chr  "A" "A" "A" "A" ...
##  $ service_code       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ service_description: chr  "ROUTINE INSPECTION" "ROUTINE INSPECTION" "ROUTINE INSPECTION" "ROUTINE INSPECTION" ...
##  $ employee_id        : chr  "EE0000923" "EE0000633" "EE0000835" "EE0000958" ...
##  $ facility_address   : chr  "3708 N EAGLE ROCK BLVD" "100 WORLD WAY # 120" "6801 HOLLYWOOD BLVD # 253" "13124 S FIGUEROA ST" ...
##  $ facility_city      : chr  "LOS ANGELES" "LOS ANGELES" "LOS ANGELES" "LOS ANGELES" ...
##  $ facility_id        : chr  "FA0170465" "FA0244690" "FA0224109" "FA0046462" ...
##  $ facility_state     : chr  "CA" "CA" "CA" "CA" ...
##  $ facility_zip       : chr  "90065" "90045" "90028" "90064" ...
##  $ owner_id           : chr  "OW0178123" "OW0208441" "OW0228670" "OW0011830" ...
##  $ owner_name         : chr  "GLASSELL COFFEE SHOP LLC" "AREAS SKYVIEW LAX JV, LLC" "STREETCHURROS, INC" "FIGUEROA GROUP INC" ...
##  $ pe_description     : chr  "RESTAURANT (0-30) SEATS MODERATE RISK" "RESTAURANT (0-30) SEATS MODERATE RISK" "RESTAURANT (0-30) SEATS LOW RISK" "RESTAURANT (61-150) SEATS LOW RISK" ...
##  $ program_element_pe : num  1631 1631 1630 1636 1638 ...
##  $ program_name       : chr  "HABITAT COFFEE SHOP" "REILLY'S" "STREET CHURROS" "RIO GENTLEMANS CLUB" ...
##  $ program_status     : chr  "ACTIVE" "ACTIVE" "ACTIVE" "ACTIVE" ...
##  $ record_id          : chr  "PR0160774" "PR0193026" "PR0179282" "PR0044776" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   serial_number = col_character(),
##   ..   activity_date = col_datetime(format = ""),
##   ..   facility_name = col_character(),
##   ..   score = col_double(),
##   ..   grade = col_character(),
##   ..   service_code = col_double(),
##   ..   service_description = col_character(),
##   ..   employee_id = col_character(),
##   ..   facility_address = col_character(),
##   ..   facility_city = col_character(),
##   ..   facility_id = col_character(),
##   ..   facility_state = col_character(),
##   ..   facility_zip = col_character(),
##   ..   owner_id = col_character(),
##   ..   owner_name = col_character(),
##   ..   pe_description = col_character(),
##   ..   program_element_pe = col_double(),
##   ..   program_name = col_character(),
##   ..   program_status = col_character(),
##   ..   record_id = col_character()
##   .. )

Using the distinct function in dplyr

The function distinct() accomplishes this goal. There are two ways to apply the function in this example. In the first way, we may choose to extract the subset of columns used to obtain the unique k-tuples. In this case, k=2 or k=1. The second way, a feature of distinct, allows use to obtain the unique k-tuples without extracting the subset of columns of interest. We start by illustrating the first approach.

Approach 1: Taking a subset of columns to create a helper dataframe

We illustrate this approach by extracting an intermediate dataframe to which the distinct function is applied. In the code below, we use select to choose the columns of interest and then distinct to get all distinct k-tuples. Lastly, we sort the unique values by some ordering.

inspections %>% select( pe_description, program_element_pe) -> trimmed_dataframe

trimmed_dataframe %>% distinct() %>% arrange(pe_description, program_element_pe) -> distinct_two_tuples

str(distinct_two_tuples)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 18 obs. of  2 variables:
##  $ pe_description    : chr  "FOOD MKT RETAIL (2,000+ SF) HIGH RISK" "FOOD MKT RETAIL (2,000+ SF) LOW RISK" "FOOD MKT RETAIL (2,000+ SF) MODERATE RISK" "FOOD MKT RETAIL (25-1,999 SF) HIGH RISK" ...
##  $ program_element_pe: num  1615 1613 1614 1612 1610 ...
distinct_two_tuples %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover")) %>%
   scroll_box(width="85%",  height="200px")
pe_description program_element_pe
FOOD MKT RETAIL (2,000+ SF) HIGH RISK 1615
FOOD MKT RETAIL (2,000+ SF) LOW RISK 1613
FOOD MKT RETAIL (2,000+ SF) MODERATE RISK 1614
FOOD MKT RETAIL (25-1,999 SF) HIGH RISK 1612
FOOD MKT RETAIL (25-1,999 SF) LOW RISK 1610
FOOD MKT RETAIL (25-1,999 SF) MODERATE RISK 1611
RESTAURANT (0-30) SEATS HIGH RISK 1632
RESTAURANT (0-30) SEATS LOW RISK 1630
RESTAURANT (0-30) SEATS MODERATE RISK 1631
RESTAURANT (151 + ) SEATS HIGH RISK 1641
RESTAURANT (151 + ) SEATS LOW RISK 1639
RESTAURANT (151 + ) SEATS MODERATE RISK 1640
RESTAURANT (31-60) SEATS HIGH RISK 1635
RESTAURANT (31-60) SEATS LOW RISK 1633
RESTAURANT (31-60) SEATS MODERATE RISK 1634
RESTAURANT (61-150) SEATS HIGH RISK 1638
RESTAURANT (61-150) SEATS LOW RISK 1636
RESTAURANT (61-150) SEATS MODERATE RISK 1637

Let’s test whether the two values are assigned in a 1-to-1 correspondence to each other.

inspections %>% select(pe_description) %>% distinct() %>% arrange(pe_description) -> distinct_pe_description
inspections %>% select(program_element_pe ) %>% distinct() %>% arrange(program_element_pe) -> distinct_program_element_pe

nrow(distinct_pe_description)
## [1] 18
nrow(distinct_program_element_pe)
## [1] 18
nrow(distinct_two_tuples)
## [1] 18

We see that both the 2-tuples and 1-tuples are all the same size. This implies that the mapping between the two columns is 1-to-1.

Approach 2: Finding distinct tuples without a helper dataframe

The second approach is feasible because of the clever syntax of the distinct function. distinct allows the identification of unique k-tuples by allowing the user to specify the k column names to use.

We illustrate the avoidance of the helper dataframe below.

inspections %>% distinct( pe_description, program_element_pe) %>% 
    arrange(pe_description) %>% 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped", "hover")) %>%
    scroll_box(width="90%", height="200px")
pe_description program_element_pe
FOOD MKT RETAIL (2,000+ SF) HIGH RISK 1615
FOOD MKT RETAIL (2,000+ SF) LOW RISK 1613
FOOD MKT RETAIL (2,000+ SF) MODERATE RISK 1614
FOOD MKT RETAIL (25-1,999 SF) HIGH RISK 1612
FOOD MKT RETAIL (25-1,999 SF) LOW RISK 1610
FOOD MKT RETAIL (25-1,999 SF) MODERATE RISK 1611
RESTAURANT (0-30) SEATS HIGH RISK 1632
RESTAURANT (0-30) SEATS LOW RISK 1630
RESTAURANT (0-30) SEATS MODERATE RISK 1631
RESTAURANT (151 + ) SEATS HIGH RISK 1641
RESTAURANT (151 + ) SEATS LOW RISK 1639
RESTAURANT (151 + ) SEATS MODERATE RISK 1640
RESTAURANT (31-60) SEATS HIGH RISK 1635
RESTAURANT (31-60) SEATS LOW RISK 1633
RESTAURANT (31-60) SEATS MODERATE RISK 1634
RESTAURANT (61-150) SEATS HIGH RISK 1638
RESTAURANT (61-150) SEATS LOW RISK 1636
RESTAURANT (61-150) SEATS MODERATE RISK 1637

In addition, if the .keep_all option is set to TRUE, the user can even obtain the first matching entire row of the dataset associated with each unique k-tuple. We illustrate the ability to keep a representative row of the complete set of columns below.

inspections %>% distinct( pe_description, .keep_all=TRUE) %>% 
    arrange(pe_description) %>% 
    select(pe_description, program_element_pe, everything()) %>%   # place the pe_description column first
    kable() %>% 
    kable_styling(bootstrap_options = c("striped", "hover")) %>%
    scroll_box(width="85%", height="200px" )
pe_description program_element_pe serial_number activity_date facility_name score grade service_code service_description employee_id facility_address facility_city facility_id facility_state facility_zip owner_id owner_name program_name program_status record_id
FOOD MKT RETAIL (2,000+ SF) HIGH RISK 1615 DAP9QQTMX 2017-12-27 G AND J MARKET 92 A 1 ROUTINE INSPECTION EE0000798 2045 W MLK BLVD LOS ANGELES FA0024244 CA 90062 OW0019978 KPL PARTNERS, INC. G AND J MARKET ACTIVE PR0024262
FOOD MKT RETAIL (2,000+ SF) LOW RISK 1613 DAKSJB0AB 2017-12-29 7 ELEVEN #37215A 95 A 1 ROUTINE INSPECTION EE0000839 3330 W FLORENCE AVE LOS ANGELES FA0243779 CA 90043-4706 OW0239028 S & U INVESTMENTS, INC. 7 ELEVEN #37215A ACTIVE PR0192029
FOOD MKT RETAIL (2,000+ SF) MODERATE RISK 1614 DAI2IWYYV 2017-12-26 CIRCLE K 93 A 1 ROUTINE INSPECTION EE0000366 3360 E OLYMPIC BLVD LOS ANGELES FA0243565 CA 90023-3724 OW0238848 MARKET AVENUE INC. CIRCLE K ACTIVE PR0191776
FOOD MKT RETAIL (25-1,999 SF) HIGH RISK 1612 DA0OD6RH9 2017-12-27 ALEX FISH MARKET 98 A 1 ROUTINE INSPECTION EE0000798 1451 W MARTIN LUTHER KING JR BLVD LOS ANGELES FA0003285 CA 90062 OW0020175 KWON,UNG ALEX FISH MARKET ACTIVE PR0004662
FOOD MKT RETAIL (25-1,999 SF) LOW RISK 1610 DAEWQMQV2 2017-12-29 MARCE’S MINI MARKET 96 A 1 ROUTINE INSPECTION EE0000707 4160 S AVALON BLVD LOS ANGELES FA0066361 CA 90011 OW0015364 HERNANDEZ, ERNESTINA MARCE’S MINI MARKET ACTIVE PR0010572
FOOD MKT RETAIL (25-1,999 SF) MODERATE RISK 1611 DA1LIO8WC 2017-12-28 LA SIRENA FISH MARKET 98 A 1 ROUTINE INSPECTION EE0000366 4226 E OLYMPIC BLVD LOS ANGELES FA0006810 CA 90023 OW0005654 MARIA CABRALES LA SIRENA FISH MARKET ACTIVE PR0045798
RESTAURANT (0-30) SEATS HIGH RISK 1632 DAMV56BMJ 2017-12-29 THE SPOT GRILL 93 A 1 ROUTINE INSPECTION EE0000727 10004 NATIONAL BLVD LOS ANGELES FA0245224 CA 90034 OW0240313 LIQUOR CLUB INC THE SPOT GRILL INACTIVE PR0193589
RESTAURANT (0-30) SEATS LOW RISK 1630 DASJI4LUR 2017-12-29 STREET CHURROS 93 A 1 ROUTINE INSPECTION EE0000835 6801 HOLLYWOOD BLVD # 253 LOS ANGELES FA0224109 CA 90028 OW0228670 STREETCHURROS, INC STREET CHURROS ACTIVE PR0179282
RESTAURANT (0-30) SEATS MODERATE RISK 1631 DAJ00E07B 2017-12-29 HABITAT COFFEE SHOP 95 A 1 ROUTINE INSPECTION EE0000923 3708 N EAGLE ROCK BLVD LOS ANGELES FA0170465 CA 90065 OW0178123 GLASSELL COFFEE SHOP LLC HABITAT COFFEE SHOP ACTIVE PR0160774
RESTAURANT (151 + ) SEATS HIGH RISK 1641 DA85ZLP32 2017-12-28 THE FOUNDATION CENTER 97 A 1 ROUTINE INSPECTION EE0000224 11633 S WESTERN AVE LOS ANGELES FA0165629 CA 90047 OW0128872 NOAH’S DWELLING, LLC THE FOUNDATION CENTER ACTIVE PR0153961
RESTAURANT (151 + ) SEATS LOW RISK 1639 DA1021441 2017-12-26 WESTIN LAKEVIEW BISTRO 94 A 1 ROUTINE INSPECTION EE0000721 404 S FIGUEROA ST LBBY LOBBY LOS ANGELES FA0073897 CA 90071 OW0016394 HOTEL BONAVENTURE LTD PARTNE WESTIN LOBBY COURT ACTIVE PR0010971
RESTAURANT (151 + ) SEATS MODERATE RISK 1640 DADCYUBZ0 2017-12-28 SPECIALTY’S CAFE & BAKERY 94 A 1 ROUTINE INSPECTION EE0000924 2121 AVE OF STARS STE #100 LOS ANGELES FA0033907 CA 90067 OW0033774 SPECIALTY’S CAFE & BAKERY, INC. SPECIALTY’S CAFE & BAKERY ACTIVE PR0018808
RESTAURANT (31-60) SEATS HIGH RISK 1635 DAHKJFBMR 2017-12-28 UPSTAIRS 95 A 1 ROUTINE INSPECTION EE0000581 3707 N CAHUENGA BLVD LOS ANGELES FA0137987 CA 91604 OW0101400 CLARITY MANAGEMENT, INC. UPSTAIRS ACTIVE PR0122211
RESTAURANT (31-60) SEATS LOW RISK 1633 DAQMQMDOR 2017-12-26 ADMIRALS CLUB 99 A 1 ROUTINE INSPECTION EE0000122 400 WORLD WAY LOS ANGELES FA0004930 CA 90045 OW0125020 AMERICAN AIRLINES, INC. ADMIRALS CLUB CHAMPAGNE BAR ACTIVE PR0201945
RESTAURANT (31-60) SEATS MODERATE RISK 1634 DAQKIE2OT 2017-12-27 SUBWAY -52105 95 A 1 ROUTINE INSPECTION EE0000721 255 S GRAND AVE STE 101 LOS ANGELES FA0034746 CA 90012 OW0005591 C G INVESTMENTS INC SUBWAY -52105 ACTIVE PR0017785
RESTAURANT (61-150) SEATS HIGH RISK 1638 DAKFCHD0L 2017-12-29 LE PAIN QUOTIDIEN 93 A 1 ROUTINE INSPECTION EE0000629 13050 SAN VICENTE BLVD STE 114 LOS ANGELES FA0034788 CA 90049 OW0028928 PQ SAN VICENTE INC. LE PAIN QUOTIDIEN ACTIVE PR0017456
RESTAURANT (61-150) SEATS LOW RISK 1636 DAWVA0CY3 2017-12-29 RIO GENTLEMANS CLUB 93 A 1 ROUTINE INSPECTION EE0000958 13124 S FIGUEROA ST LOS ANGELES FA0046462 CA 90064 OW0011830 FIGUEROA GROUP INC RIO GENTLEMANS CLUB ACTIVE PR0044776
RESTAURANT (61-150) SEATS MODERATE RISK 1637 DAF0GFYS4 2017-12-28 DELTA SKY CLUB T3 93 A 1 ROUTINE INSPECTION EE0000633 300 WORLD WAY LOS ANGELES FA0248072 CA 90045 OW0242509 DELTA AIR LINES, INC. DELTA SKY CLUB T3 ACTIVE PR0196711

Conclusion

The ability to extract classification schema from a larger data set is a frequently required recipe. Outside of data science, one can easily do this process using Excel and a pivot table to get distinct values for a single column. It is even possible though non-trivial to obtain the unique 2-tuples. However, it is more difficult to get a representative row of an entire dataset in Excel using the pivot table approach for getting distinct values. Thus, the tidyverse recipe approach clearly shows more flexibility and power rather quickly.