library(tidyverse)
library(knitr)
library(kableExtra)
Often a classification schema is implicitly used and displayed in a large dataset, but the classification schema itself is unavailable in an easy-to-use form. This tidyverse recipe illustrates how to extract the classification data from a large dataset, identify all unique observed values and returned those unique values in a dataframe. In effect, we will use the dplyr package to perform these basic transformations.
A basic example is the set of restaurant health inspections collected for all restaurants in the City of Los Angeles by the Los Angeles Environmental Health Department. The department publishes this dataset as a csv file to Kaggle. The file is large and contains 58,872 rows and 20 columns. Moreover, file is approximately 14MB in size. This file will represent our large dataset for this recipe.
Moreover, certain columns are repetitive and clearly important for underlying the scope and complexity of the inspection assessment. These include pe_description (which means program element description) and program_element_pe (a 4 digit numeric code).
Examples of the pe_description field values are:
Examples of the program_element_pe values are:
We decide to select two columns of the data file: pe_description and program_element_pe to define our classification schema.
The tidyverse recipe will then seek to report all unique 2-tuples of (pe_description, program_element_pe) in the data set.
We will also repeat the recipe to seek all unique 1-tuples of distinct (pe_description) and (program_element_pe)
We can therefore answer three questions with this recipe:
How many distinct pe_description values exist in the LA County Environmental Health classification schema?
For a given pe_description, do all restaurants get assigned the same program_element_pe during their inspection?
What are the distinct values for program_element_pe?
inspections = read_csv("https://raw.githubusercontent.com/completegraph/DataStore/master/Tidyverse/restaurant-and-market-health-inspections.csv")
str(inspections)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 58872 obs. of 20 variables:
## $ serial_number : chr "DAJ00E07B" "DAQOKRFZB" "DASJI4LUR" "DAWVA0CY3" ...
## $ activity_date : POSIXct, format: "2017-12-29" "2017-12-29" ...
## $ facility_name : chr "HABITAT COFFEE SHOP" "REILLY'S" "STREET CHURROS" "RIO GENTLEMANS CLUB" ...
## $ score : num 95 92 93 93 93 94 96 94 93 95 ...
## $ grade : chr "A" "A" "A" "A" ...
## $ service_code : num 1 1 1 1 1 1 1 1 1 1 ...
## $ service_description: chr "ROUTINE INSPECTION" "ROUTINE INSPECTION" "ROUTINE INSPECTION" "ROUTINE INSPECTION" ...
## $ employee_id : chr "EE0000923" "EE0000633" "EE0000835" "EE0000958" ...
## $ facility_address : chr "3708 N EAGLE ROCK BLVD" "100 WORLD WAY # 120" "6801 HOLLYWOOD BLVD # 253" "13124 S FIGUEROA ST" ...
## $ facility_city : chr "LOS ANGELES" "LOS ANGELES" "LOS ANGELES" "LOS ANGELES" ...
## $ facility_id : chr "FA0170465" "FA0244690" "FA0224109" "FA0046462" ...
## $ facility_state : chr "CA" "CA" "CA" "CA" ...
## $ facility_zip : chr "90065" "90045" "90028" "90064" ...
## $ owner_id : chr "OW0178123" "OW0208441" "OW0228670" "OW0011830" ...
## $ owner_name : chr "GLASSELL COFFEE SHOP LLC" "AREAS SKYVIEW LAX JV, LLC" "STREETCHURROS, INC" "FIGUEROA GROUP INC" ...
## $ pe_description : chr "RESTAURANT (0-30) SEATS MODERATE RISK" "RESTAURANT (0-30) SEATS MODERATE RISK" "RESTAURANT (0-30) SEATS LOW RISK" "RESTAURANT (61-150) SEATS LOW RISK" ...
## $ program_element_pe : num 1631 1631 1630 1636 1638 ...
## $ program_name : chr "HABITAT COFFEE SHOP" "REILLY'S" "STREET CHURROS" "RIO GENTLEMANS CLUB" ...
## $ program_status : chr "ACTIVE" "ACTIVE" "ACTIVE" "ACTIVE" ...
## $ record_id : chr "PR0160774" "PR0193026" "PR0179282" "PR0044776" ...
## - attr(*, "spec")=
## .. cols(
## .. serial_number = col_character(),
## .. activity_date = col_datetime(format = ""),
## .. facility_name = col_character(),
## .. score = col_double(),
## .. grade = col_character(),
## .. service_code = col_double(),
## .. service_description = col_character(),
## .. employee_id = col_character(),
## .. facility_address = col_character(),
## .. facility_city = col_character(),
## .. facility_id = col_character(),
## .. facility_state = col_character(),
## .. facility_zip = col_character(),
## .. owner_id = col_character(),
## .. owner_name = col_character(),
## .. pe_description = col_character(),
## .. program_element_pe = col_double(),
## .. program_name = col_character(),
## .. program_status = col_character(),
## .. record_id = col_character()
## .. )
The function distinct() accomplishes this goal. There are two ways to apply the function in this example. In the first way, we may choose to extract the subset of columns used to obtain the unique k-tuples. In this case, k=2 or k=1. The second way, a feature of distinct, allows use to obtain the unique k-tuples without extracting the subset of columns of interest. We start by illustrating the first approach.
We illustrate this approach by extracting an intermediate dataframe to which the distinct function is applied. In the code below, we use select to choose the columns of interest and then distinct to get all distinct k-tuples. Lastly, we sort the unique values by some ordering.
inspections %>% select( pe_description, program_element_pe) -> trimmed_dataframe
trimmed_dataframe %>% distinct() %>% arrange(pe_description, program_element_pe) -> distinct_two_tuples
str(distinct_two_tuples)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 18 obs. of 2 variables:
## $ pe_description : chr "FOOD MKT RETAIL (2,000+ SF) HIGH RISK" "FOOD MKT RETAIL (2,000+ SF) LOW RISK" "FOOD MKT RETAIL (2,000+ SF) MODERATE RISK" "FOOD MKT RETAIL (25-1,999 SF) HIGH RISK" ...
## $ program_element_pe: num 1615 1613 1614 1612 1610 ...
distinct_two_tuples %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover")) %>%
scroll_box(width="85%", height="200px")
| pe_description | program_element_pe |
|---|---|
| FOOD MKT RETAIL (2,000+ SF) HIGH RISK | 1615 |
| FOOD MKT RETAIL (2,000+ SF) LOW RISK | 1613 |
| FOOD MKT RETAIL (2,000+ SF) MODERATE RISK | 1614 |
| FOOD MKT RETAIL (25-1,999 SF) HIGH RISK | 1612 |
| FOOD MKT RETAIL (25-1,999 SF) LOW RISK | 1610 |
| FOOD MKT RETAIL (25-1,999 SF) MODERATE RISK | 1611 |
| RESTAURANT (0-30) SEATS HIGH RISK | 1632 |
| RESTAURANT (0-30) SEATS LOW RISK | 1630 |
| RESTAURANT (0-30) SEATS MODERATE RISK | 1631 |
| RESTAURANT (151 + ) SEATS HIGH RISK | 1641 |
| RESTAURANT (151 + ) SEATS LOW RISK | 1639 |
| RESTAURANT (151 + ) SEATS MODERATE RISK | 1640 |
| RESTAURANT (31-60) SEATS HIGH RISK | 1635 |
| RESTAURANT (31-60) SEATS LOW RISK | 1633 |
| RESTAURANT (31-60) SEATS MODERATE RISK | 1634 |
| RESTAURANT (61-150) SEATS HIGH RISK | 1638 |
| RESTAURANT (61-150) SEATS LOW RISK | 1636 |
| RESTAURANT (61-150) SEATS MODERATE RISK | 1637 |
Let’s test whether the two values are assigned in a 1-to-1 correspondence to each other.
inspections %>% select(pe_description) %>% distinct() %>% arrange(pe_description) -> distinct_pe_description
inspections %>% select(program_element_pe ) %>% distinct() %>% arrange(program_element_pe) -> distinct_program_element_pe
nrow(distinct_pe_description)
## [1] 18
nrow(distinct_program_element_pe)
## [1] 18
nrow(distinct_two_tuples)
## [1] 18
We see that both the 2-tuples and 1-tuples are all the same size. This implies that the mapping between the two columns is 1-to-1.
The second approach is feasible because of the clever syntax of the distinct function. distinct allows the identification of unique k-tuples by allowing the user to specify the k column names to use.
We illustrate the avoidance of the helper dataframe below.
inspections %>% distinct( pe_description, program_element_pe) %>%
arrange(pe_description) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
scroll_box(width="90%", height="200px")
| pe_description | program_element_pe |
|---|---|
| FOOD MKT RETAIL (2,000+ SF) HIGH RISK | 1615 |
| FOOD MKT RETAIL (2,000+ SF) LOW RISK | 1613 |
| FOOD MKT RETAIL (2,000+ SF) MODERATE RISK | 1614 |
| FOOD MKT RETAIL (25-1,999 SF) HIGH RISK | 1612 |
| FOOD MKT RETAIL (25-1,999 SF) LOW RISK | 1610 |
| FOOD MKT RETAIL (25-1,999 SF) MODERATE RISK | 1611 |
| RESTAURANT (0-30) SEATS HIGH RISK | 1632 |
| RESTAURANT (0-30) SEATS LOW RISK | 1630 |
| RESTAURANT (0-30) SEATS MODERATE RISK | 1631 |
| RESTAURANT (151 + ) SEATS HIGH RISK | 1641 |
| RESTAURANT (151 + ) SEATS LOW RISK | 1639 |
| RESTAURANT (151 + ) SEATS MODERATE RISK | 1640 |
| RESTAURANT (31-60) SEATS HIGH RISK | 1635 |
| RESTAURANT (31-60) SEATS LOW RISK | 1633 |
| RESTAURANT (31-60) SEATS MODERATE RISK | 1634 |
| RESTAURANT (61-150) SEATS HIGH RISK | 1638 |
| RESTAURANT (61-150) SEATS LOW RISK | 1636 |
| RESTAURANT (61-150) SEATS MODERATE RISK | 1637 |
In addition, if the .keep_all option is set to TRUE, the user can even obtain the first matching entire row of the dataset associated with each unique k-tuple. We illustrate the ability to keep a representative row of the complete set of columns below.
inspections %>% distinct( pe_description, .keep_all=TRUE) %>%
arrange(pe_description) %>%
select(pe_description, program_element_pe, everything()) %>% # place the pe_description column first
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
scroll_box(width="85%", height="200px" )
| pe_description | program_element_pe | serial_number | activity_date | facility_name | score | grade | service_code | service_description | employee_id | facility_address | facility_city | facility_id | facility_state | facility_zip | owner_id | owner_name | program_name | program_status | record_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FOOD MKT RETAIL (2,000+ SF) HIGH RISK | 1615 | DAP9QQTMX | 2017-12-27 | G AND J MARKET | 92 | A | 1 | ROUTINE INSPECTION | EE0000798 | 2045 W MLK BLVD | LOS ANGELES | FA0024244 | CA | 90062 | OW0019978 | KPL PARTNERS, INC. | G AND J MARKET | ACTIVE | PR0024262 |
| FOOD MKT RETAIL (2,000+ SF) LOW RISK | 1613 | DAKSJB0AB | 2017-12-29 | 7 ELEVEN #37215A | 95 | A | 1 | ROUTINE INSPECTION | EE0000839 | 3330 W FLORENCE AVE | LOS ANGELES | FA0243779 | CA | 90043-4706 | OW0239028 | S & U INVESTMENTS, INC. | 7 ELEVEN #37215A | ACTIVE | PR0192029 |
| FOOD MKT RETAIL (2,000+ SF) MODERATE RISK | 1614 | DAI2IWYYV | 2017-12-26 | CIRCLE K | 93 | A | 1 | ROUTINE INSPECTION | EE0000366 | 3360 E OLYMPIC BLVD | LOS ANGELES | FA0243565 | CA | 90023-3724 | OW0238848 | MARKET AVENUE INC. | CIRCLE K | ACTIVE | PR0191776 |
| FOOD MKT RETAIL (25-1,999 SF) HIGH RISK | 1612 | DA0OD6RH9 | 2017-12-27 | ALEX FISH MARKET | 98 | A | 1 | ROUTINE INSPECTION | EE0000798 | 1451 W MARTIN LUTHER KING JR BLVD | LOS ANGELES | FA0003285 | CA | 90062 | OW0020175 | KWON,UNG | ALEX FISH MARKET | ACTIVE | PR0004662 |
| FOOD MKT RETAIL (25-1,999 SF) LOW RISK | 1610 | DAEWQMQV2 | 2017-12-29 | MARCE’S MINI MARKET | 96 | A | 1 | ROUTINE INSPECTION | EE0000707 | 4160 S AVALON BLVD | LOS ANGELES | FA0066361 | CA | 90011 | OW0015364 | HERNANDEZ, ERNESTINA | MARCE’S MINI MARKET | ACTIVE | PR0010572 |
| FOOD MKT RETAIL (25-1,999 SF) MODERATE RISK | 1611 | DA1LIO8WC | 2017-12-28 | LA SIRENA FISH MARKET | 98 | A | 1 | ROUTINE INSPECTION | EE0000366 | 4226 E OLYMPIC BLVD | LOS ANGELES | FA0006810 | CA | 90023 | OW0005654 | MARIA CABRALES | LA SIRENA FISH MARKET | ACTIVE | PR0045798 |
| RESTAURANT (0-30) SEATS HIGH RISK | 1632 | DAMV56BMJ | 2017-12-29 | THE SPOT GRILL | 93 | A | 1 | ROUTINE INSPECTION | EE0000727 | 10004 NATIONAL BLVD | LOS ANGELES | FA0245224 | CA | 90034 | OW0240313 | LIQUOR CLUB INC | THE SPOT GRILL | INACTIVE | PR0193589 |
| RESTAURANT (0-30) SEATS LOW RISK | 1630 | DASJI4LUR | 2017-12-29 | STREET CHURROS | 93 | A | 1 | ROUTINE INSPECTION | EE0000835 | 6801 HOLLYWOOD BLVD # 253 | LOS ANGELES | FA0224109 | CA | 90028 | OW0228670 | STREETCHURROS, INC | STREET CHURROS | ACTIVE | PR0179282 |
| RESTAURANT (0-30) SEATS MODERATE RISK | 1631 | DAJ00E07B | 2017-12-29 | HABITAT COFFEE SHOP | 95 | A | 1 | ROUTINE INSPECTION | EE0000923 | 3708 N EAGLE ROCK BLVD | LOS ANGELES | FA0170465 | CA | 90065 | OW0178123 | GLASSELL COFFEE SHOP LLC | HABITAT COFFEE SHOP | ACTIVE | PR0160774 |
| RESTAURANT (151 + ) SEATS HIGH RISK | 1641 | DA85ZLP32 | 2017-12-28 | THE FOUNDATION CENTER | 97 | A | 1 | ROUTINE INSPECTION | EE0000224 | 11633 S WESTERN AVE | LOS ANGELES | FA0165629 | CA | 90047 | OW0128872 | NOAH’S DWELLING, LLC | THE FOUNDATION CENTER | ACTIVE | PR0153961 |
| RESTAURANT (151 + ) SEATS LOW RISK | 1639 | DA1021441 | 2017-12-26 | WESTIN LAKEVIEW BISTRO | 94 | A | 1 | ROUTINE INSPECTION | EE0000721 | 404 S FIGUEROA ST LBBY LOBBY | LOS ANGELES | FA0073897 | CA | 90071 | OW0016394 | HOTEL BONAVENTURE LTD PARTNE | WESTIN LOBBY COURT | ACTIVE | PR0010971 |
| RESTAURANT (151 + ) SEATS MODERATE RISK | 1640 | DADCYUBZ0 | 2017-12-28 | SPECIALTY’S CAFE & BAKERY | 94 | A | 1 | ROUTINE INSPECTION | EE0000924 | 2121 AVE OF STARS STE #100 | LOS ANGELES | FA0033907 | CA | 90067 | OW0033774 | SPECIALTY’S CAFE & BAKERY, INC. | SPECIALTY’S CAFE & BAKERY | ACTIVE | PR0018808 |
| RESTAURANT (31-60) SEATS HIGH RISK | 1635 | DAHKJFBMR | 2017-12-28 | UPSTAIRS | 95 | A | 1 | ROUTINE INSPECTION | EE0000581 | 3707 N CAHUENGA BLVD | LOS ANGELES | FA0137987 | CA | 91604 | OW0101400 | CLARITY MANAGEMENT, INC. | UPSTAIRS | ACTIVE | PR0122211 |
| RESTAURANT (31-60) SEATS LOW RISK | 1633 | DAQMQMDOR | 2017-12-26 | ADMIRALS CLUB | 99 | A | 1 | ROUTINE INSPECTION | EE0000122 | 400 WORLD WAY | LOS ANGELES | FA0004930 | CA | 90045 | OW0125020 | AMERICAN AIRLINES, INC. | ADMIRALS CLUB CHAMPAGNE BAR | ACTIVE | PR0201945 |
| RESTAURANT (31-60) SEATS MODERATE RISK | 1634 | DAQKIE2OT | 2017-12-27 | SUBWAY -52105 | 95 | A | 1 | ROUTINE INSPECTION | EE0000721 | 255 S GRAND AVE STE 101 | LOS ANGELES | FA0034746 | CA | 90012 | OW0005591 | C G INVESTMENTS INC | SUBWAY -52105 | ACTIVE | PR0017785 |
| RESTAURANT (61-150) SEATS HIGH RISK | 1638 | DAKFCHD0L | 2017-12-29 | LE PAIN QUOTIDIEN | 93 | A | 1 | ROUTINE INSPECTION | EE0000629 | 13050 SAN VICENTE BLVD STE 114 | LOS ANGELES | FA0034788 | CA | 90049 | OW0028928 | PQ SAN VICENTE INC. | LE PAIN QUOTIDIEN | ACTIVE | PR0017456 |
| RESTAURANT (61-150) SEATS LOW RISK | 1636 | DAWVA0CY3 | 2017-12-29 | RIO GENTLEMANS CLUB | 93 | A | 1 | ROUTINE INSPECTION | EE0000958 | 13124 S FIGUEROA ST | LOS ANGELES | FA0046462 | CA | 90064 | OW0011830 | FIGUEROA GROUP INC | RIO GENTLEMANS CLUB | ACTIVE | PR0044776 |
| RESTAURANT (61-150) SEATS MODERATE RISK | 1637 | DAF0GFYS4 | 2017-12-28 | DELTA SKY CLUB T3 | 93 | A | 1 | ROUTINE INSPECTION | EE0000633 | 300 WORLD WAY | LOS ANGELES | FA0248072 | CA | 90045 | OW0242509 | DELTA AIR LINES, INC. | DELTA SKY CLUB T3 | ACTIVE | PR0196711 |
The ability to extract classification schema from a larger data set is a frequently required recipe. Outside of data science, one can easily do this process using Excel and a pivot table to get distinct values for a single column. It is even possible though non-trivial to obtain the unique 2-tuples. However, it is more difficult to get a representative row of an entire dataset in Excel using the pivot table approach for getting distinct values. Thus, the tidyverse recipe approach clearly shows more flexibility and power rather quickly.