These examples will demonstrate the use of Tidyverse inner and outer (left, right, full) joins. The dataset I used for the demonstration is a partial download of French deaths by département (state) for the year 2019.1
# Load the départements dataset, converting the département code to int
df_dept <- read.csv("https://raw.githubusercontent.com/mmippolito/cuny/main/data607/tidyverse/departements.csv", sep = ';')
df_dept$departement_code <- as.integer(df_dept$departement_code)
## Warning: NAs introduced by coercion
head(df_dept) %>% reactable
# Load the decedents dataset
# For illustrative purposes, I filtered the table to only deaths of people
# aged 103 or older and who died on 1/1/2019. This give us 8 rows:
df_filt <- read.csv("https://raw.githubusercontent.com/mmippolito/cuny/main/data607/tidyverse/deces_2019_filt.csv", sep = ';')
df_filt$departement_deces <- as.integer(df_filt$departement_deces)
head(df_filt) %>% reactable
Inner joins only choose rows where the selected keyed field is equal in each table.
# Join the filtered decedents table with the départements table:
df_inner <- df_filt %>%
inner_join(df_dept, by = c("departement_deces" = "departement_code"))
df_inner %>%
reactable()
If there hadn’t been a corresponding departement in the departements table, that row would have been ommitted in the resulting dataset. For example, after deliberately filter out the département de Meurthe-et-Moselle from the departements table; now only 7 rows are shown.
# Filter out one of the départements and run the join again:
df_dept_filt <- df_dept %>% filter(departement != "Meurthe-et-Moselle")
df_inner2 <- df_filt %>%
inner_join(df_dept_filt, by = c("departement_deces" = "departement_code"))
df_inner2 %>%
reactable()
Left outer joins choose all rows from the left-hand table and only the data from the right-hand table where the keyed field is equal.
# Left join the filtered decedents table with the filtered départements table.
# Note how all 8 rows of the left table (decedents) are present, but the first département
# is missing (since we filtered it from the départements table).
df_left <- df_filt %>%
left_join(df_dept_filt, by = c("departement_deces" = "departement_code"))
df_left%>%
reactable()
Right outer joins choose all rows from the right-hand table and only the data from the left-hand table where the keyed field is equal.
# Right join the filtered decedents table with the filtered départements table.
# Now we have all rows from the départements table, but the decendent data is only
# populated in the rows from the 8 rows in the decendents table.
df_right <- df_filt %>%
right_join(df_dept_filt, by = c("departement_deces" = "departement_code"))
df_right %>%
reactable()
Full outer joins choose all rows from both the left-hand and right-hand tables.
# Full join the filtered decedents table with the filtered départements table.
# Note how all 8 rows of the left table (decedents) are present, but the first département
# is missing (since we filtered it from the départements table).
# Note how even the row with the missing département is displayed.
df_full <- df_filt %>%
full_join(df_dept_filt, by = c("departement_deces" = "departement_code"))
df_full%>%
reactable()
The fuzzyjoin package is a variation on dplyr’s join operations that allows matching not just on values that match between columns, but on inexact matching. This allows matching on:
Numeric values that are within some tolerance (difference_inner_join)
Strings that are similar in Levenshtein/cosine/Jaccard distance, or other metrics from the stringdist package (stringdist_inner_join)
A regular expression in one column matching to another (regex_inner_join)
Euclidean or Manhattan distance across multiple columns (distance_inner_join) - Geographic distance based on longitude and latitude (geo_inner_join)
Intervals of (start, end) that overlap (interval_inner_join)
Genomic intervals, which include both a chromosome ID and (start, end) pairs, that overlap (genome_inner_join)
One relevant use case is for classifying freeform text data (such as survey responses) against a finite set of options.2
I will be scraping the following table from Wikipedia3. I will use this data to do a fuzzyjoin with Michael’s dataset.
url <- "https://en.wikipedia.org/wiki/List_of_French_departments_by_population"
frenchDepts <- read_html(url) %>%
html_nodes(xpath="/html/body/div[3]/div[3]/div[5]/div[1]/table") %>%
html_table()
frenchDepts <- frenchDepts[[1]]
frenchDepts %>%
reactable()
I will now perform an fuzzy inner join on the department columns. Note even though the spellings of the Departments are not consistent, matches are made.
N.B. The stringdist function by default uses the Optimal String Alignment distance between the columns upon which we want to join, and will join where that distance is less than 2. These are default parameters, the distance function used and the maximum acceptable distance can be set as needed. Also by default, an inner join is performed, other options are avialble using the mode parameter.
strdist_join <- frenchDepts %>%
stringdist_join(df_dept, by=c("Department" = "departement"))
strdist_join %>%
reactable()
After performing the inner join I will perform an anti-join to see what did not match.
frenchDepts %>%
stringdist_anti_join(df_dept, by=c("Department" = "departement")) %>%
reactable
I will now show the general fuzzy join function. I will feed it a custom matching function. This code performs the same task as the stringdist_inner_join, this is just to illustrate the custom matching function
match_fun <- function(v1,v2){
v1 <- str_to_lower(v1)
v2 <- str_to_lower(v2)
stringdist::stringdist(v1,v2) <= 2
}
frenchDepts %>%
fuzzy_join(df_dept, by=c("Department" = "departement"), match_fun = match_fun) %>%
reactable()
I will now show the general fuzzy anti join function.
match_fun <- function(v1,v2){
v1 <- str_to_lower(v1)
v2 <- str_to_lower(v2)
stringdist::stringdist(v1,v2) <= 2
}
frenchDepts %>%
fuzzy_anti_join(df_dept, by=c("Department" = "departement"), match_fun = match_fun) %>%
reactable()