library(tidyverse)
library(fuzzyjoin)
This vignette will introduce the fuzzyjoin package, which enables joining two datasets based on imperfect matches. This makes it very helpful for combining data that lack a unique key.
We will use data related to candidates running in the 2022 election for the House of Representatives. Specifically, we’ll aim to join data from the Federal Election Commission (FEC) with data scraped from the Ballotpedia website. While the FEC data has a unique key (the FEC ID) for each candidate, the Ballotpedia data does not. So, we will need to rely primarily on candidate names to perform matches.
Candidate names, however, are not consistent across the two sources. As we will see, names sometimes include middle names and suffixes (e.g. Jr.) but other times do not. Some candidates are listed under a nickname (Bill versus William) or initials (Jonathan Taylor versus JT). As a result, using names as a unique key would fail to capture many matches. If the names are relatively close, however, we can use fuzzy joins to get the job done!
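To make the idea concrete, here is a minimal toy sketch. The two tibbles below are invented purely for illustration (they are not drawn from the election data): an exact join only matches the identical name, while a fuzzy join also pairs the nickname with the full name.
# Invented example: the same person appears as 'William' in one source and 'Bill' in the other
roster_a <- tibble(name = c('William Brown', 'Maria Lopez'))
roster_b <- tibble(name = c('Bill Brown', 'Maria Lopez'), party = c('R', 'D'))
# An exact join matches only the identical name
inner_join(roster_a, roster_b, by = 'name')
# A fuzzy join also pairs 'William Brown' with 'Bill Brown'
# (Levenshtein distance 4: one substitution plus three deletions)
stringdist_inner_join(roster_a, roster_b, by = 'name', method = 'lv', max_dist = 4)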
Our first dataset comes from the FEC API (https://api.open.fec.gov/developers/), listing all federally registered House candidates in the 2022 election cycle. I prepared the data ahead of time for simplicity, so we can just read it in here.
fec <- read_csv('input_data/fec_house_candidates_2022.csv')
## Rows: 3676 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): fec_id, name, state, district, party, office, incumbent_challenge, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(fec)
## Rows: 3,676
## Columns: 8
## $ fec_id <chr> "H2CO07170", "H2UT03280", "H2CA30291", "H2MN07162"…
## $ name <chr> "AADLAND, ERIK", "AALDERS, TIM", "AAZAMI, SHERVIN"…
## $ state <chr> "CO", "UT", "CA", "MN", "NY", "MN", "OH", "CA", "N…
## $ district <chr> "07", "03", "32", "07", "12", "05", "09", "50", "0…
## $ party <chr> "REPUBLICAN PARTY", "REPUBLICAN PARTY", "DEMOCRATI…
## $ office <chr> "House", "House", "House", "House", "House", "Hous…
## $ incumbent_challenge <chr> "Open seat", "Challenger", "Challenger", "Challeng…
## $ candidate_status <chr> "P", "C", "N", "P", "P", "P", "N", "C", "N", "N", …
Our second dataset comes from Ballotpedia, a nonprofit online political encyclopedia. As above, I prepared the data ahead of time (see https://github.com/kac624/cuny/blob/main/D607/ballotpediaScrape.ipynb), so we can simply read it in here.
ballotpedia <- read_csv('input_data/ballotpedia_scrape2022.csv')
## Rows: 2289 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): name, party, state, last_election, twitter, ballotpedia
## dbl (1): district
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(ballotpedia)
## Rows: 2,289
## Columns: 7
## $ name <chr> "Tony Gonzales", "Willie Montague", "Michelle Munroe", "…
## $ party <chr> "Republican Party", "Republican Party", "Democratic Part…
## $ state <chr> "Texas", "Florida", "Georgia", "Ohio", "Mississippi", "S…
## $ district <dbl> 23, 10, 1, 8, 4, 5, 13, 2, 5, 8, 8, 4, 46, 5, 14, 1, 32,…
## $ last_election <chr> "8-Nov-22", "23-Aug-22", "24-May-22", "3-May-22", "7-Jun…
## $ twitter <chr> "['https://www.twitter.com/RepTonyGonzales', 'https://ww…
## $ ballotpedia <chr> "https://ballotpedia.org/Tony_Gonzales", "https://ballot…
We can already see some significant differences between the names across the two datasets. Let’s look at one specific example to illustrate these differences.
fec %>%
filter(str_detect(name, 'ELLZEY')) %>%
select(name, state, district, party)
## # A tibble: 1 × 4
## name state district party
## <chr> <chr> <chr> <chr>
## 1 ELLZEY, JOHN KEVIN SR. TX 06 REPUBLICAN PARTY
ballotpedia %>%
filter(str_detect(name, 'Ellzey')) %>%
select(name, state, district, party)
## # A tibble: 1 × 4
## name state district party
## <chr> <chr> <dbl> <chr>
## 1 Jake Ellzey Texas 6 Republican Party
In the FEC data, the Republican candidate for the 6th district of Texas is listed as John Kevin Ellzey Sr., but in the Ballotpedia data, he is listed as Jake Ellzey. We can use the party, state and district fields to guide our join, but ultimately, we’ll need to rely on imperfect name matches to connect these two datasets.
First, we’ll do some cleaning on the state and district fields of the Ballotpedia data so that it matches the format of the FEC data. We’ll also remove duplicate rows in the Ballotpedia data.
ballotpedia <- ballotpedia[!duplicated(ballotpedia),] %>%
mutate(district = if_else(is.na(district), '00', sprintf("%02d", district)),
state = str_replace_all(state, '_', ' '))
head(ballotpedia)
## # A tibble: 6 × 7
## name party state distr…¹ last_…² twitter ballo…³
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Tony Gonzales Republican Party Texas 23 8-Nov-… ['http… https:…
## 2 Willie Montague Republican Party Florida 10 23-Aug… ['http… https:…
## 3 Michelle Munroe Democratic Party Georgia 01 24-May… ['http… https:…
## 4 Phil Heimlich Republican Party Ohio 08 3-May-… ['http… https:…
## 5 Carl Boyanton Republican Party Mississippi 04 7-Jun-… ['http… https:…
## 6 Larry Gaither Green Party South Caroli… 05 8-Nov-… [] https:…
## # … with abbreviated variable names ¹district, ²last_election, ³ballotpedia
We’ll do some additional cleaning to get the party and name columns as close as possible. For the party column, we’ll simplify things by listing party as one of three options: Republican, Democrat or Other. For the name column, we’ll actually create a new column (named key) with all lowercase letters and a number of “noisy” items removed (punctuation, suffixes, prefixes and duplicate spaces). Removing this noise will support more accurate matches in the fuzzy join.
fec <- fec %>%
  mutate(# collapse party into three simple buckets
         party_simple = case_when(party == 'REPUBLICAN PARTY' ~ 'Republican',
                                  party == 'DEMOCRATIC PARTY' ~ 'Democrat',
                                  TRUE ~ 'Other'),
         # reorder "LAST, FIRST ..." names into "FIRST ... LAST"
         key = case_when(!str_detect(name, ',') ~ name,
                         TRUE ~ str_c(str_extract(name, '(?<=, ).*'), ' ',
                                      str_extract(name, '.*(?=,)'))) %>%
           # strip titles, suffixes, stray initials, accents and digits
           str_remove_all(str_c(
             '(MS\\.|MRS\\.|MR\\.|DR\\.|JR\\.|JR|SR\\.|III|II|',
             '\\b[A-Z]\\.\\b|\\b[A-Z]\\b|É|[0-9])')) %>%
           # drop punctuation, collapse double spaces, trim and lowercase
           str_remove_all('[[:punct:]]') %>%
           str_replace_all('  ', ' ') %>%
           trimws() %>%
           tolower(),
         # keep only the first and last word as the matching key
         key = str_c(str_extract(key, '^[A-Za-z]+'),
                     str_extract(key, ' [A-Za-z]+$')))
ballotpedia <- ballotpedia %>%
  mutate(# convert full state names to two-letter abbreviations
         state = state.abb[match(state, state.name)],
         party_simple = case_when(party == 'Republican Party' ~ 'Republican',
                                  party == 'Democratic Party' ~ 'Democrat',
                                  TRUE ~ 'Other'),
         # strip suffixes, initials and accents, then drop punctuation,
         # collapse double spaces, trim and lowercase
         key = str_remove_all(name, '(III|II|Jr\\.|Sr\\.|\\b[A-Z]\\.\\b|é)') %>%
           str_remove_all('[[:punct:]]') %>%
           str_replace_all('  ', ' ') %>%
           trimws() %>%
           tolower(),
         # keep only the first and last word as the matching key
         key = str_c(str_extract(key, '^[A-Za-z]+'),
                     str_extract(key, ' [A-Za-z]+$')))
fec[is.na(fec)] <- ''
ballotpedia[is.na(ballotpedia)] <- ''
fec$key[1:5]
## [1] "erik aadland" "tim aalders" "shervin aazami" "jill abahsain"
## [5] "rana abdelhamid"
ballotpedia$key[1:5]
## [1] "tony gonzales" "willie montague" "michelle munroe" "phil heimlich"
## [5] "carl boyanton"
With our names cleaned, we can now try our first fuzzy join! While the fuzzyjoin package offers a number of functions, we’ll focus on the stringdist_join function, which uses imperfect string matches. For this example, we want to compare the number of matches using different fuzzy parameters, so we’ll set the join mode to inner (this parameter takes the typical join types, such as left, right and full).
The most critical arguments are method and max_dist. The method argument determines the mathematical approach for identifying matches. The various options are all detailed in the documentation for the stringdist package (see https://search.r-project.org/CRAN/refmans/stringdist/html/stringdist-metrics.html). We’ll use the lv option for now, which corresponds to Levenshtein distance, a function that counts the number of deletions, insertions and substitutions necessary to turn string b into string a.
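To get a feel for what the Levenshtein metric measures, we can call the underlying stringdist package directly (the same package fuzzyjoin relies on for these distances). Below is a quick illustrative sketch using the two cleaned keys from the Ellzey example above, which come out to roughly 'john ellzey' and 'jake ellzey' after the cleaning step.
library(stringdist)
# Three substitutions ('o' -> 'a', 'h' -> 'k', 'n' -> 'e') turn 'john' into 'jake',
# so the Levenshtein distance between these two keys is 3
stringdist('john ellzey', 'jake ellzey', method = 'lv')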
The max_dist argument serves as a threshold for how fuzzy we want our matches to be. A higher max_dist means that more dissimilar strings will still count as matches. Conversely, a lower max_dist will result in only very similar strings counting as matches. We’ll set max_dist to 3 for now. We also add the distance_col argument to create a new column that records the distance between the two matched strings.
fuzzy1 <- fec %>%
stringdist_join(
select(ballotpedia, name, key),
by = 'key',
mode = 'inner',
method = 'lv',
max_dist = 3,
ignore_case = TRUE,
distance_col = 'distance')
nrow(fuzzy1)
## [1] 2915
fuzzy1 %>%
filter(distance > 2) %>%
select(name.x, key.x, name.y, key.y) %>%
arrange(key.x) %>%
head(10)
## # A tibble: 10 × 4
## name.x key.x name.y key.y
## <chr> <chr> <chr> <chr>
## 1 JAY, AARON MR aaron jay Aaron Bean aaron bean
## 2 LEE, AARON PAYTON aaron lee Aaron Bean aaron bean
## 3 SIMS, AARON MICHAEL aaron sims Alan Sims alan sims
## 4 SMITH, AARON aaron smith Adrian Smith adrian smith
## 5 SMITH, AARON aaron smith Ian Smith ian smith
## 6 MARTIN, ADAM adam martin Adam Morgan adam morgan
## 7 MORGAN, ADAM adam morgan Adam Martin adam martin
## 8 ORTIZ, ADAM adam ortiz Adam Martin adam martin
## 9 SMITH, D. ADAM adam smith D. Adam Smith d smith
## 10 SMITH, D. ADAM adam smith Adrian Smith adrian smith
We see nearly 3000 matches, which significantly exceeds the total number of observations in the Ballotpedia data. Moreover, when we view the matches with the greatest distance (3), we see some erroneous matches. Aaron Bean has matched with both Aaron Jay and Aaron Lee.
So, let’s try again, but with a lower threshold, setting max_dist to 1.
fuzzy2 <- fec %>%
stringdist_join(
select(ballotpedia, name, key),
by = 'key',
mode = 'inner',
method = 'lv',
max_dist = 1,
ignore_case = TRUE,
distance_col = 'distance')
nrow(fuzzy2)
## [1] 2002
fuzzy2 %>%
select(name.x, key.x, name.y, key.y) %>%
arrange(key.x) %>%
head(10)
## # A tibble: 10 × 4
## name.x key.x name.y key.y
## <chr> <chr> <chr> <chr>
## 1 RHODES, AARIKA SAMONE aarika rhodes Aarika Rhodes aarika rhodes
## 2 BEAN, AARON P. aaron bean Aaron Bean aaron bean
## 3 CALKINS, AARON (A.J.) aaron calkins Aaron Calkins aaron calkins
## 4 HEINEMAN, AARON aaron heineman Aaron Heineman aaron heineman
## 5 STORER, AARON aaron storer Aaron Storer aaron storer
## 6 MULUGHETA, ABEL abel mulugheta Abel Mulugheta abel mulugheta
## 7 GARAPATI, ABHIRAM abhiram garapati Abhiram Garapati abhiram garapati
## 8 SPANBERGER, ABIGAIL abigail spanberger Abigail Spanberger abigail spanberg…
## 9 FRISCH, ADAM adam frisch Adam Frisch adam frisch
## 10 GRAY, ADAM C. adam gray Adam Gray adam gray
With a lower threshold, these matches appear more accurate. However, we also fail to match a number of candidates: our joined dataframe has only ~2000 rows, compared to 2289 in the Ballotpedia dataset.
Perhaps there is a middle ground? We can increase the threshold again, but this time use “helper columns” to get rid of erroneous matches. Specifically, we can use the state, district and party columns to further refine our join. Once the join is complete, we then filter for those rows in which these three columns match.
fuzzy3 <- fec %>%
stringdist_join(
select(ballotpedia, name, key, state, district, party_simple),
by = 'key',
mode = 'inner',
method = 'lv',
max_dist = 3,
ignore_case = TRUE,
distance_col = 'distance') %>%
filter(state.x == state.y,
district.x == district.y,
party_simple.x == party_simple.y)
nrow(fuzzy3)
## [1] 2005
fuzzy3 %>%
filter(distance > 2) %>%
select(name.x, key.x, name.y, key.y) %>%
arrange(key.x) %>%
head(10)
## # A tibble: 10 × 4
## name.x key.x name.y key.y
## <chr> <chr> <chr> <chr>
## 1 SMITH, D. ADAM adam smith D. Adam Smith d smith
## 2 THAKUR, ALBERT 'BURT' albert thakur Burt Thakur burt thakur
## 3 HORNING, ANDREW andrew horning Andy Horning andy horning
## 4 HORNING, ANDREW M. andrew horning Andy Horning andy horning
## 5 CRAIG, ANGELA DAWN angela craig Angie Craig angie craig
## 6 NORMOYLE, ANGELA H. angela normoyle Angie Normoyle angie normoyle
## 7 COWDEN, ANTHONY 'TONY' anthony cowden Tony Cowden tony cowden
## 8 VARGAS, ANTHONY 'TONY' anthony vargas Tony Vargas tony vargas
## 9 PETERSON, ARTHUR LAVERNE DR arthur peterson Art Peterson art peterson
## 10 GOLDBERG, BARTLEY F. bartley goldberg Bart Goldberg bart goldberg
We see now that even our most distant matches appear accurate. The higher threshold allows names that diverge significantly in formatting to still be matched. However, we still have only ~2000 matches, so we’re missing just under 300 candidates. What if we increase the threshold further, while still using our helper columns?
fuzzy4 <- fec %>%
stringdist_join(
select(ballotpedia, name, key, state, district, party_simple),
by = 'key',
mode = 'inner',
method = 'lv',
max_dist = 6,
ignore_case = TRUE,
distance_col = 'distance') %>%
filter(state.x == state.y,
district.x == district.y,
party_simple.x == party_simple.y)
nrow(fuzzy4)
## [1] 2198
fuzzy4 %>%
filter(distance > 4) %>%
select(name.x, key.x, name.y, key.y) %>%
arrange(key.x) %>%
head(10)
## # A tibble: 10 × 4
## name.x key.x name.y key.y
## <chr> <chr> <chr> <chr>
## 1 MAKROM, AHMED MATTHEW ahmed makrom Matt Makrom matt makrom
## 2 COHN, ALAN alan cohn Gavin Brown gavin brown
## 3 FERGUSON, ANDERSON DREW IV anderson ferguson Drew Ferguson drew ferguson
## 4 LOUDERBACK, ANDREW JOHN andrew louderback A.J. Louderback j louderback
## 5 PHELPS, ANDREW H. andrew phelps Andrew Halcro andrew halcro
## 6 JOHNSON, ARTHUR WAYNE arthur johnson A. Wayne Johnson a johnson
## 7 ARENHOLZ, ASHLEY HINSON ashley arenholz Ashley Hinson ashley hinson
## 8 BEARDSLEY, BENJAMIN RYAN benjamin beardsley Ben Beardsley ben beardsley
## 9 SMILOWITZ, BENJAMIN benjamin smilowitz Ben Smilowitz ben smilowitz
## 10 BEAL, BRIAN T. brian beal Tim Beck tim beck
Finally, we’ve gotten very close to a full match for all 2289 candidates in the Ballotpedia data. Moreover, even our most distant matches (those above 4) appear accurate. We do, however, see one match that appears inaccurate: Brian Beal has matched with Tim Beck.
We could continue to tweak our join criteria to refine this join, but ultimately, we need to aim for a “sweet spot” where we feel comfortable with the accuracy of our matches without losing meaningful data. That requires iteration, detailed review of results, and some degree of judgment.
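One simple aid to that review (a sketch, not part of the original analysis) is to list the Ballotpedia candidates that still have no match after the last join. Assuming the column suffixes produced by the join above (name.y holds the Ballotpedia name), an anti_join surfaces the remaining gaps.
# Ballotpedia candidates with no FEC match in fuzzy4
unmatched <- ballotpedia %>%
  anti_join(fuzzy4, by = c('name' = 'name.y'))
nrow(unmatched)
unmatched %>% select(name, state, district, party)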
I hope this vignette has demonstrated the value of the fuzzyjoin package, especially when dealing with unstructured or scraped data. The package offers many more functions, so opportunities abound for extension of this vignette!
Building on Keith’s example with the fuzzyjoin library, I will further showcase what this package can do using the geo_left_join() function, which joins datasets on geo-location data and computes the distance between matched points.
The data for this example comes from MTA subway station locations, which provide latitude and longitude for every subway exit along with its subway line. In addition, I will use data on NYC’s public acute care hospitals, which also provides latitude and longitude information.
The goal is to determine which subway lines are at most one mile away from each hospital.
Here, the MTA data is imported and cleaned up to be joined later on with the NYC hospital data.
library(janitor)
mta_df <- read_csv('input_data/mta_subway_stations.csv') |>
clean_names()
station_locations <-
mta_df |>
distinct(route1, station_latitude, station_longitude) |>
rename(train_route = route1, latitude = station_latitude, longitude = station_longitude)
Next, we clean the hospital data so that it can be joined with the MTA data.
hhc_df <- read_csv('input_data/nyc_hhc.csv') |>
  clean_names()
hhc_locations <-
  hhc_df |>
  filter(facility_type == 'Acute Care Hospital') |>
  # pull the "(lat, long)" portion out of location_1 and convert it to numeric columns
  mutate(lat_long = str_split_i(location_1, "\\(", 2),
         latitude = as.numeric(str_split_i(lat_long, "\\,", 1)),
         longitude = str_split_i(lat_long, "\\,", 2),
         longitude = as.numeric(str_remove(longitude, "\\)"))) |>
  select(facility_name, latitude, longitude)
We can use fuzzyjoin’s geo_left_join() function to combine both datasets, setting a maximum distance of one mile between a hospital and a subway exit. It is important to note that each subway line has multiple exits, and each exit may be a different distance from a given hospital. The merged data is therefore filtered to keep, for each line, the exit closest to the hospital.
station_hhc <-
  hhc_locations |>
  # join each hospital to all station exits within one mile
  geo_left_join(station_locations, max_dist = 1, distance_col = 'dist_to_hosp')
station_hhc_short <-
  station_hhc |>
  # for each hospital / line pair, keep only the closest exit
  group_by(facility_name, train_route) |>
  filter(dist_to_hosp == min(dist_to_hosp)) |>
  mutate(dist_to_hosp = round(dist_to_hosp, 2)) |>
  ungroup()
station_hhc_short <-
  station_hhc_short |>
  # collect all nearby lines into a single label per hospital
  group_by(facility_name) |>
  arrange(facility_name, train_route) |>
  mutate(all_trains = str_c(train_route, collapse = ", ")) |>
  ungroup()
Lastly, let’s plot an interactive map with the public hospitals as black circles and the subway exits as orange circles. Clicking a hospital marker shows which trains are within a one-mile radius, and clicking a subway exit marker shows the hospital and the distance to it.
library(leaflet)
leaflet(station_hhc_short) |>
addTiles() |>
addCircles(lng = ~longitude.x,
lat = ~latitude.x,
radius = 30,
weight = 1,
color = 'Black',
fillColor = 'Black',
fillOpacity = 0.05,
popup = paste("<b>Hospital: </b>", station_hhc_short$facility_name, "<br>",
"<b>Subway Lines: </b>", station_hhc_short$all_trains)
) |>
addCircles(lng = ~longitude.y,
lat = ~latitude.y,
radius = 15,
weight = 1,
color = '#c85200',
fillColor = '#c85200',
popup = paste("<b>Subway Line: </b>", station_hhc_short$train_route, "<br>",
"<b>Hospital: </b>", station_hhc_short$facility_name, "<br>",
"<b>Distance: </b>", station_hhc_short$dist_to_hosp, "mi.")
) |>
addProviderTiles("CartoDB.Positron") |>
setView(-73.98, 40.75, zoom = 10)
As this example shows, fuzzyjoin is not limited to matching similar names; it can also join datasets based on geo-location.