library(tidyverse)
library(fuzzyjoin)
This vignette will introduce the fuzzyjoin package, which enables joining two datasets based on imperfect matches. This makes it very helpful for combining data that lack a unique key.
We will use data related to candidates running in the 2022 election for the House of Representatives. Specifically, we’ll aim to join data from the Federal Election Commission (FEC) with data scraped from the Ballotpedia website. While the FEC data has a unique key (the FEC ID) for each candidate, the Ballotpedia data does not. So, we will need to rely primarily on candidate names to perform matches.
Candidate names, however, are not consistent across the two sources. As we will see, names sometimes include middle names and suffixes (e.g. Jr.) but other times do not. Some candidates are listed under a nickname (Bill versus William) or initials (Jonathan Taylor versus JT). As a result, using names as a unique key would fail to capture many matches. If the names are relatively close, however, we can use fuzzy joins to get the job done!
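To make the idea concrete, here is a minimal toy sketch. The two tibbles below are invented purely for illustration (they are not drawn from the election data): an exact join only matches the identical name, while a fuzzy join also pairs the nickname with the full name.
# Invented example: the same person appears as 'William' in one source and 'Bill' in the other
roster_a <- tibble(name = c('William Brown', 'Maria Lopez'))
roster_b <- tibble(name = c('Bill Brown', 'Maria Lopez'), party = c('R', 'D'))
# An exact join matches only the identical name
inner_join(roster_a, roster_b, by = 'name')
# A fuzzy join also pairs 'William Brown' with 'Bill Brown'
# (Levenshtein distance 4: one substitution plus three deletions)
stringdist_inner_join(roster_a, roster_b, by = 'name', method = 'lv', max_dist = 4)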
Our first dataset comes from the FEC API (https://api.open.fec.gov/developers/), listing all federally registered House candidates in the 2022 election cycle. I prepared the data ahead of time for simplicity, so we can just read it in here.
fec <- read_csv('input_data/fec_house_candidates_2022.csv')
## Rows: 3676 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): fec_id, name, state, district, party, office, incumbent_challenge, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(fec)
## Rows: 3,676
## Columns: 8
## $ fec_id <chr> "H2CO07170", "H2UT03280", "H2CA30291", "H2MN07162"…
## $ name <chr> "AADLAND, ERIK", "AALDERS, TIM", "AAZAMI, SHERVIN"…
## $ state <chr> "CO", "UT", "CA", "MN", "NY", "MN", "OH", "CA", "N…
## $ district <chr> "07", "03", "32", "07", "12", "05", "09", "50", "0…
## $ party <chr> "REPUBLICAN PARTY", "REPUBLICAN PARTY", "DEMOCRATI…
## $ office <chr> "House", "House", "House", "House", "House", "Hous…
## $ incumbent_challenge <chr> "Open seat", "Challenger", "Challenger", "Challeng…
## $ candidate_status <chr> "P", "C", "N", "P", "P", "P", "N", "C", "N", "N", …
Our second dataset comes from Ballotpedia, a nonprofit online political encyclopedia. As above, I prepared the data ahead of time (see https://github.com/kac624/cuny/blob/main/D607/ballotpediaScrape.ipynb), so we can simply read it in here.
ballotpedia <- read_csv('input_data/ballotpedia_scrape2022.csv')
## Rows: 2289 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): name, party, state, last_election, twitter, ballotpedia
## dbl (1): district
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(ballotpedia)
## Rows: 2,289
## Columns: 7
## $ name <chr> "Tony Gonzales", "Willie Montague", "Michelle Munroe", "…
## $ party <chr> "Republican Party", "Republican Party", "Democratic Part…
## $ state <chr> "Texas", "Florida", "Georgia", "Ohio", "Mississippi", "S…
## $ district <dbl> 23, 10, 1, 8, 4, 5, 13, 2, 5, 8, 8, 4, 46, 5, 14, 1, 32,…
## $ last_election <chr> "8-Nov-22", "23-Aug-22", "24-May-22", "3-May-22", "7-Jun…
## $ twitter <chr> "['https://www.twitter.com/RepTonyGonzales', 'https://ww…
## $ ballotpedia <chr> "https://ballotpedia.org/Tony_Gonzales", "https://ballot…
We can already see some significant differences between the names across the two datasets. Let’s look at one specific example to illustrate these differences.
fec %>%
filter(str_detect(name, 'ELLZEY')) %>%
select(name, state, district, party)
## # A tibble: 1 × 4
## name state district party
## <chr> <chr> <chr> <chr>
## 1 ELLZEY, JOHN KEVIN SR. TX 06 REPUBLICAN PARTY
ballotpedia %>%
filter(str_detect(name, 'Ellzey')) %>%
select(name, state, district, party)
## # A tibble: 1 × 4
## name state district party
## <chr> <chr> <dbl> <chr>
## 1 Jake Ellzey Texas 6 Republican Party
In the FEC data, the Republican candidate for the 6th district of Texas is listed as John Kevin Ellzey Sr., but in the Ballotpedia data, he is listed as Jake Ellzey. We can use the party, state and district fields to guide our join, but ultimately, we’ll need to rely on imperfect name matches to connect these two datasets.
First, we’ll do some cleaning on the state and district fields of the Ballotpedia data so that it matches the format of the FEC data. We’ll also remove duplicate rows in the Ballotpedia data.
ballotpedia <- ballotpedia[!duplicated(ballotpedia),] %>%
mutate(district = if_else(is.na(district), '00', sprintf("%02d", district)),
state = str_replace_all(state, '_', ' '))
head(ballotpedia)
## # A tibble: 6 × 7
## name party state distr…¹ last_…² twitter ballo…³
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Tony Gonzales Republican Party Texas 23 8-Nov-… ['http… https:…
## 2 Willie Montague Republican Party Florida 10 23-Aug… ['http… https:…
## 3 Michelle Munroe Democratic Party Georgia 01 24-May… ['http… https:…
## 4 Phil Heimlich Republican Party Ohio 08 3-May-… ['http… https:…
## 5 Carl Boyanton Republican Party Mississippi 04 7-Jun-… ['http… https:…
## 6 Larry Gaither Green Party South Caroli… 05 8-Nov-… [] https:…
## # … with abbreviated variable names ¹district, ²last_election, ³ballotpedia
We’ll do some additional cleaning to get the party and name columns as close as possible. For the party column, we’ll simplify things by listing party as one of three options: Republican, Democrat or Other. For the name column, we’ll actually create a new column (named key) with all lowercase letters and a number of “noisy” items removed (punctuation, suffixes, prefixes and duplicate spaces). Removing this noise will support more accurate matches in the fuzzy join.
fec <- fec %>%
  mutate(# collapse party into three simple buckets
         party_simple = case_when(party == 'REPUBLICAN PARTY' ~ 'Republican',
                                  party == 'DEMOCRATIC PARTY' ~ 'Democrat',
                                  TRUE ~ 'Other'),
         # reorder "LAST, FIRST ..." names into "FIRST ... LAST"
         key = case_when(!str_detect(name, ',') ~ name,
                         TRUE ~ str_c(str_extract(name, '(?<=, ).*'), ' ',
                                      str_extract(name, '.*(?=,)'))) %>%
           # strip titles, suffixes, stray initials, accents and digits
           str_remove_all(str_c(
             '(MS\\.|MRS\\.|MR\\.|DR\\.|JR\\.|JR|SR\\.|III|II|',
             '\\b[A-Z]\\.\\b|\\b[A-Z]\\b|É|[0-9])')) %>%
           # drop punctuation, collapse double spaces, trim and lowercase
           str_remove_all('[[:punct:]]') %>%
           str_replace_all('  ', ' ') %>%
           trimws() %>%
           tolower(),
         # keep only the first and last word as the matching key
         key = str_c(str_extract(key, '^[A-Za-z]+'),
                     str_extract(key, ' [A-Za-z]+$')))
ballotpedia <- ballotpedia %>%
  mutate(# convert full state names to two-letter abbreviations
         state = state.abb[match(state, state.name)],
         party_simple = case_when(party == 'Republican Party' ~ 'Republican',
                                  party == 'Democratic Party' ~ 'Democrat',
                                  TRUE ~ 'Other'),
         # strip suffixes, initials and accents, then drop punctuation,
         # collapse double spaces, trim and lowercase
         key = str_remove_all(name, '(III|II|Jr\\.|Sr\\.|\\b[A-Z]\\.\\b|é)') %>%
           str_remove_all('[[:punct:]]') %>%
           str_replace_all('  ', ' ') %>%
           trimws() %>%
           tolower(),
         # keep only the first and last word as the matching key
         key = str_c(str_extract(key, '^[A-Za-z]+'),
                     str_extract(key, ' [A-Za-z]+$')))
fec[is.na(fec)] <- ''
ballotpedia[is.na(ballotpedia)] <- ''
fec$key[1:5]
## [1] "erik aadland" "tim aalders" "shervin aazami" "jill abahsain"
## [5] "rana abdelhamid"
ballotpedia$key[1:5]
## [1] "tony gonzales" "willie montague" "michelle munroe" "phil heimlich"
## [5] "carl boyanton"
With our names cleaned, we can now try our first fuzzy join! While the fuzzyjoin package offers a number of functions, we’ll focus on the stringdist_join function, which uses imperfect string matches. For this example, we want to compare the number of matches using different fuzzy parameters, so we’ll set the join mode to inner (this parameter takes the typical join types, such as left, right and full).
The most critical arguments are method and max_dist. The method argument determines the mathematical approach for identifying matches. The various options are all detailed in the documentation for the stringdist package (see https://search.r-project.org/CRAN/refmans/stringdist/html/stringdist-metrics.html). We’ll use the lv option for now, which corresponds to Levenshtein distance, a function that counts the number of deletions, insertions and substitutions necessary to turn string b into string a.
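To get a feel for what the Levenshtein metric measures, we can call the underlying stringdist package directly (the same package fuzzyjoin relies on for these distances). Below is a quick illustrative sketch using the two cleaned keys from the Ellzey example above, which come out to roughly 'john ellzey' and 'jake ellzey' after the cleaning step.
library(stringdist)
# Three substitutions ('o' -> 'a', 'h' -> 'k', 'n' -> 'e') turn 'john' into 'jake',
# so the Levenshtein distance between these two keys is 3
stringdist('john ellzey', 'jake ellzey', method = 'lv')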
The max_dist argument serves as a threshold for how fuzzy we want our matches to be. A higher max_dist means that more dissimilar strings will still count as matches. Conversely, a lower max_dist will result in only very similar strings counting as matches. We’ll set max_dist to 3 for now. We also add the distance_col argument to create a new column that records the distance between the two matched strings.
fuzzy1 <- fec %>%
stringdist_join(
select(ballotpedia, name, key),
by = 'key',
mode = 'inner',
method = 'lv',
max_dist = 3,
ignore_case = TRUE,
distance_col = 'distance')
nrow(fuzzy1)
## [1] 2915
fuzzy1 %>%
filter(distance > 2) %>%
select(name.x, key.x, name.y, key.y) %>%
arrange(key.x) %>%
head(10)
## # A tibble: 10 × 4
## name.x key.x name.y key.y
## <chr> <chr> <chr> <chr>
## 1 JAY, AARON MR aaron jay Aaron Bean aaron bean
## 2 LEE, AARON PAYTON aaron lee Aaron Bean aaron bean
## 3 SIMS, AARON MICHAEL aaron sims Alan Sims alan sims
## 4 SMITH, AARON aaron smith Adrian Smith adrian smith
## 5 SMITH, AARON aaron smith Ian Smith ian smith
## 6 MARTIN, ADAM adam martin Adam Morgan adam morgan
## 7 MORGAN, ADAM adam morgan Adam Martin adam martin
## 8 ORTIZ, ADAM adam ortiz Adam Martin adam martin
## 9 SMITH, D. ADAM adam smith D. Adam Smith d smith
## 10 SMITH, D. ADAM adam smith Adrian Smith adrian smith
We see nearly 3000 matches, which significantly exceeds the total number of observations in the Ballotpedia data. Moreover, when we view the matches with the greatest distance (3), we see some erroneous matches. Aaron Bean has matched with both Aaron Jay and Aaron Lee.
So, let’s try again, but with a lower threshold, setting max_dist to 1.
fuzzy2 <- fec %>%
stringdist_join(
select(ballotpedia, name, key),
by = 'key',
mode = 'inner',
method = 'lv',
max_dist = 1,
ignore_case = TRUE,
distance_col = 'distance')
nrow(fuzzy2)
## [1] 2002
fuzzy2 %>%
select(name.x, key.x, name.y, key.y) %>%
arrange(key.x) %>%
head(10)
## # A tibble: 10 × 4
## name.x key.x name.y key.y
## <chr> <chr> <chr> <chr>
## 1 RHODES, AARIKA SAMONE aarika rhodes Aarika Rhodes aarika rhodes
## 2 BEAN, AARON P. aaron bean Aaron Bean aaron bean
## 3 CALKINS, AARON (A.J.) aaron calkins Aaron Calkins aaron calkins
## 4 HEINEMAN, AARON aaron heineman Aaron Heineman aaron heineman
## 5 STORER, AARON aaron storer Aaron Storer aaron storer
## 6 MULUGHETA, ABEL abel mulugheta Abel Mulugheta abel mulugheta
## 7 GARAPATI, ABHIRAM abhiram garapati Abhiram Garapati abhiram garapati
## 8 SPANBERGER, ABIGAIL abigail spanberger Abigail Spanberger abigail spanberg…
## 9 FRISCH, ADAM adam frisch Adam Frisch adam frisch
## 10 GRAY, ADAM C. adam gray Adam Gray adam gray
With a lower threshold, these matches appear more accurate. However, we also fail to match a number of candidates: our joined dataframe has only ~2000 rows, compared to 2289 in the Ballotpedia dataset.
Perhaps there is a middle ground? We can increase the threshold again, but this time use “helper columns” to get rid of erroneous matches. Specifically, we can use the state, district and party columns to further refine our join. Once the join is complete, we then filter for those rows in which these three columns match.
fuzzy3 <- fec %>%
stringdist_join(
select(ballotpedia, name, key, state, district, party_simple),
by = 'key',
mode = 'inner',
method = 'lv',
max_dist = 3,
ignore_case = TRUE,
distance_col = 'distance') %>%
filter(state.x == state.y,
district.x == district.y,
party_simple.x == party_simple.y)
nrow(fuzzy3)
## [1] 2005
fuzzy3 %>%
filter(distance > 2) %>%
select(name.x, key.x, name.y, key.y) %>%
arrange(key.x) %>%
head(10)
## # A tibble: 10 × 4
## name.x key.x name.y key.y
## <chr> <chr> <chr> <chr>
## 1 SMITH, D. ADAM adam smith D. Adam Smith d smith
## 2 THAKUR, ALBERT 'BURT' albert thakur Burt Thakur burt thakur
## 3 HORNING, ANDREW andrew horning Andy Horning andy horning
## 4 HORNING, ANDREW M. andrew horning Andy Horning andy horning
## 5 CRAIG, ANGELA DAWN angela craig Angie Craig angie craig
## 6 NORMOYLE, ANGELA H. angela normoyle Angie Normoyle angie normoyle
## 7 COWDEN, ANTHONY 'TONY' anthony cowden Tony Cowden tony cowden
## 8 VARGAS, ANTHONY 'TONY' anthony vargas Tony Vargas tony vargas
## 9 PETERSON, ARTHUR LAVERNE DR arthur peterson Art Peterson art peterson
## 10 GOLDBERG, BARTLEY F. bartley goldberg Bart Goldberg bart goldberg
We see now that even our most distant matches appear accurate. The higher threshold allows names that diverge significantly in formatting to still be matched. However, we still have only ~2000 matches, so we’re missing just under 300 candidates. What if we increase the threshold further, while still using our helper columns?
fuzzy4 <- fec %>%
stringdist_join(
select(ballotpedia, name, key, state, district, party_simple),
by = 'key',
mode = 'inner',
method = 'lv',
max_dist = 6,
ignore_case = TRUE,
distance_col = 'distance') %>%
filter(state.x == state.y,
district.x == district.y,
party_simple.x == party_simple.y)
nrow(fuzzy4)
## [1] 2198
fuzzy4 %>%
filter(distance > 4) %>%
select(name.x, key.x, name.y, key.y) %>%
arrange(key.x) %>%
head(10)
## # A tibble: 10 × 4
## name.x key.x name.y key.y
## <chr> <chr> <chr> <chr>
## 1 MAKROM, AHMED MATTHEW ahmed makrom Matt Makrom matt makrom
## 2 COHN, ALAN alan cohn Gavin Brown gavin brown
## 3 FERGUSON, ANDERSON DREW IV anderson ferguson Drew Ferguson drew ferguson
## 4 LOUDERBACK, ANDREW JOHN andrew louderback A.J. Louderback j louderback
## 5 PHELPS, ANDREW H. andrew phelps Andrew Halcro andrew halcro
## 6 JOHNSON, ARTHUR WAYNE arthur johnson A. Wayne Johnson a johnson
## 7 ARENHOLZ, ASHLEY HINSON ashley arenholz Ashley Hinson ashley hinson
## 8 BEARDSLEY, BENJAMIN RYAN benjamin beardsley Ben Beardsley ben beardsley
## 9 SMILOWITZ, BENJAMIN benjamin smilowitz Ben Smilowitz ben smilowitz
## 10 BEAL, BRIAN T. brian beal Tim Beck tim beck
Finally, we’ve gotten very close to a full match for all 2289 candidates in the Ballotpedia data. Moreover, even our most distant matches (those above 4) appear accurate. We do, however, see one match that appears inaccurate: Brian Beal has matched with Tim Beck.
We could continue to tweak our join criteria to refine this join, but ultimately, we need to aim for a “sweet spot” where we feel comfortable with the accuracy of our matches without losing meaningful data. That requires iteration, detailed review of results, and some degree of judgment.
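One simple aid to that review (a sketch, not part of the original analysis) is to list the Ballotpedia candidates that still have no match after the last join. Assuming the column suffixes produced by the join above (name.y holds the Ballotpedia name), an anti_join surfaces the remaining gaps.
# Ballotpedia candidates with no FEC match in fuzzy4
unmatched <- ballotpedia %>%
  anti_join(fuzzy4, by = c('name' = 'name.y'))
nrow(unmatched)
unmatched %>% select(name, state, district, party)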
I hope this vignette has demonstrated the value of the fuzzyjoin package, especially when dealing with unstructured or scraped data. The package offers many more functions, so opportunities abound for extension of this vignette!
Building on Keith’s example with the fuzzyjoin library, I will further showcase what this package can do using the geo_left_join() function, which joins datasets on geo-location data and computes the distance between matched points.
The data for this example comes from MTA subway station locations, which provide latitude and longitude for every subway exit along with its subway line. In addition, I will use data on NYC’s public acute care hospitals, which also provides latitude and longitude information.
The goal is to determine which subway lines are at most one mile away from each hospital.
Here, the MTA data is imported and cleaned up to be joined later on with the NYC hospital data.
library(janitor)
mta_df <- read_csv('input_data/mta_subway_stations.csv') |>
clean_names()
station_locations <-
mta_df |>
distinct(route1, station_latitude, station_longitude) |>
rename(train_route = route1, latitude = station_latitude, longitude = station_longitude)
Next, we clean the hospital data so that it can be joined with the MTA data.
hhc_df <- read_csv('input_data/nyc_hhc.csv') |>
  clean_names()
hhc_locations <-
  hhc_df |>
  filter(facility_type == 'Acute Care Hospital') |>
  # pull the "(lat, long)" portion out of location_1 and convert it to numeric columns
  mutate(lat_long = str_split_i(location_1, "\\(", 2),
         latitude = as.numeric(str_split_i(lat_long, "\\,", 1)),
         longitude = str_split_i(lat_long, "\\,", 2),
         longitude = as.numeric(str_remove(longitude, "\\)"))) |>
  select(facility_name, latitude, longitude)
We can use fuzzyjoin’s geo_left_join() function to combine both datasets, setting a maximum distance of one mile between a hospital and a subway exit. It is important to note that each subway line has multiple exits, and each exit may be a different distance from a given hospital. The merged data is therefore filtered to keep, for each line, the exit closest to the hospital.
station_hhc <-
  hhc_locations |>
  # join each hospital to all station exits within one mile
  geo_left_join(station_locations, max_dist = 1, distance_col = 'dist_to_hosp')
station_hhc_short <-
  station_hhc |>
  # for each hospital / line pair, keep only the closest exit
  group_by(facility_name, train_route) |>
  filter(dist_to_hosp == min(dist_to_hosp)) |>
  mutate(dist_to_hosp = round(dist_to_hosp, 2)) |>
  ungroup()
station_hhc_short <-
  station_hhc_short |>
  # collect all nearby lines into a single label per hospital
  group_by(facility_name) |>
  arrange(facility_name, train_route) |>
  mutate(all_trains = str_c(train_route, collapse = ", ")) |>
  ungroup()
Lastly, let’s plot an interactive map with the public hospitals as black circles and the subway exits as orange circles. Clicking a hospital marker shows which trains are within a one-mile radius, and clicking a subway exit marker shows the hospital and the distance to it.
library(leaflet)
leaflet(station_hhc_short) |>
addTiles() |>
addCircles(lng = ~longitude.x,
lat = ~latitude.x,
radius = 30,
weight = 1,
color = 'Black',
fillColor = 'Black',
fillOpacity = 0.05,
popup = paste("<b>Hospital: </b>", station_hhc_short$facility_name, "<br>",
"<b>Subway Lines: </b>", station_hhc_short$all_trains)
) |>
addCircles(lng = ~longitude.y,
lat = ~latitude.y,
radius = 15,
weight = 1,
color = '#c85200',
fillColor = '#c85200',
popup = paste("<b>Subway Line: </b>", station_hhc_short$train_route, "<br>",
"<b>Hospital: </b>", station_hhc_short$facility_name, "<br>",
"<b>Distance: </b>", station_hhc_short$dist_to_hosp, "mi.")
) |>
addProviderTiles("CartoDB.Positron") |>
setView(-73.98, 40.75, zoom = 10)
As this example shows, fuzzyjoin is not limited to matching similar names; it can also join datasets based on geo-location.