library(tidyverse)
library(fuzzyjoin)
This vignette will introduce the fuzzyjoin package,
which enables joining of two datasets based on imperfect matches. This
package is very helpful for combining data without unique
keys.
We will use data related to candidates running in the 2022 election for the House of Representatives. Specifically, we’ll aim to join data from the Federal Election Committee (FEC) with data scraped from the Ballotpedia website. While the FEC data has a unique key (the FEC ID) for each candidate, the Ballotpedia data does not. So, we will need to rely primarily on candidate names to perform matches.
Candidate names, however, are not consistent across both sources. As we will see, names sometime include middle names and suffixes (e.g. Jr.) but other times do not. Some candidates are listed according to a nickname (Bill versus William) or initials (Jonathan Taylor versus JT). As a result, using names as a unique key would fail to capture many matches. If the names are relatively close, however, we can use fuzzy joins to get the job done!
Our first dataset comes from the FEC API (https://api.open.fec.gov/developers/), listing all federally registered House candidates in the 2022 election cycle. I prepared the data ahead of time for simplicity, so we can just read it in here.
fec <- read_csv('input_data/fec_house_candidates_2022.csv')
## Rows: 3676 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): fec_id, name, state, district, party, office, incumbent_challenge, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(fec)
## Rows: 3,676
## Columns: 8
## $ fec_id <chr> "H2CO07170", "H2UT03280", "H2CA30291", "H2MN07162"…
## $ name <chr> "AADLAND, ERIK", "AALDERS, TIM", "AAZAMI, SHERVIN"…
## $ state <chr> "CO", "UT", "CA", "MN", "NY", "MN", "OH", "CA", "N…
## $ district <chr> "07", "03", "32", "07", "12", "05", "09", "50", "0…
## $ party <chr> "REPUBLICAN PARTY", "REPUBLICAN PARTY", "DEMOCRATI…
## $ office <chr> "House", "House", "House", "House", "House", "Hous…
## $ incumbent_challenge <chr> "Open seat", "Challenger", "Challenger", "Challeng…
## $ candidate_status <chr> "P", "C", "N", "P", "P", "P", "N", "C", "N", "N", …
Our second dataset comes from Ballotpedia, a nonprofit online political encyclopedia. As above, I prepared the data ahead of time (see https://github.com/kac624/cuny/blob/main/D607/ballotpediaScrape.ipynb), so we can simply read it in here.
ballotpedia <- read_csv('input_data/ballotpedia_scrape2022.csv')
## Rows: 2289 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): name, party, state, last_election, twitter, ballotpedia
## dbl (1): district
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(ballotpedia)
## Rows: 2,289
## Columns: 7
## $ name <chr> "Tony Gonzales", "Willie Montague", "Michelle Munroe", "…
## $ party <chr> "Republican Party", "Republican Party", "Democratic Part…
## $ state <chr> "Texas", "Florida", "Georgia", "Ohio", "Mississippi", "S…
## $ district <dbl> 23, 10, 1, 8, 4, 5, 13, 2, 5, 8, 8, 4, 46, 5, 14, 1, 32,…
## $ last_election <chr> "8-Nov-22", "23-Aug-22", "24-May-22", "3-May-22", "7-Jun…
## $ twitter <chr> "['https://www.twitter.com/RepTonyGonzales', 'https://ww…
## $ ballotpedia <chr> "https://ballotpedia.org/Tony_Gonzales", "https://ballot…
We can already see some significant differences between the names across the two datasets. Let’s look at one specific example to illustrate these differences.
fec %>%
filter(str_detect(name, 'ELLZEY')) %>%
select(name, state, district, party)
## # A tibble: 1 × 4
## name state district party
## <chr> <chr> <chr> <chr>
## 1 ELLZEY, JOHN KEVIN SR. TX 06 REPUBLICAN PARTY
ballotpedia %>%
filter(str_detect(name, 'Ellzey')) %>%
select(name, state, district, party)
## # A tibble: 1 × 4
## name state district party
## <chr> <chr> <dbl> <chr>
## 1 Jake Ellzey Texas 6 Republican Party
In the FEC data, the Republican candidate for the 6th district of Texas is listed as John Kevin Ellzey Sr., but in the Ballotpedia data, he is listed as Jake Ellzey. We can use the party, state and district fields to guide our join, but ultimately, we’ll need to rely on imperfect name matches to connect these two datasets.
First, we’ll do some cleaning on the state and district fields of the Ballotpedia data so that it matches the format of the FEC data. We’ll also remove duplicate rows in the Ballotpedia data.
ballotpedia <- ballotpedia[!duplicated(ballotpedia),] %>%
mutate(district = if_else(is.na(district), '00', sprintf("%02d", district)),
state = str_replace_all(state, '_', ' '))
head(ballotpedia)
## # A tibble: 6 × 7
## name party state district last_election twitter ballotpedia
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Tony Gonzales Republican P… Texas 23 8-Nov-22 ['http… https://ba…
## 2 Willie Montague Republican P… Flor… 10 23-Aug-22 ['http… https://ba…
## 3 Michelle Munroe Democratic P… Geor… 01 24-May-22 ['http… https://ba…
## 4 Phil Heimlich Republican P… Ohio 08 3-May-22 ['http… https://ba…
## 5 Carl Boyanton Republican P… Miss… 04 7-Jun-22 ['http… https://ba…
## 6 Larry Gaither Green Party Sout… 05 8-Nov-22 [] https://ba…
We’ll do some additional cleaning to get the party and name columns
as close as possible. For the party column, we’ll simplify things by
listing party as one of three options: Republican, Democrat or Other.
For the name column, we’ll actually create a new column (named
key) with all lowercase letters and a number of “noisy”
items removed (punctuation, suffixes, prefixes and duplicate spaces).
Removing this noise will support more accurate matches in the fuzzy
join.
fec <- fec %>%
mutate(party_simple = case_when(party == 'REPUBLICAN PARTY' ~ 'Republican',
party == 'DEMOCRATIC PARTY' ~ 'Democrat',
TRUE ~ 'Other'),
key = case_when(!str_detect(name,',') ~ name,
TRUE ~ str_c(str_extract(name, '(?<=, ).*'), ' ',
str_extract(name, '.*(?=,)'))) %>%
str_remove_all(str_c(
'(MS\\.|MRS\\.|MR\\.|DR\\.|JR\\.|JR|SR\\.|III|II|',
'\\b[A-Z]\\.\\b|\\b[A-Z]\\b|É|[0-9])')) %>%
str_remove_all('[[:punct:]]') %>%
str_replace_all(' ', ' ') %>%
trimws() %>%
tolower(),
key = str_c(str_extract(key, '^[A-Za-z]+'),
str_extract(key, ' [A-Za-z]+$')))
ballotpedia <- ballotpedia %>%
mutate(state = state.abb[match(state,state.name)],
party_simple = case_when(party == 'Republican Party' ~ 'Republican',
party == 'Democratic Party' ~ 'Democrat',
TRUE ~ 'Other'),
key = str_remove_all(name, '(III|II|Jr\\.|Sr\\.|\\b[A-Z]\\.\\b|é)') %>%
str_remove_all('[[:punct:]]') %>%
str_replace_all(' ', ' ') %>%
trimws() %>%
tolower(),
key = str_c(str_extract(key, '^[A-Za-z]+'),
str_extract(key, ' [A-Za-z]+$')))
fec[is.na(fec)] <- ''
ballotpedia[is.na(ballotpedia)] <- ''
fec$key[1:5]
## [1] "erik aadland" "tim aalders" "shervin aazami" "jill abahsain"
## [5] "rana abdelhamid"
ballotpedia$key[1:5]
## [1] "tony gonzales" "willie montague" "michelle munroe" "phil heimlich"
## [5] "carl boyanton"
With our names cleaned, we can now try our first fuzzy join! While
the fuzzyjoin package offers a number of functions, we’ll
focus on the stringdist_join function, which uses imperfect
string matches. For this example, we want to compare the number of
matches using different fuzzy parameters, so we’ll set the join mode to
inner (this parameter takes the typical join types, such as
left, right and full).
The most critical arguments are method and
max_dist. The method argument determines the
mathematical approach for identifying matches. The various options are
all detailed in documentation for the stringdist package
(see https://search.r-project.org/CRAN/refmans/stringdist/html/stringdist-metrics.html).
We’ll use the lv option for now, which corresponds to
Levenshtein distance, a function that counts the number of deletions,
insertions and substitutions necessary to turn string b
into string a.
The max_dist argument serves as a threshold for how
fuzzy we want our matches to be. A higher max_dist
means that more dissimilar strings will still count as matches.
Conversely, a lower max_dist will result in only strings
that are very similar counting as matches. We add the
distance_col argument to create a new column that details
the distance of the two strings matched. We’ll set this to 3 for
now.
fuzzy1 <- fec %>%
stringdist_join(
select(ballotpedia, name, key),
by = 'key',
mode = 'inner',
method = 'lv',
max_dist = 3,
ignore_case = TRUE,
distance_col = 'distance')
nrow(fuzzy1)
## [1] 2915
fuzzy1 %>%
filter(distance > 2) %>%
select(name.x, key.x, name.y, key.y) %>%
arrange(key.x) %>%
head(10)
## # A tibble: 10 × 4
## name.x key.x name.y key.y
## <chr> <chr> <chr> <chr>
## 1 JAY, AARON MR aaron jay Aaron Bean aaron bean
## 2 LEE, AARON PAYTON aaron lee Aaron Bean aaron bean
## 3 SIMS, AARON MICHAEL aaron sims Alan Sims alan sims
## 4 SMITH, AARON aaron smith Adrian Smith adrian smith
## 5 SMITH, AARON aaron smith Ian Smith ian smith
## 6 MARTIN, ADAM adam martin Adam Morgan adam morgan
## 7 MORGAN, ADAM adam morgan Adam Martin adam martin
## 8 ORTIZ, ADAM adam ortiz Adam Martin adam martin
## 9 SMITH, D. ADAM adam smith D. Adam Smith d smith
## 10 SMITH, D. ADAM adam smith Adrian Smith adrian smith
We see nearly 3000 matches, which significantly exceeds the total number of observations in the Ballotpedia data. Moreover, when we view the matches with the greatest distance (3), we see some erroneous matches. Aaron Bean has matched with both Arron Jay and Aaron Lee.
So, let’s try again, but with a lower threshold, setting
max_dist to 1.
fuzzy2 <- fec %>%
stringdist_join(
select(ballotpedia, name, key),
by = 'key',
mode = 'inner',
method = 'lv',
max_dist = 1,
ignore_case = TRUE,
distance_col = 'distance')
nrow(fuzzy2)
## [1] 2002
fuzzy2 %>%
select(name.x, key.x, name.y, key.y) %>%
arrange(key.x) %>%
head(10)
## # A tibble: 10 × 4
## name.x key.x name.y key.y
## <chr> <chr> <chr> <chr>
## 1 RHODES, AARIKA SAMONE aarika rhodes Aarika Rhodes aarika rhodes
## 2 BEAN, AARON P. aaron bean Aaron Bean aaron bean
## 3 CALKINS, AARON (A.J.) aaron calkins Aaron Calkins aaron calkins
## 4 HEINEMAN, AARON aaron heineman Aaron Heineman aaron heineman
## 5 STORER, AARON aaron storer Aaron Storer aaron storer
## 6 MULUGHETA, ABEL abel mulugheta Abel Mulugheta abel mulugheta
## 7 GARAPATI, ABHIRAM abhiram garapati Abhiram Garapati abhiram garapati
## 8 SPANBERGER, ABIGAIL abigail spanberger Abigail Spanberger abigail spanberg…
## 9 FRISCH, ADAM adam frisch Adam Frisch adam frisch
## 10 GRAY, ADAM C. adam gray Adam Gray adam gray
With a lower threshold these matches appear more accurate. However, we also appear to fail to match a number of candidates, as our joined dataframe has only ~2000 rows, compared to 2289 in the Ballotpedia dataset.
Perhaps there is a middle ground? We can increase the threshold again, but this time use “helper columns” to get rid of erroneous matches. Specifically, we can use the state, district and party columns to further refine our join. Once the join is complete, we then filter for those rows in which these three columns match.
fuzzy3 <- fec %>%
stringdist_join(
select(ballotpedia, name, key, state, district, party_simple),
by = 'key',
mode = 'inner',
method = 'lv',
max_dist = 3,
ignore_case = TRUE,
distance_col = 'distance') %>%
filter(state.x == state.y,
district.x == district.y,
party_simple.x == party_simple.y)
nrow(fuzzy3)
## [1] 2005
fuzzy3 %>%
filter(distance > 2) %>%
select(name.x, key.x, name.y, key.y) %>%
arrange(key.x) %>%
head(10)
## # A tibble: 10 × 4
## name.x key.x name.y key.y
## <chr> <chr> <chr> <chr>
## 1 SMITH, D. ADAM adam smith D. Adam Smith d smith
## 2 THAKUR, ALBERT 'BURT' albert thakur Burt Thakur burt thakur
## 3 HORNING, ANDREW andrew horning Andy Horning andy horning
## 4 HORNING, ANDREW M. andrew horning Andy Horning andy horning
## 5 CRAIG, ANGELA DAWN angela craig Angie Craig angie craig
## 6 NORMOYLE, ANGELA H. angela normoyle Angie Normoyle angie normoyle
## 7 COWDEN, ANTHONY 'TONY' anthony cowden Tony Cowden tony cowden
## 8 VARGAS, ANTHONY 'TONY' anthony vargas Tony Vargas tony vargas
## 9 PETERSON, ARTHUR LAVERNE DR arthur peterson Art Peterson art peterson
## 10 GOLDBERG, BARTLEY F. bartley goldberg Bart Goldberg bart goldberg
We see now that even our most distance matches appear accurate. The distance allows for names that diverge significantly in terms of formatting to still be matched. However, we still only have ~2000 matches, so we’re missing just under 300 candidates. What if we increase the threshold further, while still using our helper columns?
fuzzy4 <- fec %>%
stringdist_join(
select(ballotpedia, name, key, state, district, party_simple),
by = 'key',
mode = 'inner',
method = 'lv',
max_dist = 6,
ignore_case = TRUE,
distance_col = 'distance') %>%
filter(state.x == state.y,
district.x == district.y,
party_simple.x == party_simple.y)
nrow(fuzzy4)
## [1] 2198
fuzzy4 %>%
filter(distance > 4) %>%
select(name.x, key.x, name.y, key.y) %>%
arrange(key.x) %>%
head(10)
## # A tibble: 10 × 4
## name.x key.x name.y key.y
## <chr> <chr> <chr> <chr>
## 1 MAKROM, AHMED MATTHEW ahmed makrom Matt Makrom matt makrom
## 2 COHN, ALAN alan cohn Gavin Brown gavin brown
## 3 FERGUSON, ANDERSON DREW IV anderson ferguson Drew Ferguson drew ferguson
## 4 LOUDERBACK, ANDREW JOHN andrew louderback A.J. Louderback j louderback
## 5 PHELPS, ANDREW H. andrew phelps Andrew Halcro andrew halcro
## 6 JOHNSON, ARTHUR WAYNE arthur johnson A. Wayne Johnson a johnson
## 7 ARENHOLZ, ASHLEY HINSON ashley arenholz Ashley Hinson ashley hinson
## 8 BEARDSLEY, BENJAMIN RYAN benjamin beardsley Ben Beardsley ben beardsley
## 9 SMILOWITZ, BENJAMIN benjamin smilowitz Ben Smilowitz ben smilowitz
## 10 BEAL, BRIAN T. brian beal Tim Beck tim beck
Finally, we’ve gotten very close to a full match for all 2289 candidates in the Ballotpedia data. Moreover, even our most distant matches (those above 4) appear accurate. We do, however, see one match that appears inaccurate: Brian Beal has matched with Tim Beck.
We could continue to tweak our join criteria to refine this join, but ultimately, we need to aim for a “sweet spot” where we feel comfortable with the accuracy of our matches without losing meaningful data. That requires iteration, detailed review of results, and some degree of judgment.
I hope this vignette has demonstrated the value of the
fuzzyjoin package, especially when dealing with
unstructured / scraped data. The package offers many more functions, so
opportunities abound for extension of this vignette!