Fuzzy Matching

Fuzzy Matching is a technique for identifying approximate matches between strings, useful when exact matching fails due to typographical errors or variations in data. It employs algorithms like Levenshtein Distance or Jaccard Similarity to measure string similarity. Applications include data deduplication, record linkage, and text preprocessing in messy datasets.

The fuzzyjoin package in R is a powerful tool for performing joins between two datasets when exact matches are not feasible. This often occurs when datasets have typographical errors, variations in spelling, or inconsistent formatting. Unlike standard joins, which require exact matches, fuzzyjoin utilizes approximate matching techniques, such as string distances or other similarity measures, to link records.

For instance, one can use stringdist_left_join() to combine datasets by calculating the distance between strings and defining a threshold for acceptable matches. This makes it particularly useful in real-world scenarios, such as cleaning messy data or linking datasets from multiple sources where naming conventions differ. Notably, the package supports various string distance methods, including Levenshtein, Jaccard, and Cosine (Robinson & Hayes, 2020). Additionally, it integrates seamlessly with the tidyverse framework, making it highly versatile and user-friendly (Wickham et al., 2019).

A practical example might involve matching customer names in a CRM system to names in an external database, even if there are slight spelling differences. This flexibility makes fuzzyjoin indispensable for data preprocessing, enhancing accuracy in tasks like deduplication and data reconciliation.

An Example

Below are the R codes for Fuzzy Matching and its application.

# Clear R environment: 

rm(list = ls())

library(dplyr)
library(tidyr)
library(stringr)
library(fuzzyjoin)
library(viridis)
library(ggplot2)
library(tidycensus)

# Retrieve unemployed and labor force data: 

unemployment_data <- get_acs(geography = "county",
                             variables = c(unemployed = "B23025_005", labor_force = "B23025_003"),
                             year = 2021,
                             survey = "acs5")

# Convert to wide form data: 

unemployment_data %>% 
  select(NAME, variable, estimate) %>% 
  pivot_wider(names_from = "variable", values_from = "estimate") -> unemployment_wide

# Calculate unemployment rate by country: 

unemployment_wide %>% 
  mutate(rate = 100*unemployed / labor_force) %>% 
  mutate(NAME = str_to_lower(NAME)) -> unemployment_wide


unemployment_wide %>% 
  mutate(subregion = str_extract(NAME, ".*(?=\\s(county|parish|city))")) %>% 
  filter(!str_detect(subregion, "puerto rico")) %>% 
  filter(!str_detect(subregion, "alaska")) -> unemployment_wide

# Get data for all US counties: 

county_df <- map_data("county", projection = "albers", parameters = c(39, 45))

# Get data for all US states: 

state_df <- map_data("state", projection = "albers", parameters = c(39, 45))

# Fuzzy join: 

county_df %>%
  stringdist_left_join(unemployment_wide,
                       by = "subregion",
                       max_dist = 2,
                       method = "lv") -> county_df


# Make Choropleth Map: 

library(showtext)

my_font <- "Lato"

font_add_google(name = my_font, family = my_font)

showtext_auto()


ggplot() + 
  geom_polygon(data = county_df, aes(long, lat, group = group, fill = rate), colour = alpha("white", 0.1)) + 
  geom_path(data = state_df, aes(long, lat, group = group), colour = "white") + 
  theme_minimal() +
  theme(plot.background = element_rect(fill = "#EFF2F4", color = NA)) + 
  labs(title = "US Unemployment Rate by County, 2021", 
       subtitle = "This map shows the latest unemployment rate for each county in the United States.\nThe map updates itself daily and will show the most recent data available.", 
       caption = "Data Source: U.S Census Bureau | Graphic Design: Nguyen Chi Dung") + 
  theme(text = element_text(family = my_font)) + 
  theme(axis.text = element_blank()) + 
  theme(axis.title = element_blank()) + 
  theme(panel.grid = element_blank()) + 
  theme(plot.margin = unit(rep(0.5, 4), "cm")) +
  theme(plot.title = element_text(size = 18, color = "grey20")) + 
  theme(plot.subtitle = element_text(size = 10, color = "grey30")) + 
  theme(plot.caption = element_text(size = 8, color = "grey40")) + 
  scale_fill_viridis(discrete = FALSE,
                     option = "F", 
                     name = "Unemployment Rate (%)",
                     direction = -1,
                     guide = guide_colourbar(direction = "horizontal",
                                             barheight = unit(2, units = "mm"),
                                             barwidth = unit(40, units = "mm"),
                                             draw.ulim = FALSE,
                                             title.hjust = 0.5,
                                             label.hjust = 0.5, 
                                             title.position = "top")) + 
  theme(legend.position = c(0.5, 0.93)) + 
  theme(legend.text = element_text(color = "grey30")) + 
  theme(legend.title = element_text(colour = "grey20", size = 9)) 

References

  1. Robinson, D., & Hayes, M. (2020). fuzzyjoin: Join Data Frames on Inexact Matching. R package documentation.
  2. Wickham, H., François, R., Henry, L., & Müller, K. (2019). dplyr: A Grammar of Data Manipulation. R package documentation.