Optimal string joining example

In this example, we match up two vectors of strings in an optimal way using the stringdist package. This is a common task when working with sociological data at the country-level or for lower administrative divisions such as US states.

First we load some libraries:

# libs --------------------------------------------------------------------
library(pacman)
p_load(stringdist, reshape2, dplyr)

Then we make up some example data. I’ve picked five countries that have names that usually differ somewhat between Danish and English, sometimes not at all, and sometimes a lot.

# data --------------------------------------------------------------------
#EN names
EN = c("Denmark", "Norway", "USA", "Russia", "Germany")
#DA names
DA = c("Danmark", "Norge", "USA", "Rusland", "Tyskland")

Next we calculate the distances between strings across vectors and reshape the data a bit:

# distances ----------------------------------------------------------------
#matrix
dst = stringdist::stringdistmatrix(EN, DA)
#names
rownames(dst) = EN; colnames(dst) = DA
dst
##         Danmark Norge USA Rusland Tyskland
## Denmark       1     7   7       6        7
## Norway        6     3   6       6        7
## USA           7     5   0       7        8
## Russia        7     6   6       4        6
## Germany       5     6   7       5        6
#conver to 2-column data.frame
dst_df = melt(dst, c("EN", "DA"))
dst_df
##         EN       DA value
## 1  Denmark  Danmark     1
## 2   Norway  Danmark     6
## 3      USA  Danmark     7
## 4   Russia  Danmark     7
## 5  Germany  Danmark     5
## 6  Denmark    Norge     7
## 7   Norway    Norge     3
## 8      USA    Norge     5
## 9   Russia    Norge     6
## 10 Germany    Norge     6
## 11 Denmark      USA     7
## 12  Norway      USA     6
## 13     USA      USA     0
## 14  Russia      USA     6
## 15 Germany      USA     7
## 16 Denmark  Rusland     6
## 17  Norway  Rusland     6
## 18     USA  Rusland     7
## 19  Russia  Rusland     4
## 20 Germany  Rusland     5
## 21 Denmark Tyskland     7
## 22  Norway Tyskland     7
## 23     USA Tyskland     8
## 24  Russia Tyskland     6
## 25 Germany Tyskland     6
#sort
dst_df = dplyr::arrange(dst_df, value)
dst_df
##         EN       DA value
## 1      USA      USA     0
## 2  Denmark  Danmark     1
## 3   Norway    Norge     3
## 4   Russia  Rusland     4
## 5  Germany  Danmark     5
## 6      USA    Norge     5
## 7  Germany  Rusland     5
## 8   Norway  Danmark     6
## 9   Russia    Norge     6
## 10 Germany    Norge     6
## 11  Norway      USA     6
## 12  Russia      USA     6
## 13 Denmark  Rusland     6
## 14  Norway  Rusland     6
## 15  Russia Tyskland     6
## 16 Germany Tyskland     6
## 17     USA  Danmark     7
## 18  Russia  Danmark     7
## 19 Denmark    Norge     7
## 20 Denmark      USA     7
## 21 Germany      USA     7
## 22     USA  Rusland     7
## 23 Denmark Tyskland     7
## 24  Norway Tyskland     7
## 25     USA Tyskland     8
#save copy
dst_df_orig = dst_df

Finally, we loop around this object and pick the best matches one by one:

# match -------------------------------------------------------------------
#storing best matches
best_matches = matrix(nrow=0, ncol=3)

#keep matching and removing pairs until we run out of data
while (nrow(dst_df) > 0) {
  #top value is always the best match because we sorted the data initially
  best_matches = rbind(best_matches, dst_df[1, ])
  
  #remove rows with the same names
  #i.e. keep only those that have non-identical names in both columns to the ones we saved
  dst_df = dplyr::filter(dst_df, (!dst_df[1, 1] == dst_df[, 1]) & (!dst_df[1, 2] == dst_df[, 2]))
}

#view matches
best_matches
##        EN       DA value
## 1     USA      USA     0
## 2 Denmark  Danmark     1
## 3  Norway    Norge     3
## 4  Russia  Rusland     4
## 5 Germany Tyskland     6

As can be seen, all the pairs were matched up correctly. Even Germany which has a totally dissimilar name to the Danish one (which is related to the German and Dutch names: Deutschland, Duitsland). The reason is simply that it was the last pair in the dataset, so it got paired up no matter how distant. This can cause trouble if one ends up with two a group of names with no sensible matches, so beware of the matching produced.

One can modify this setup so that it stops when distances becomes too large, like the join functions in the fuzzyjoin package. One can also use other string distance measures. Here we used the default one from stringdist package, but it has a number of other ones that may be more suitable.

A join function

Finally, an example implementation of the above system using stringdist and dplyr as the main workhorses. Our test tables consist of partial, but real dataset from my latest project on Argentina:

d1
##                  Province European
## 1            Buenos Aires     0.76
## 2  Buenos Aires City (DC)     0.80
## 3               Catamarca     0.50
## 4                   Chaco     0.66
## 5                  Chubut     0.54
## 6                 Córdoba     0.65
## 7              Corrientes     0.69
## 8              Entre Ríos     0.78
## 9                 Formosa     0.68
## 10                  Jujuy     0.31
## 11               La Pampa     0.81
## 12               La Rioja     0.50
## 13                Mendoza     0.70
## 14               Misiones     0.71
## 15                Neuquén     0.72
## 16              Río Negro     0.69
## 17                  Salta     0.31
## 18               San Juan     0.62
## 19               San Luis     0.67
## 20             Santa Cruz     0.55
## 21               Santa Fe     0.80
## 22    Santiago del Estero     0.43
## 23       Tierra del Fuego     0.55
## 24                Tucumán     0.61
d2
##                  Province Latitude
## 1            Buenos Aires    34.92
## 2               Catamarca    28.47
## 3                   Chaco    27.45
## 4                  Chubut    43.30
## 5  Ciudad de Buenos Aires    34.60
## 6                 Cordoba    31.42
## 7              Corrientes    27.47
## 8              Entre Rios    31.74
## 9                 Formosa    26.19
## 10                  Jujuy    24.19
## 11               La Pampa    36.62
## 12               La Rioja    29.41
## 13                Mendoza    32.89
## 14               Misiones    27.36
## 15                Neuquen    38.95
## 16              Rio Negro    40.81
## 17                  Salta    24.78
## 18               San Juan    31.54
## 19               San Luis    33.30
## 20             Santa Cruz    51.62
## 21               Santa Fe    31.61
## 22    Santiago del Estero    27.78
## 23       Tierra del Fuego    54.80
## 24                Tucuman    26.81

We see that the names generally, but not entirely match up. There are some missing diacritics and the capital has divergent names. The join function looks like this:

stringdist_optimal_join = function(x, y, x_col, y_col, preview = F, ...) {
  library(reshape2); library(dplyr); library(fuzzyjoin); library(kirkegaard)
  #devtools::install_github("deleetdk/kirkegaard")
  
  #check x, y
  is_(x, class = "data.frame", error_on_false = T)
  is_(y, class = "data.frame", error_on_false = T)
  is_(preview, class = "logical", error_on_false = T)
  
  if (missing(x_col) & missing(y_col)) {
    #extract vctrs
    x_names = rownames(x)
    y_names = rownames(y)
    message("x_col or y_col not given, using rownames from both")
  } else {
    #extract vctrs
    x_names = x[[x_col]]
    y_names = y[[y_col]]
  }

  #check factors
  if (is.factor(x_names)) {
    x_names %<>% as.character()
    warning("x input has a fctr column. This was converted to a chr column.")
  }
  if (is.factor(y_names)) {
    y_names %<>%  as.character()
    warning("y input has a fctr column. This was converted to a chr column.")
  }
  
  #check duplicates
  if (any(duplicated(x_names))) stop("Duplicate names in x!")
  if (any(duplicated(y_names))) stop("Duplicate names in y!")
  
  #distances
  dst = stringdist::stringdistmatrix(x_names, y_names, ...)
  
  #names
  rownames(dst) = x_names; colnames(dst) = y_names

  #conver to 2-column data.frame
  dst_df = melt(dst, c("x", "y"))

  #sort
  dst_df = dplyr::arrange(dst_df, value)

  #storing best matches
  best_matches = matrix(nrow=0, ncol=3)
  
  #keep matching and removing pairs until we run out of data
  while (nrow(dst_df) > 0) {
    #top value is always the best match because we sorted the data initially
    best_matches = rbind(best_matches, dst_df[1, ])
    
    #remove rows with the same names
    #i.e. keep only those that have non-identical names in both columns to the ones we saved
    dst_df = dplyr::filter(dst_df, (!dst_df[1, 1] == dst_df[, 1]) & (!dst_df[1, 2] == dst_df[, 2]))
  }
  
  #return just match info?
  if (preview) return(best_matches)
  
  #make rownames to reorder with
  rownames(x) = x_names
  rownames(y) = y_names
  
  #reorder x, y
  x = x[best_matches$x, ]
  y = y[best_matches$y, ]
  
  #insert matching columns
  x$..names = best_matches$x
  y$..names = best_matches$x
  
  #join
  d_out = dplyr::full_join(x, y, by = c("..names" = "..names"))
  
  #remove matching columns
  d_out$..names = NULL
  
  #insert stringdist value
  d_out$.stringdist = best_matches$value
  
  #return
  d_out
}

And then we join the tables:

stringdist_optimal_join(d1, d2, "Province", "Province")
## 
## Attaching package: 'kirkegaard'
## The following object is masked from 'package:base':
## 
##     +
##                Province.x European             Province.y Latitude
## 1            Buenos Aires     0.76           Buenos Aires    34.92
## 2               Catamarca     0.50              Catamarca    28.47
## 3                   Chaco     0.66                  Chaco    27.45
## 4                  Chubut     0.54                 Chubut    43.30
## 5              Corrientes     0.69             Corrientes    27.47
## 6                 Formosa     0.68                Formosa    26.19
## 7                   Jujuy     0.31                  Jujuy    24.19
## 8                La Pampa     0.81               La Pampa    36.62
## 9                La Rioja     0.50               La Rioja    29.41
## 10                Mendoza     0.70                Mendoza    32.89
## 11               Misiones     0.71               Misiones    27.36
## 12                  Salta     0.31                  Salta    24.78
## 13               San Juan     0.62               San Juan    31.54
## 14               San Luis     0.67               San Luis    33.30
## 15             Santa Cruz     0.55             Santa Cruz    51.62
## 16               Santa Fe     0.80               Santa Fe    31.61
## 17    Santiago del Estero     0.43    Santiago del Estero    27.78
## 18       Tierra del Fuego     0.55       Tierra del Fuego    54.80
## 19                Córdoba     0.65                Cordoba    31.42
## 20             Entre Ríos     0.78             Entre Rios    31.74
## 21                Neuquén     0.72                Neuquen    38.95
## 22              Río Negro     0.69              Rio Negro    40.81
## 23                Tucumán     0.61                Tucuman    26.81
## 24 Buenos Aires City (DC)     0.80 Ciudad de Buenos Aires    34.60
##    .stringdist
## 1            0
## 2            0
## 3            0
## 4            0
## 5            0
## 6            0
## 7            0
## 8            0
## 9            0
## 10           0
## 11           0
## 12           0
## 13           0
## 14           0
## 15           0
## 16           0
## 17           0
## 18           0
## 19           1
## 20           1
## 21           1
## 22           1
## 23           1
## 24          20