In this example, we match up two vectors of strings in an optimal way using the stringdist package. This is a common task when working with sociological data at the country-level or for lower administrative divisions such as US states.
First we load some libraries:
# libs --------------------------------------------------------------------
library(pacman)
p_load(stringdist, reshape2, dplyr)
Then we make up some example data. I’ve picked five countries that have names that usually differ somewhat between Danish and English, sometimes not at all, and sometimes a lot.
# data --------------------------------------------------------------------
#EN names
EN = c("Denmark", "Norway", "USA", "Russia", "Germany")
#DA names
DA = c("Danmark", "Norge", "USA", "Rusland", "Tyskland")
Next we calculate the distances between strings across vectors and reshape the data a bit:
# distances ----------------------------------------------------------------
#matrix
dst = stringdist::stringdistmatrix(EN, DA)
#names
rownames(dst) = EN; colnames(dst) = DA
dst
## Danmark Norge USA Rusland Tyskland
## Denmark 1 7 7 6 7
## Norway 6 3 6 6 7
## USA 7 5 0 7 8
## Russia 7 6 6 4 6
## Germany 5 6 7 5 6
#conver to 2-column data.frame
dst_df = melt(dst, c("EN", "DA"))
dst_df
## EN DA value
## 1 Denmark Danmark 1
## 2 Norway Danmark 6
## 3 USA Danmark 7
## 4 Russia Danmark 7
## 5 Germany Danmark 5
## 6 Denmark Norge 7
## 7 Norway Norge 3
## 8 USA Norge 5
## 9 Russia Norge 6
## 10 Germany Norge 6
## 11 Denmark USA 7
## 12 Norway USA 6
## 13 USA USA 0
## 14 Russia USA 6
## 15 Germany USA 7
## 16 Denmark Rusland 6
## 17 Norway Rusland 6
## 18 USA Rusland 7
## 19 Russia Rusland 4
## 20 Germany Rusland 5
## 21 Denmark Tyskland 7
## 22 Norway Tyskland 7
## 23 USA Tyskland 8
## 24 Russia Tyskland 6
## 25 Germany Tyskland 6
#sort
dst_df = dplyr::arrange(dst_df, value)
dst_df
## EN DA value
## 1 USA USA 0
## 2 Denmark Danmark 1
## 3 Norway Norge 3
## 4 Russia Rusland 4
## 5 Germany Danmark 5
## 6 USA Norge 5
## 7 Germany Rusland 5
## 8 Norway Danmark 6
## 9 Russia Norge 6
## 10 Germany Norge 6
## 11 Norway USA 6
## 12 Russia USA 6
## 13 Denmark Rusland 6
## 14 Norway Rusland 6
## 15 Russia Tyskland 6
## 16 Germany Tyskland 6
## 17 USA Danmark 7
## 18 Russia Danmark 7
## 19 Denmark Norge 7
## 20 Denmark USA 7
## 21 Germany USA 7
## 22 USA Rusland 7
## 23 Denmark Tyskland 7
## 24 Norway Tyskland 7
## 25 USA Tyskland 8
#save copy
dst_df_orig = dst_df
Finally, we loop around this object and pick the best matches one by one:
# match -------------------------------------------------------------------
#storing best matches
best_matches = matrix(nrow=0, ncol=3)
#keep matching and removing pairs until we run out of data
while (nrow(dst_df) > 0) {
#top value is always the best match because we sorted the data initially
best_matches = rbind(best_matches, dst_df[1, ])
#remove rows with the same names
#i.e. keep only those that have non-identical names in both columns to the ones we saved
dst_df = dplyr::filter(dst_df, (!dst_df[1, 1] == dst_df[, 1]) & (!dst_df[1, 2] == dst_df[, 2]))
}
#view matches
best_matches
## EN DA value
## 1 USA USA 0
## 2 Denmark Danmark 1
## 3 Norway Norge 3
## 4 Russia Rusland 4
## 5 Germany Tyskland 6
As can be seen, all the pairs were matched up correctly. Even Germany which has a totally dissimilar name to the Danish one (which is related to the German and Dutch names: Deutschland, Duitsland).
One can modify this setup so that it stops when distances becomes too large, like the join functions in the fuzzyjoin package. One can also use other string distance measures. Here we used the default one from stringdist package, but it has a number of other ones that may be more suitable.