R for Record Linkage

Ahmad Emad & Celeste Stone
February 18th, 2016

Definition

  • determine if pairs of data records describe the same entity
  • Entities: usually people, organizations
  • Data records: names, addresses, date of birth, etc

Applications

  • Matching two disparate data sources with no common unique identifiers
  • Deduplicating entities within a dataset

Techniques

Manual

  • Not feasible as the number of records grows

Deterministic

  • Automatic comparisons where either everything needs to match, or specific data specific rules are programmed.
  • Needs new rule for every variation in data
  • Not generalizable to other datasets

Probabilistic

  • Estimate probability/likelihood two entities are same
  • Handles missing data, variations in coding

Probabilistic Matching Workflow

Probabilistic Matching Workflow

  • Preprocessing: developing link keys, extracting information from link keys, normalization of link keys
  • Reduction of search space: Blocking
  • Comparison: String metrics, year comparisons, numeric comparisons
  • Classification: Fellegi-Sunter Model
  • Final prediction: cut off scores, validation

Fellegi-Sunter Model

In the simplest version of the model:

  • Two sets to be linked \( A \) and \( B \) where \( A \cup B = M \times U \) (matches and nonmatches)
  • A binary vector-valued comparison function \( \gamma : A \times B \rightarrow {0,1}^d \)
  • There are weights \( m_j \) such that if \( g \) is a particular comparison outcome and \( z_{jk} \) is an indicator which is equal to 1, assume \[ P(\gamma = g\vert M) \prod_{j=1}^{d} P(\gamma_j = g_j\vert M) = \prod_{j=1}^{d} {m_j}^{g_{j}} (1-m_j)^{1-g_{j}} \]
    \( P(\gamma = g\vert M) \) is defined similarly.
  • Weights \( m_j \) can be fitted using expectation-maximization (EM) or Bayesian methods.
  • Extends naturally to comparison outcomes with more than 2 levels

Fellegi-Sunter Model

  • To determine if \( a \in A \) and \( b \in B \) represent the same individual, compute the comparison vector \( \gamma(a, b) = g \)
    and take the log-likelihood ratio (match score):


    \[ log L = log P(\gamma=g\vert M) - log P(\gamma=g\vert U) \]

  • More likely matches will have a higher match score. By studying the data, a threshold is found such that pairs with a match score above the threshold are considered matches.

String Comparators

Jaro distance

– calculates the number \( m \) of common characters that are within half the length of the longer string and the number of transpositions \( t \)

Jaro dist = \( \frac{1}{3} (\dfrac{m}{\vert s_1 \vert}+ \dfrac{m}{\vert s_2 \vert} + \dfrac{m-t}{\vert m\vert}) \)

Jaro-Winkler distance

– improves upon the Jaro algorithm by applying ideas based on empirical studies – Fewer errors occur at the beginning of strings

Jaro-winkler dist = \( Jaro\;dist + l(1-Jaro\;dist) \)
\( l \) is the length of common prefix at the start of the string up to a maximum of 4 characters

String Comparators

Levenshtein Distance

– Minimum number of edits needed to convert \( s_1 \) into \( s_2 \)
– Normalized by dividing by length of longer string
– Not widely used

Phonetic (Soundex)

– indexing names by sound, as pronounced in English
– Mainly encodes consonants, vowels are considered unless it is the first letter
– Do not work well on Asian names

  • Need to convert string distance metrics to discrete values.

Case Study: Example Research Question

How many projects at the National Science Foundation (NSF) result in patents?

  • Send out a survey to NSF researchers, time consuming and non response bias.
  • Link researchers from NSF to inventors in USPTO database.

Case Study: Input Data

Sources of Data

NSF

https://www.nsf.gov/awardsearch/download.jsp
– Data on NSF awards and researchers
– 230,531 individual researchers

USPTO

http://www.patentsview.org/download/
– Data on patents and inventors in the US since 1976
– 2,019,077 inventors

Case Study: Input Data

Relevant fields

– First name, last name, institution information, City name, state name

Blocking fields

– Use first initial, last name and state as blocking fields.

Case Study First look at the data

nsf <- read.csv("./data/nsf.csv", nrows = 1000)
print(head(nsf))
  InvestigatorId FirstName       LastName
1          62182        *. Atta-Ur-Rahman
2          60555        *. Atta-Ur-Rahman
3         411085         -          Robby
4         353742        -.           None
5          79724         A          Agate
6         271298         A          Anwar
                                     Name        CityName StateCode
1 Pennsylvania State Univ University Park UNIVERSITY PARK        PA
2                   University of Karachi         Karachi      NULL
3                 Kansas State University       MANHATTAN        KS
4        Rutgers University New Brunswick   NEW BRUNSWICK        NJ
5                        Individual Award       Baltimore        MD
6               University of Connecticut          Storrs        CT

Case Study: First look at the data contd

patents <- read.csv("./data/patents2.csv", nrows = 1000)
print(head(patents))
      seq     firstname      lastname                 city state
1 2931635          Eryk         Stacy            Silverado    CA
2 2931636    Donald Lee        Brisco        Lake Elsinore    CA
3  509994       Kirk R.          Hyde Palos Verdes Estates    CA
4  898216         Ralph         Jelic             Valencia    PA
5 3475416 Thomas Joseph       Wozniak          Minneapolis    MN
6 3401630    Domenic A. Tortolano Jr.             Johnston    RI
                         organization
1   Fleetwood Aluminum Products, Inc.
2   Fleetwood Aluminum Products, Inc.
3    Kirkplastic Company Incorporated
4 International Window Fashions, Inc.
5                     Terrybear, Inc.
6                   D. Gioielli, Inc.

Case Study: Preprocessing the data

  • Identifying common fields
patents <- patents[,c("seq", "firstname", "lastname", "city", "state", "organization")]
nsf <- nsf[, c("InvestigatorId", "FirstName", "LastName", "CityName", "StateCode", "Name")]
names(nsf) <- names(patents)  
  • Convert to Upper Case
patents[,-1] <- as.data.frame(sapply(patents[,-1], toupper))
nsf[,-1] <- as.data.frame(sapply(nsf[,-1], toupper))

Case Study: Preprocessing the data contd.

  • Remove Punctuations
patents[,-1] <- as.data.frame(sapply(patents[,-1], function(x) gsub("[[:punct:]]", "", x)))
nsf[,-1] <- as.data.frame(sapply(nsf[,-1], function(x) gsub("[[:punct:]]", "", x)))
patents[,-1] <- as.data.frame(sapply(patents[,-1], function(x) gsub(" ", "", x)))

-Remove user defined list of common words

toRemove <- c(" JR", " SR", " II", " III", " IV")
for (tR in toRemove) {
    patents$firstname <- gsub(tR, "", patents$firstname)
    patents$lastname <- gsub(tR, "", patents$lastname)
    nsf$firstname <- gsub(tR, "", nsf$firstname)
    nsf$lastname <- gsub(tR, "", nsf$lastname)
}

Case Study: Preprocessing the data contd.

-Remove spaces

patents[,-1] <- as.data.frame(sapply(patents[,-1], function(x) gsub(" ", "", x)))

nsf[,-1] <- as.data.frame(sapply(nsf[,-1], function(x) gsub(" ", "", x)))
  • Create “blocking” field
patents$flast <- paste(substring(patents$firstname,1,1), patents$lastname, sep = '')
nsf$flast <- paste(substring(nsf$firstname,1,1), nsf$lastname, sep = '')

Case Study: Performing Linkage

  • Create comparison vectors for pairs after blocking
require(RecordLinkage)
a <- compare.linkage(nsf, patents, blockfld = c("state"), strcmp = T, exclude=c(1))
print(head(a$pairs))
  id1 id2 firstname  lastname      city state organization     flast
1 671 445 0.5952381 0.4444444 0.4142857     1    0.5588337 0.4285714
2 671 864 0.5277778 0.5000000 0.4222222     1    0.4780193 0.4650794
3 671 444 0.5952381 0.0000000 0.4650794     1    0.5588337 0.0000000
4 643 445 0.5619048 0.5750000 0.0000000     1    0.5395022 0.5026455
5 643 864 0.5111111 0.5648148 0.3444444     1    0.5277778 0.5314815
6 643 444 0.5619048 0.0000000 0.5857143     1    0.5395022 0.0000000
  is_match
1       NA
2       NA
3       NA
4       NA
5       NA
6       NA

Case Study: Performing Linkage contd.

  • Calculate \( M \) and \( U \) weights using the EM algorithm
b <- emWeights(a, cutoff = 0.8)
  • Define \( 0.8 \) as cut off for string comparators, converting distance metrics into 0/1 binary
  • Initial estimates of \( M \) and \( U \) are set by the algorithm using frequencies of specific values for each column.

Case Study: Performing Linkage contd.

  • Summary of weights calculated by EM algorithm can be obtained
summary(b)

Linkage Data Set

1000 records in data set 1 
1000 records in data set 2 
45552 record pairs 

0 matches
0 non-matches
45552 pairs with unknown status


Weight distribution:

[-10,-8]  (-8,-6]  (-6,-4]  (-4,-2]   (-2,0]    (0,2]    (2,4]    (4,6] 
   43111     2202        0       42        0       22        0        0 
   (6,8]   (8,10]  (10,12]  (12,14]  (14,16]  (16,18]  (18,20] 
       0       75        5        0        2        0       16 

Case Study: Determining threshold

  • Weights valid only within the context of the problem
  • Manually look at the matches to determine threshold
allPairs <- getPairs(b)
head(allPairs)
    id     seq firstname lastname       city state
1  877  474768   ABIGAIL   LEVINE LOSANGELES    CA
2  754 2184578     GREGG    LEVIN   LOSALTOS    CA
3                                                 
4  936   95632   ABRAHAM    KLEIN  ARLINGTON    VA
5  658 3056777 ALEXANDER    KLEIN   RICHMOND    VA
6                                                 
                      organization   flast     Weight
1 UNIVERSITYOFCALIFORNIALOSANGELES ALEVINE           
2      BRIDGEWAVECOMMUNICATIONSINC  GLEVIN 19.9658993
3                                                    
4                      TRAVELAWARD  AKLEIN           
5                   ALEXANDERKLEIN  AKLEIN 19.1687678
6                                                    

Case Study: Get Links

  • Determine two thresholds
  • Obtain doubtable cases and sure cases
  • Sample records for manual validation to determine error rates
finalPairs <- getPairs(b, max.weight = 14, min.weight = 0)
head(finalPairs)
    id     seq firstname   lastname         city state
1  247  318499    AKEITH     DUNKER PHILADELPHIA    PA
2  428 1000743   EDMUNDM       DUNN PHILADELPHIA    PA
3                                                     
4  570  354481     AARON PIETRUSZKA     SANDIEGO    CA
5  901 2437552   FABRICE        PIU     SANDIEGO    CA
6                                                     
                       organization       flast     Weight
1                  TEMPLEUNIVERSITY     ADUNKER           
2          MUTUALINDUSTRIESNORTHINC       EDUNN 10.5823350
3                                                         
4 SANDIEGOSTATEUNIVERSITYFOUNDATION APIETRUSZKA           
5                        OTONOMYINC        FPIU 10.5617876
6                                                         

Issues

  • Missing Data
  • Sample bias
  • Propogation of errors into analyses

Machine Learning Approaches

  • Using supervised learning algorithms for record linkage
  • Automatic creation of training datasets

    Machine learning vs Probabilistic Linkage
    alt text *** AHMAD EMAD

Machine Learning Approaches Contd.

  • Each row in the training data contains output from a record comparison plus a class label
  • Can input lots of variables that were not directly comparable in Fellegi-Sunter
  • Use measures of variable importance in supervised algorithms like Random Forest
  • Can fit a multi-class model based on probabilistic match scores (should help correctly classify edge cases)
  • Need to have a training dataset

Training Data

  • Can be generated manually
  • If we have additional information for subsets of data, we can use that to generate a training dataset automatically

Conclusions

  • Without a significant amount of clerical review, it???s hard to understand linkage quality
  • RecordLinkage package makes it simple to obtain initial links
  • Given a set of reliable links that are representative, we can study linkage parameters and attempt to train more sophisticated models.

Further Reading

  • Andreas Borg and Murat Sariyar (2015). RecordLinkage: Record Linkage in R.
  • Handbook of Vital Statistics Systems and Methods, Volume 1: Legal, Organisational and Technical Aspects, United Nations Studies in Methods, Glossary, Series F, No. 35, United Nations, New York 1991.
  • Breiman, Leo. “Random forests.” Machine learning 45.1 (2001): 5-32.
  • Fellegi, Ivan P., and Alan B. Sunter. “A theory for record linkage.” Journal of the American Statistical Association 64.328 (1969): 1183-1210.
  • Ventura, Samuel L., Rebecca Nugent, and Erica RH Fuchs. “Seeing the non-stars:(Some) sources of bias in past disambiguation approaches and a new public tool leveraging labeled records.” Research Policy 44.9 (2015): 1672-1701.

Further Reading

  • Winkler, William E. “Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage.” Proceedings of the Section on Survey Research Methods, American Statistical Association. Vol. 667. 1988.
  • Treeratpituk, Pucktada, and C. Lee Giles. “Disambiguating authors in academic publications using random forests.” Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries. ACM, 2009.