R for Record Linkage

Ahmad Emad & Celeste Stone
February 18th, 2016

Definition

determine if pairs of data records describe the same entity
Entities: usually people, organizations
Data records: names, addresses, date of birth, etc

Applications

Matching two disparate data sources with no common unique identifiers
Deduplicating entities within a dataset

Techniques

Manual

Not feasible as the number of records grows

Deterministic

Automatic comparisons where either everything needs to match, or specific data specific rules are programmed.
Needs new rule for every variation in data
Not generalizable to other datasets

Probabilistic

Estimate probability/likelihood two entities are same
Handles missing data, variations in coding

Probabilistic Matching Workflow

Preprocessing: developing link keys, extracting information from link keys, normalization of link keys
Reduction of search space: Blocking
Comparison: String metrics, year comparisons, numeric comparisons
Classification: Fellegi-Sunter Model
Final prediction: cut off scores, validation

Fellegi-Sunter Model

In the simplest version of the model:

Two sets to be linked \( A \) and \( B \) where \( A \cup B = M \times U \) (matches and nonmatches)
A binary vector-valued comparison function \( \gamma : A \times B \rightarrow {0,1}^d \)
There are weights \( m_j \) such that if \( g \) is a particular comparison outcome and \( z_{jk} \) is an indicator which is equal to 1, assume \[ P(\gamma = g\vert M) \prod_{j=1}^{d} P(\gamma_j = g_j\vert M) = \prod_{j=1}^{d} {m_j}^{g_{j}} (1-m_j)^{1-g_{j}} \]
\( P(\gamma = g\vert M) \) is defined similarly.
Weights \( m_j \) can be fitted using expectation-maximization (EM) or Bayesian methods.
Extends naturally to comparison outcomes with more than 2 levels

Fellegi-Sunter Model

To determine if \( a \in A \) and \( b \in B \) represent the same individual, compute the comparison vector \( \gamma(a, b) = g \)
and take the log-likelihood ratio (match score):

\[ log L = log P(\gamma=g\vert M) - log P(\gamma=g\vert U) \]
More likely matches will have a higher match score. By studying the data, a threshold is found such that pairs with a match score above the threshold are considered matches.

String Comparators

Jaro distance

– calculates the number \( m \) of common characters that are within half the length of the longer string and the number of transpositions \( t \)

Jaro dist = \( \frac{1}{3} (\dfrac{m}{\vert s_1 \vert}+ \dfrac{m}{\vert s_2 \vert} + \dfrac{m-t}{\vert m\vert}) \)

Jaro-Winkler distance

– improves upon the Jaro algorithm by applying ideas based on empirical studies – Fewer errors occur at the beginning of strings

Jaro-winkler dist = \( Jaro\;dist + l(1-Jaro\;dist) \)
\( l \) is the length of common prefix at the start of the string up to a maximum of 4 characters

String Comparators

Levenshtein Distance

– Minimum number of edits needed to convert \( s_1 \) into \( s_2 \)
– Normalized by dividing by length of longer string
– Not widely used

Phonetic (Soundex)

– indexing names by sound, as pronounced in English
– Mainly encodes consonants, vowels are considered unless it is the first letter
– Do not work well on Asian names

Need to convert string distance metrics to discrete values.

Case Study: Example Research Question

How many projects at the National Science Foundation (NSF) result in patents?

Send out a survey to NSF researchers, time consuming and non response bias.
Link researchers from NSF to inventors in USPTO database.

Case Study: Input Data

Sources of Data

NSF

– https://www.nsf.gov/awardsearch/download.jsp
– Data on NSF awards and researchers
– 230,531 individual researchers

USPTO

– http://www.patentsview.org/download/
– Data on patents and inventors in the US since 1976
– 2,019,077 inventors

Case Study: Input Data

Relevant fields

– First name, last name, institution information, City name, state name

Blocking fields

– Use first initial, last name and state as blocking fields.

Case Study First look at the data

nsf <- read.csv("./data/nsf.csv", nrows = 1000)
print(head(nsf))

  InvestigatorId FirstName       LastName
1          62182        *. Atta-Ur-Rahman
2          60555        *. Atta-Ur-Rahman
3         411085         -          Robby
4         353742        -.           None
5          79724         A          Agate
6         271298         A          Anwar
                                     Name        CityName StateCode
1 Pennsylvania State Univ University Park UNIVERSITY PARK        PA
2                   University of Karachi         Karachi      NULL
3                 Kansas State University       MANHATTAN        KS
4        Rutgers University New Brunswick   NEW BRUNSWICK        NJ
5                        Individual Award       Baltimore        MD
6               University of Connecticut          Storrs        CT

Case Study: First look at the data contd

patents <- read.csv("./data/patents2.csv", nrows = 1000)
print(head(patents))

      seq     firstname      lastname                 city state
1 2931635          Eryk         Stacy            Silverado    CA
2 2931636    Donald Lee        Brisco        Lake Elsinore    CA
3  509994       Kirk R.          Hyde Palos Verdes Estates    CA
4  898216         Ralph         Jelic             Valencia    PA
5 3475416 Thomas Joseph       Wozniak          Minneapolis    MN
6 3401630    Domenic A. Tortolano Jr.             Johnston    RI
                         organization
1   Fleetwood Aluminum Products, Inc.
2   Fleetwood Aluminum Products, Inc.
3    Kirkplastic Company Incorporated
4 International Window Fashions, Inc.
5                     Terrybear, Inc.
6                   D. Gioielli, Inc.

Case Study: Preprocessing the data

Identifying common fields

patents <- patents[,c("seq", "firstname", "lastname", "city", "state", "organization")]
nsf <- nsf[, c("InvestigatorId", "FirstName", "LastName", "CityName", "StateCode", "Name")]
names(nsf) <- names(patents)

Convert to Upper Case

patents[,-1] <- as.data.frame(sapply(patents[,-1], toupper))
nsf[,-1] <- as.data.frame(sapply(nsf[,-1], toupper))

Case Study: Preprocessing the data contd.

Remove Punctuations

patents[,-1] <- as.data.frame(sapply(patents[,-1], function(x) gsub("[[:punct:]]", "", x)))
nsf[,-1] <- as.data.frame(sapply(nsf[,-1], function(x) gsub("[[:punct:]]", "", x)))
patents[,-1] <- as.data.frame(sapply(patents[,-1], function(x) gsub(" ", "", x)))

-Remove user defined list of common words

toRemove <- c(" JR", " SR", " II", " III", " IV")
for (tR in toRemove) {
    patents$firstname <- gsub(tR, "", patents$firstname)
    patents$lastname <- gsub(tR, "", patents$lastname)
    nsf$firstname <- gsub(tR, "", nsf$firstname)
    nsf$lastname <- gsub(tR, "", nsf$lastname)
}

Case Study: Preprocessing the data contd.

-Remove spaces

patents[,-1] <- as.data.frame(sapply(patents[,-1], function(x) gsub(" ", "", x)))

nsf[,-1] <- as.data.frame(sapply(nsf[,-1], function(x) gsub(" ", "", x)))

Create “blocking” field

patents$flast <- paste(substring(patents$firstname,1,1), patents$lastname, sep = '')
nsf$flast <- paste(substring(nsf$firstname,1,1), nsf$lastname, sep = '')

Case Study: Performing Linkage

Create comparison vectors for pairs after blocking

require(RecordLinkage)
a <- compare.linkage(nsf, patents, blockfld = c("state"), strcmp = T, exclude=c(1))
print(head(a$pairs))

  id1 id2 firstname  lastname      city state organization     flast
1 671 445 0.5952381 0.4444444 0.4142857     1    0.5588337 0.4285714
2 671 864 0.5277778 0.5000000 0.4222222     1    0.4780193 0.4650794
3 671 444 0.5952381 0.0000000 0.4650794     1    0.5588337 0.0000000
4 643 445 0.5619048 0.5750000 0.0000000     1    0.5395022 0.5026455
5 643 864 0.5111111 0.5648148 0.3444444     1    0.5277778 0.5314815
6 643 444 0.5619048 0.0000000 0.5857143     1    0.5395022 0.0000000
  is_match
1       NA
2       NA
3       NA
4       NA
5       NA
6       NA

Case Study: Performing Linkage contd.

Calculate \( M \) and \( U \) weights using the EM algorithm

b <- emWeights(a, cutoff = 0.8)

Define \( 0.8 \) as cut off for string comparators, converting distance metrics into 0/1 binary
Initial estimates of \( M \) and \( U \) are set by the algorithm using frequencies of specific values for each column.

Case Study: Performing Linkage contd.

Summary of weights calculated by EM algorithm can be obtained

summary(b)


Linkage Data Set

1000 records in data set 1 
1000 records in data set 2 
45552 record pairs 

0 matches
0 non-matches
45552 pairs with unknown status


Weight distribution:

[-10,-8]  (-8,-6]  (-6,-4]  (-4,-2]   (-2,0]    (0,2]    (2,4]    (4,6] 
   43111     2202        0       42        0       22        0        0 
   (6,8]   (8,10]  (10,12]  (12,14]  (14,16]  (16,18]  (18,20] 
       0       75        5        0        2        0       16

Case Study: Determining threshold

Weights valid only within the context of the problem
Manually look at the matches to determine threshold

allPairs <- getPairs(b)
head(allPairs)

    id     seq firstname lastname       city state
1  877  474768   ABIGAIL   LEVINE LOSANGELES    CA
2  754 2184578     GREGG    LEVIN   LOSALTOS    CA
3                                                 
4  936   95632   ABRAHAM    KLEIN  ARLINGTON    VA
5  658 3056777 ALEXANDER    KLEIN   RICHMOND    VA
6                                                 
                      organization   flast     Weight
1 UNIVERSITYOFCALIFORNIALOSANGELES ALEVINE           
2      BRIDGEWAVECOMMUNICATIONSINC  GLEVIN 19.9658993
3                                                    
4                      TRAVELAWARD  AKLEIN           
5                   ALEXANDERKLEIN  AKLEIN 19.1687678
6

Case Study: Get Links

Determine two thresholds
Obtain doubtable cases and sure cases
Sample records for manual validation to determine error rates

finalPairs <- getPairs(b, max.weight = 14, min.weight = 0)
head(finalPairs)

    id     seq firstname   lastname         city state
1  247  318499    AKEITH     DUNKER PHILADELPHIA    PA
2  428 1000743   EDMUNDM       DUNN PHILADELPHIA    PA
3                                                     
4  570  354481     AARON PIETRUSZKA     SANDIEGO    CA
5  901 2437552   FABRICE        PIU     SANDIEGO    CA
6                                                     
                       organization       flast     Weight
1                  TEMPLEUNIVERSITY     ADUNKER           
2          MUTUALINDUSTRIESNORTHINC       EDUNN 10.5823350
3                                                         
4 SANDIEGOSTATEUNIVERSITYFOUNDATION APIETRUSZKA           
5                        OTONOMYINC        FPIU 10.5617876
6

Issues

Missing Data
Sample bias
Propogation of errors into analyses

Machine Learning Approaches

Using supervised learning algorithms for record linkage
Automatic creation of training datasets

Machine learning vs Probabilistic Linkage
*** AHMAD EMAD

Machine Learning Approaches Contd.

Each row in the training data contains output from a record comparison plus a class label
Can input lots of variables that were not directly comparable in Fellegi-Sunter
Use measures of variable importance in supervised algorithms like Random Forest
Can fit a multi-class model based on probabilistic match scores (should help correctly classify edge cases)
Need to have a training dataset

Training Data

Can be generated manually
If we have additional information for subsets of data, we can use that to generate a training dataset automatically

Conclusions

Without a significant amount of clerical review, it???s hard to understand linkage quality
RecordLinkage package makes it simple to obtain initial links
Given a set of reliable links that are representative, we can study linkage parameters and attempt to train more sophisticated models.

R for Record Linkage

Techniques

Probabilistic Matching Workflow

Probabilistic Matching Workflow

Fellegi-Sunter Model

Fellegi-Sunter Model

String Comparators

Jaro distance

Jaro-Winkler distance

String Comparators

Levenshtein Distance

Phonetic (Soundex)

Case Study: Example Research Question

Case Study: Input Data

Sources of Data

NSF

USPTO

Case Study: Input Data

Relevant fields

Blocking fields

Case Study First look at the data

Case Study: First look at the data contd

Case Study: Preprocessing the data

Case Study: Preprocessing the data contd.

Case Study: Preprocessing the data contd.

Case Study: Performing Linkage

Case Study: Performing Linkage contd.

Case Study: Performing Linkage contd.

Case Study: Determining threshold

Case Study: Get Links

Issues

Machine Learning Approaches

Machine Learning Approaches Contd.

Training Data

Conclusions

Further Reading

Further Reading