Ahmad Emad & Celeste Stone
February 18th, 2016
Definition
Applications
Manual
Deterministic
Probabilistic
In the simplest version of the model:
– Jaro: counts the number \( m \) of common characters (equal characters whose positions are no farther apart than about half the length of the longer string) and the number of transpositions \( t \) (half the number of common characters that appear in a different order)
Jaro dist = \( \frac{1}{3} \left( \dfrac{m}{\vert s_1 \vert} + \dfrac{m}{\vert s_2 \vert} + \dfrac{m-t}{m} \right) \)
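The formula above can be sketched in base R. This is a minimal illustration, not the implementation used by any particular package: the function name `jaro` is hypothetical, and the matching window is assumed to be half the longer length, rounded down, minus one, which is the usual convention.

```r
# Jaro similarity between two strings (1 = identical, 0 = nothing in common).
jaro <- function(s1, s2) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  if (length(a) == 0 || length(b) == 0) return(0)
  # Assumed window: characters match only if their positions differ by at most this much
  window <- max(floor(max(length(a), length(b)) / 2) - 1, 0)
  matched_a <- rep(FALSE, length(a))
  matched_b <- rep(FALSE, length(b))
  for (i in seq_along(a)) {
    lo <- max(1, i - window)
    hi <- min(length(b), i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!matched_b[j] && a[i] == b[j]) {
        matched_a[i] <- TRUE
        matched_b[j] <- TRUE
        break
      }
    }
  }
  m <- sum(matched_a)
  if (m == 0) return(0)
  # Transpositions: half the number of common characters that are out of order
  n_transp <- sum(a[matched_a] != b[matched_b]) / 2
  (m / length(a) + m / length(b) + (m - n_transp) / m) / 3
}

jaro("MARTHA", "MARHTA")   # m = 6, t = 1, so (1 + 1 + 5/6) / 3
```

For "MARTHA" and "MARHTA" all six characters are common and the swapped T and H count as one transposition, which gives 17/18, roughly 0.944.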
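– Jaro-Winkler: improves upon the Jaro algorithm by applying ideas based on empirical studies
– Fewer errors occur at the beginning of strings, so a shared prefix is rewarded
Jaro-Winkler dist = \( Jaro\;dist + l \cdot p \,(1-Jaro\;dist) \)
\( l \) is the length of the common prefix at the start of the two strings, up to a maximum of 4 characters, and \( p \) is a constant scaling factor (0.1 in Winkler's standard formulation)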
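A small worked example of the prefix adjustment, written as a sketch in base R. The function name `jaro_winkler` and its arguments are hypothetical; it takes an already computed Jaro value and assumes the usual scaling factor p = 0.1, which multiplies the prefix length so that the result stays at or below 1.

```r
# Apply the Winkler prefix bonus to a precomputed Jaro similarity.
# p = 0.1 and max_prefix = 4 are the conventional defaults (assumed here).
jaro_winkler <- function(jaro_sim, s1, s2, p = 0.1, max_prefix = 4) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  n <- min(length(a), length(b), max_prefix)
  l <- 0
  for (i in seq_len(n)) {
    if (a[i] != b[i]) break   # stop at the first differing character
    l <- l + 1
  }
  jaro_sim + l * p * (1 - jaro_sim)
}

# "MARTHA" vs "MARHTA": Jaro = 17/18, common prefix "MAR" gives l = 3
jaro_winkler(17/18, "MARTHA", "MARHTA")
```

With l = 3 the score rises from about 0.944 to about 0.961, because 30% of the remaining gap to 1 is closed.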
– Edit (Levenshtein) distance: minimum number of single-character edits (insertions, deletions, substitutions) needed to convert \( s_1 \) into \( s_2 \)
– Normalized by dividing by length of longer string
– Not widely used
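The normalized edit distance can be sketched with `adist()` from base R's utils package, which computes the generalized Levenshtein distance. The wrapper name `normalized_edit_distance` is hypothetical.

```r
# Edit distance divided by the length of the longer string (0 = identical).
normalized_edit_distance <- function(s1, s2) {
  d <- as.numeric(utils::adist(s1, s2))   # insertions, deletions, substitutions
  longest <- max(nchar(s1), nchar(s2))
  if (longest == 0) return(0)
  d / longest
}

normalized_edit_distance("LEVINE", "LEVIN")     # one deletion over 6 characters
normalized_edit_distance("KITTEN", "SITTING")   # three edits over 7 characters
```

Unlike the Jaro scores, this is a distance: smaller values mean more similar strings.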
– Soundex: indexes names by sound, as pronounced in English
– Mainly encodes consonants; vowels are ignored unless they are the first letter
– Does not work well on Asian names
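A minimal sketch of the common American Soundex rules in base R, to make the consonant encoding concrete. The function name `soundex` is hypothetical and several Soundex variants exist, so treat this as one reasonable reading rather than a reference implementation.

```r
# Soundex code: first letter plus three digits for the following consonant groups.
soundex <- function(name) {
  x <- toupper(gsub("[^A-Za-z]", "", name))
  if (nchar(x) == 0) return("")
  chars <- strsplit(x, "")[[1]]
  code_of <- c(B = "1", F = "1", P = "1", V = "1",
               C = "2", G = "2", J = "2", K = "2", Q = "2", S = "2", X = "2", Z = "2",
               D = "3", T = "3", L = "4", M = "5", N = "5", R = "6")
  out <- chars[1]
  prev <- if (chars[1] %in% names(code_of)) code_of[[chars[1]]] else ""
  for (ch in chars[-1]) {
    if (ch %in% c("H", "W")) next   # H and W do not separate equal codes
    code <- if (ch %in% names(code_of)) code_of[[ch]] else ""   # vowels and Y: no code
    if (code != "" && code != prev) out <- c(out, code)
    prev <- code
  }
  # Pad with zeros and keep exactly four characters
  substr(paste0(paste(out, collapse = ""), "000"), 1, 4)
}

soundex("ROBERT")   # "R163"
soundex("RUPERT")   # "R163"
soundex("LEVINE") == soundex("LEVIN")   # TRUE: both encode to "L150"
```

Names that sound alike in English collapse to the same code, which is also why the scheme says little about names whose pronunciation does not follow English spelling conventions.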
How many projects at the National Science Foundation (NSF) result in patents?
– https://www.nsf.gov/awardsearch/download.jsp
– Data on NSF awards and researchers
– 230,531 individual researchers
– http://www.patentsview.org/download/
– Data on patents and inventors in the US since 1976
– 2,019,077 inventors
– Fields available for matching: first name, last name, institution, city name, state name
– Use first initial, last name and state as blocking fields.
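To see why blocking matters, the sketch below counts candidate pairs with and without blocking. The two small data frames are hypothetical toy records, not rows from the NSF or PatentsView files, and `flast` stands for a combined first-initial plus last-name key.

```r
# Hypothetical toy records (illustration only).
left <- data.frame(id = 1:4,
                   flast = c("ALEVINE", "AKLEIN", "ADUNKER", "APIETRUSZKA"),
                   state = c("CA", "VA", "PA", "CA"),
                   stringsAsFactors = FALSE)
right <- data.frame(seq = 101:105,
                    flast = c("GLEVIN", "AKLEIN", "EDUNN", "FPIU", "AKLEIN"),
                    state = c("CA", "VA", "PA", "CA", "NY"),
                    stringsAsFactors = FALSE)

all_pairs    <- merge(left, right, by = NULL)                  # every record against every record
state_block  <- merge(left, right, by = "state")               # same state only
strict_block <- merge(left, right, by = c("state", "flast"))   # same state and same key

c(all = nrow(all_pairs), state = nrow(state_block), strict = nrow(strict_block))
```

Here the 20 possible pairs shrink to 6 when blocking on state and to 1 when blocking on state and name key together. Stricter blocking saves comparisons but can drop true matches whose blocking fields contain errors.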
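# Read the first 1,000 NSF investigator records; keep text columns as character
nsf <- read.csv("./data/nsf.csv", nrows = 1000, stringsAsFactors = FALSE)
print(head(nsf))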
  InvestigatorId FirstName       LastName
1          62182        *. Atta-Ur-Rahman
2          60555        *. Atta-Ur-Rahman
3         411085         -          Robby
4         353742        -.           None
5          79724         A          Agate
6         271298         A          Anwar
                                     Name        CityName StateCode
1 Pennsylvania State Univ University Park UNIVERSITY PARK        PA
2                   University of Karachi         Karachi      NULL
3                 Kansas State University       MANHATTAN        KS
4        Rutgers University New Brunswick   NEW BRUNSWICK        NJ
5                        Individual Award       Baltimore        MD
6               University of Connecticut          Storrs        CT
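# Read the first 1,000 patent inventor records; keep text columns as character
patents <- read.csv("./data/patents2.csv", nrows = 1000, stringsAsFactors = FALSE)
print(head(patents))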
      seq     firstname      lastname                 city state
1 2931635          Eryk         Stacy            Silverado    CA
2 2931636    Donald Lee        Brisco        Lake Elsinore    CA
3  509994       Kirk R.          Hyde Palos Verdes Estates    CA
4  898216         Ralph         Jelic             Valencia    PA
5 3475416 Thomas Joseph       Wozniak          Minneapolis    MN
6 3401630    Domenic A. Tortolano Jr.             Johnston    RI
                         organization
1   Fleetwood Aluminum Products, Inc.
2   Fleetwood Aluminum Products, Inc.
3    Kirkplastic Company Incorporated
4 International Window Fashions, Inc.
5                     Terrybear, Inc.
6                   D. Gioielli, Inc.
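# Put the comparable columns in the same order in both tables
patents <- patents[, c("seq", "firstname", "lastname", "city", "state", "organization")]
nsf <- nsf[, c("InvestigatorId", "FirstName", "LastName", "CityName", "StateCode", "Name")]
# Give both tables identical column names so they can be compared field by field
names(nsf) <- names(patents)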
-Convert to upper case
patents[-1] <- lapply(patents[-1], toupper)
nsf[-1] <- lapply(nsf[-1], toupper)
-Remove punctuation
patents[-1] <- lapply(patents[-1], function(x) gsub("[[:punct:]]", "", x))
nsf[-1] <- lapply(nsf[-1], function(x) gsub("[[:punct:]]", "", x))
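-Remove user-defined list of common words
toRemove <- c(" JR", " SR", " II", " III", " IV")
for (tR in toRemove) {
  # \\b requires a word boundary after the suffix, so " II" cannot match inside " III"
  pattern <- paste0(tR, "\\b")
  patents$firstname <- gsub(pattern, "", patents$firstname, perl = TRUE)
  patents$lastname <- gsub(pattern, "", patents$lastname, perl = TRUE)
  nsf$firstname <- gsub(pattern, "", nsf$firstname, perl = TRUE)
  nsf$lastname <- gsub(pattern, "", nsf$lastname, perl = TRUE)
}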
-Remove spaces
patents[-1] <- lapply(patents[-1], function(x) gsub(" ", "", x))
nsf[-1] <- lapply(nsf[-1], function(x) gsub(" ", "", x))
-Build a combined key: first initial + last name
patents$flast <- paste0(substring(patents$firstname, 1, 1), patents$lastname)
nsf$flast <- paste0(substring(nsf$firstname, 1, 1), nsf$lastname)
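The cleaning steps can also be collected into one helper, which makes their order explicit: suffixes have to be removed while spaces are still present, because the suffix patterns rely on the leading space. This is a sketch only; `normalize_field` is a hypothetical name and the suffix list mirrors the one used above.

```r
# Upper-case, strip punctuation, drop name suffixes, then remove spaces.
normalize_field <- function(x, suffixes = c("JR", "SR", "II", "III", "IV")) {
  x <- toupper(x)
  x <- gsub("[[:punct:]]", "", x)
  # A suffix must follow a space and end at a word boundary
  pattern <- paste0(" (", paste(suffixes, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)
  gsub(" ", "", x)
}

normalize_field(c("Tortolano Jr.", "Kirk R.", "Atta-Ur-Rahman"))
```

Applied to the three sample values this yields "TORTOLANO", "KIRKR" and "ATTAURRAHMAN".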
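library(RecordLinkage)
# Compare only pairs that share the same state (blocking); score the remaining
# fields with a string comparator and leave the ID column out of the comparison
a <- compare.linkage(nsf, patents, blockfld = c("state"), strcmp = TRUE, exclude = c(1))
print(head(a$pairs))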
id1 id2 firstname lastname city state organization flast
1 671 445 0.5952381 0.4444444 0.4142857 1 0.5588337 0.4285714
2 671 864 0.5277778 0.5000000 0.4222222 1 0.4780193 0.4650794
3 671 444 0.5952381 0.0000000 0.4650794 1 0.5588337 0.0000000
4 643 445 0.5619048 0.5750000 0.0000000 1 0.5395022 0.5026455
5 643 864 0.5111111 0.5648148 0.3444444 1 0.5277778 0.5314815
6 643 444 0.5619048 0.0000000 0.5857143 1 0.5395022 0.0000000
is_match
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
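# Estimate match weights with the EM algorithm; string similarities that reach
# the 0.8 cutoff are treated as agreement
b <- emWeights(a, cutoff = 0.8)
summary(b)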
Linkage Data Set
1000 records in data set 1
1000 records in data set 2
45552 record pairs
0 matches
0 non-matches
45552 pairs with unknown status
Weight distribution:
[-10,-8] (-8,-6] (-6,-4] (-4,-2] (-2,0] (0,2] (2,4] (4,6]
43111 2202 0 42 0 22 0 0
(6,8] (8,10] (10,12] (12,14] (14,16] (16,18] (18,20]
0 75 5 0 2 0 16
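The weights in this distribution can be read through the Fellegi-Sunter model: for each field, m is the probability of agreement among true matches and u the probability of agreement among non-matches, and the weight of a pair is the log (base 2) ratio of the two. The sketch below assumes conditional independence between fields and uses made-up m and u values; it is not the output of the EM step.

```r
# Hypothetical m- and u-probabilities for three fields (illustrative values only).
m <- c(lastname = 0.95, firstname = 0.90, city = 0.80)
u <- c(lastname = 0.01, firstname = 0.05, city = 0.10)

# Weight of one agreement pattern: sum of per-field log2 likelihood ratios.
pattern_weight <- function(agree, m, u) {
  sum(ifelse(agree, log2(m / u), log2((1 - m) / (1 - u))))
}

pattern_weight(c(TRUE, TRUE, TRUE), m, u)      # all fields agree: strongly positive
pattern_weight(c(TRUE, FALSE, FALSE), m, u)    # only last name agrees
pattern_weight(c(FALSE, FALSE, FALSE), m, u)   # nothing agrees: strongly negative
```

Agreement on a field that rarely agrees by chance (small u) adds a large positive amount, which is why most blocked pairs sit at the negative end of the distribution and only a handful reach the top bins.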
# List the compared record pairs together with their weights
allPairs <- getPairs(b)
head(allPairs)
id seq firstname lastname city state
1 877 474768 ABIGAIL LEVINE LOSANGELES CA
2 754 2184578 GREGG LEVIN LOSALTOS CA
3
4 936 95632 ABRAHAM KLEIN ARLINGTON VA
5 658 3056777 ALEXANDER KLEIN RICHMOND VA
6
organization flast Weight
1 UNIVERSITYOFCALIFORNIALOSANGELES ALEVINE
2 BRIDGEWAVECOMMUNICATIONSINC GLEVIN 19.9658993
3
4 TRAVELAWARD AKLEIN
5 ALEXANDERKLEIN AKLEIN 19.1687678
6
# Restrict the listing to pairs with weights between 0 and 14
finalPairs <- getPairs(b, max.weight = 14, min.weight = 0)
head(finalPairs)
id seq firstname lastname city state
1 247 318499 AKEITH DUNKER PHILADELPHIA PA
2 428 1000743 EDMUNDM DUNN PHILADELPHIA PA
3
4 570 354481 AARON PIETRUSZKA SANDIEGO CA
5 901 2437552 FABRICE PIU SANDIEGO CA
6
organization flast Weight
1 TEMPLEUNIVERSITY ADUNKER
2 MUTUALINDUSTRIESNORTHINC EDUNN 10.5823350
3
4 SANDIEGOSTATEUNIVERSITYFOUNDATION APIETRUSZKA
5 OTONOMYINC FPIU 10.5617876
6
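One way to turn weights into decisions is to use two thresholds. The sketch below assumes that the 0 and 14 used in the `getPairs` call act as lower and upper thresholds, so that pairs below 0 are non-links, pairs at or above 14 are links, and everything in between is left for manual review. That reading, the handling of the boundary values, and the name `classify_weight` are assumptions, not something the output above states.

```r
# Weights taken from the pair listings above.
weights <- c(19.9658993, 19.1687678, 10.5823350, 10.5617876)

# Three-way classification with assumed thresholds (lower = 0, upper = 14).
classify_weight <- function(w, lower = 0, upper = 14) {
  as.character(cut(w,
                   breaks = c(-Inf, lower, upper, Inf),
                   labels = c("non-link", "possible link", "link"),
                   right = FALSE))
}

data.frame(weight = weights, decision = classify_weight(weights))
```

Under these assumptions the LEVINE/LEVIN and KLEIN/KLEIN pairs would be accepted automatically even though the first names differ, which suggests the upper threshold deserves a manual check as well. Within RecordLinkage, `emClassify()` takes `threshold.upper` and `threshold.lower` arguments for this kind of classification.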