8/21/2019

Ground Truth: all variables with correct correspondence

  • Import the original data and check perform a LS regression
  • This regression is the oracle model
wages_f <- read.csv("../data/fakedata/wages_df.csv", header = TRUE)
lm0 <- lm(log(WAGE) ~ SEX + EXPERIENCE + I(EXPERIENCE^2)  + 
            EDUCATION + as.factor(OCCUPATION) + UNION,
          data = wages_f)

R summary

coef(lm0)
##            (Intercept)                    SEX             EXPERIENCE 
##           1.0094294300          -0.2135490993           0.0299532046 
##        I(EXPERIENCE^2)              EDUCATION as.factor(OCCUPATION)2 
##          -0.0004441696           0.0730130844          -0.3161433765 
## as.factor(OCCUPATION)3 as.factor(OCCUPATION)4 as.factor(OCCUPATION)5 
##          -0.2262765855          -0.3658979172          -0.0450243656 
## as.factor(OCCUPATION)6                  UNION 
##          -0.1964030978           0.1998655883

Importing wages data files

  • First import the data files.
wages_dA <- read.csv("../data/fakedata/wages_dA.csv", header = TRUE)
wages_dB <- read.csv("../data/fakedata/wages_dB.csv", header = TRUE)
  • Check both data files
  • What variables are approriate for linkage?
## [1] "ID"      "FNAME"   "LNAME"   "SEX"     "AGE"     "ADDRESS" "ZIPCODE"
## [8] "WAGE"
##  [1] "ID"         "OCCUPATION" "SECTOR"     "UNION"      "EDUCATION" 
##  [6] "EXPERIENCE" "AGE"        "SEX"        "MARR"       "RACE"      
## [11] "SOUTH"      "FNAME"      "LNAME"      "ADDRESS"    "ZIPCODE"

Linking the data files

  • Upload the R package fastLink for linkg files

  • In this experiment we link two files based on ZIPCODE.

  • Check fastlink function first.

help(fastLink)
## starting httpd help server ... done
  • To specify matching variables we choose varnames = “ZIPCODE”.
set.seed(1427)
wages_link<-fastLink(wages_dA, wages_dB, varnames = "ZIPCODE")
## 
## ==================== 
## fastLink(): Fast Probabilistic Record Linkage
## ==================== 
## 
## Calculating matches for each variable.
## Getting counts for parameter estimation.
##     Parallelizing calculation using OpenMP. 1 threads out of 8 are used.
## Running the EM algorithm.
## Getting the indices of estimated matches.
##     Parallelizing calculation using OpenMP. 1 threads out of 8 are used.
## Deduping the estimated matches.
## Getting the match patterns for each estimated match.

The linked data

  • We can get the linked data file form *getMatch" function.
matched_wages <- getMatches(wages_dA, 
                  wages_dB, wages_link, combine.dfs = FALSE)

dA<-matched_wages$dfA.match; dB<-matched_wages$dfB.match

Merging these two files

  • First ensure unique column names in the merged file
commonvars <- intersect(colnames(dA), colnames(dB))
colnames(dA)[colnames(dA) %in% commonvars] <- 
  paste("A.", colnames(dA)[colnames(dA) %in% commonvars], sep="")
colnames(dB)[colnames(dB) %in% commonvars] <-
  paste("B.", colnames(dB)[colnames(dB) %in% commonvars], sep="")

Merge the data linked by fastLink

Check the inear regression with linked file

lm_merged <- lm(log(WAGE) ~ B.SEX + EXPERIENCE + I(EXPERIENCE^2)  + 
            EDUCATION + as.factor(OCCUPATION) + UNION,
          data = merged_wages)

Summary of the LS regression

summary(lm_merged)
## 
## Call:
## lm(formula = log(WAGE) ~ B.SEX + EXPERIENCE + I(EXPERIENCE^2) + 
##     EDUCATION + as.factor(OCCUPATION) + UNION, data = merged_wages)
## 
## Residuals:
##               Min                1Q            Median                3Q 
## -1.19469166951699 -0.12984474066534  0.03236470248122  0.15907570777571 
##               Max 
##  1.01039290953333 
## 
## Coefficients:
##                                   Estimate          Std. Error  t value
## (Intercept)             1.196847505769e+00  1.048946158902e-01 11.41000
## B.SEX                  -1.887882270381e-01  2.590829553157e-02 -7.28679
## EXPERIENCE              2.808217223814e-02  3.317539725598e-03  8.46476
## I(EXPERIENCE^2)        -4.280056942423e-04  7.278559429769e-05 -5.88036
## EDUCATION               5.915856155686e-02  6.189785114960e-03  9.55745
## as.factor(OCCUPATION)2 -2.575914184585e-01  5.757259778385e-02 -4.47420
## as.factor(OCCUPATION)3 -2.255277397914e-01  4.804571900078e-02 -4.69402
## as.factor(OCCUPATION)4 -3.514718608619e-01  5.056970361290e-02 -6.95025
## as.factor(OCCUPATION)5 -7.520013201953e-02  4.540895880861e-02 -1.65606
## as.factor(OCCUPATION)6 -1.809642839349e-01  4.788119925594e-02 -3.77944
## UNION                   1.802265346728e-01  3.134039879493e-02  5.75061
##                          Pr(>|t|)    
## (Intercept)            < 2.22e-16 ***
## B.SEX                  1.1774e-12 ***
## EXPERIENCE             2.5976e-16 ***
## I(EXPERIENCE^2)        7.3054e-09 ***
## EDUCATION              < 2.22e-16 ***
## as.factor(OCCUPATION)2 9.4171e-06 ***
## as.factor(OCCUPATION)3 3.4261e-06 ***
## as.factor(OCCUPATION)4 1.0899e-11 ***
## as.factor(OCCUPATION)5 0.09830872 .  
## as.factor(OCCUPATION)6 0.00017527 ***
## UNION                  1.5137e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.267948227777 on 523 degrees of freedom
## Multiple R-squared:  0.5192933279639,    Adjusted R-squared:  0.5101019958026 
## F-statistic: 56.49815705174 on 10 and 523 DF,  p-value: < 2.2204460493e-16
  • compare the result with the original regression with original data

Robust regression

  • Try robust regression with Huber loss
library(MASS)
rlm_merged <- rlm(log(WAGE) ~ B.SEX + EXPERIENCE + I(EXPERIENCE^2)  + 
            EDUCATION + as.factor(OCCUPATION) + UNION,
          data = merged_wages)
  • Comparing estimation error of naive and robust estimation
sqrt(sum((coef(lm0) - coef(lm_merged))^2))
## [1] 0.2027454431130401
sqrt(sum((coef(lm0) - coef(rlm_merged))^2))
## [1] 0.1701889571280111

Mixture model

  • Now we try the mixure modeling approach with composite likelihood:
source("../code/mixture_model.R")

X <- model.matrix(lm_merged)
X <- X[,!(colnames(X) %in% "(Intercept)")]
y <- model.extract(lm_merged$model, "response")
Xc <- apply(X, 2, function(z) z - mean(z))
yc <- y - mean(y)
tausq <- mean(yc^2)
f0 <- function(z) dnorm(z, mean = 0, sd = sqrt(tausq))

res <- fit_mixture(Xc, yc, f0, control = list(init = "robust"))
interc <- mean(y - X %*% res$betahat)

Comparing the results

  • Compute \(\| \widehat{\beta} - \widehat{\beta}_{LS}\|\), where \(\widehat{\beta}_{LS}\) denotes the estimates from the original LS regression model with original data and \(\widehat{\beta}\) is the estimate from, , Huber, or mixture models.
coef_mixture <- c(interc, res$betahat)

Comparing the results (continued)

sqrt(sum((coef(lm0) - coef_mixture)^2))
## [1] 0.0272892695833842
sqrt(sum((coef(lm0) - coef(rlm_merged))^2))
## [1] 0.1701889571280111
sqrt(sum((coef(lm0) - coef(lm_merged))^2))
## [1] 0.2027454431130401

Blocking method using information information about the linkage process

source("../code/lahiri_larsen.R")

blockix <- merged_wages$A.ZIPCODE 
barplot(table(as.numeric(table(blockix))))

Compute the estimates

beta_Q <- coef(lahiri_larsen_block(Xc, yc, blockix))
Q <- generate_Q_block(blockix) 
beta_Q_check <- lahiri_larsen(Xc, yc, Q)
    
interc <- mean(y - X %*% beta_Q)
coef_Q <- c(interc, beta_Q)
  • Compute \(\| \widehat{\beta} - \widehat{\beta}_{LS}\|\):
sqrt(sum((coef(lm0) - coef_Q)^2))
## [1] 0.07430098233549876

Re-matching based on sorting

source("../code/optimal_matching.R")

pihat <- optimal_matching(drop(Xc %*% res$betahat), yc, blockix)
pihatinv <- order(pihat)
pistar <- match(merged_wages[,"B.ID"], merged_wages[,"A.ID"])

Plot

plot(wages_f$WAGE[merged_wages[,"A.ID"]], wages_f$WAGE[merged_wages[,"B.ID"]], xlab = " wages in the linked data", ylab = "wages in the original data")
points(wages_f$WAGE[merged_wages[,"A.ID"]], wages_f$WAGE[merged_wages[pihat,"B.ID"]], col = "red")