Usos de micromatch

En este documento pretendemos mostrar unos cuantos usos de las funciones ahora mismo programadas en el paquete micromatch. Estas funciones han sido utilizadas para efectuar un ejercicio concreto entre las encuestas ECV y PRA de Eustat; ahora lo que interesa es analizar si estas funciones podrían generalizarse a otros casos.

Los datos de dos encuestas de Eustat, ECV y PRA, se han cargado en el propio paquete micromatch. Está pendiente documentarlas.

Manera (potencial) de usar micromatch con clases y metodos

Ademas de lo anterior, se analiza la posibilidad de generalizar las funciones de micromatch mediante el uso de clases y métodos. Se hace una prueba, ya incorporada en micromatch. La idea es que el usuario pudiera trabajar de la manera siguiente:

La opcion de emplear clases y metodos parece la mejor, no obstante requiere más conocimientos de programación y hay que valorar si merece la pena hacerlo.

En este documento, por tanto:

  1. Primero, valoraremos lo que ya tenemos programado, con datos de Eustat
  2. Segundo, haremos una prueba con clases y metodos

NOTA

Otros datos similares del INE se pueden obtener a traves del paquete MicroDatosEs. Queda pendiente analizar micromatch con estos datos 'nuevos'.

Etapas de un enlace de encuestas

Las funciones de micromatch se articulan de acuerdo con el esquema de las etapas de un enlace de encuestas (tomado de M. d'Orazio, M. Scanu, etc: proyecto ESSnet sobre Integración)

alt text

En micromatch distinguiremos 4 etapas, asociadas a familias de funciones

  1. Especificar objetivos del enlace: variables comunes y específicas, y resultado deseado (fichero sintético, tabla de contingencia cruzando variables específicas…)

  2. Seleccion de variables comunes de enlace (matching variables). De entre todas las variables comunes entre los ficheros nos quedaremos con aquellas que sean concordantes y que tengan utilidad para el matching. Familia: 'Select matching variables'.

  3. Enlace en si: por ejemplo, por un método hot-deck. Familia: 'Apply matching method'

  4. Validación de los resultados. Familia: 'Validate matching results'

Cargar paquete y datos

Encuestas de Eustat Datos en el mismo micromatch: 1. PRA: Población en relación con la actividad 2. ECV: Encuesta de condiciones de vida

library(micromatch)
## Loading required package: StatMatch
## Loading required package: proxy
## 
## Attaching package: 'proxy'
## 
## The following objects are masked from 'package:stats':
## 
##     as.dist, dist
## 
## Loading required package: clue
## Loading required package: survey
## 
## Attaching package: 'survey'
## 
## The following object is masked from 'package:graphics':
## 
##     dotchart
## 
## Loading required package: RANN
## Loading required package: ggplot2
## Loading required package: gridExtra
## Loading required package: grid
## Loading required package: mice
## Loading required package: Rcpp
## Loading required package: lattice
## mice 2.22 2014-06-10
## Loading required package: vcd
data(ecv)
data(pra)
dim(ecv)
## [1] 4749  417
dim(pra)
## [1] 10865    73

Etapa 1: Definir objetivos

Primero definir lo que se quiere hacer:

Listas de variables

#variables comunes candidatas
varCom <- c("ED", #Edad: estrato
            "S",  #Sexo: estrato
            "TF2", #Tamanyo familiar
            "EST", #Estudiante si (1) o no (0)
            "OCP", #Ocupado si/no
            "PAR", #Parado si/no
            "INA", #Inactivo si/no
            "BUSQ", #Buscando empleo si/no
            "DOM.com2") #Dedicacion a las tareas domesticas
#
#variables especificas ECV
varEsp <- c("SAL",  #condiciones de salud
            "IDM",  #conocimiento idiomas
            #"DOM",  #(la ponemos como variable comun) dedicacion tareas domesticas
            "NIN", #cuidado de ninyos
            "VAC", #lugar de vacaciones
            "LIB", 
            "RELI", 
            "RELF", 
            "EQP",
            "VIV",
            "VEH",
            "SRV", 
            "AMB",
            "DOM2",
            "ECO",
            "ING",
            "FIN" )
#variable especifica de PRA
vary <- "PRA22"

Etapa 2: Selección de variables

Coherencia de fuentes

Función: compareVar

Univariate comparisons: Compares one variable at a time

Only check first 2 variables.

#consultar documentacion: ?compareVar
#dar valores a los parametros fijos
fileA <- ecv
fileB <- pra
wA <- wB <- "calELE" #variable de peso calibrada
#
# all comparisons for a list of common vars at once
#absolute values, no plotting, no empirical measures
sapply(X=1:2, FUN=function(x){
        print(paste('Variable: ',varCom[x]))
        varA <- varB <- varCom[x]
        c <- compareVar(varA=varA, varB=varB,fileA=fileA,fileB=fileB,wA=wA, wB=wB,plot=FALSE,measures=FALSE,type="abs")
        print(c)
})
## [1] "Variable:  ED"
## $`table for file #1`
## x1
## (15,24] (24,34] (34,44] (44,54] (54,64]     65+     Sum 
##  177684  327581  355344  325660  273009  393715 1852993 
## 
## $`table for file #2`
## x2
## (15,24] (24,34] (34,44] (44,54] (54,64]     65+     Sum 
##  171921  319528  355384  328303  270392  423928 1869456 
## 
## $measures
## NULL
## 
## [1] "Variable:  S"
## $`table for file #1`
## x1
##       H       M     Sum 
##  911419  941573 1852992 
## 
## $`table for file #2`
## x2
##       H       M     Sum 
##  903821  965634 1869455 
## 
## $measures
## NULL
##                   [,1]      [,2]     
## table for file #1 Numeric,7 Numeric,3
## table for file #2 Numeric,7 Numeric,3
## measures          NULL      NULL
#relative values, plotting, with empirical measures (Hellinger's Distance, etc)
sapply(X=1:2, FUN=function(x){
        print(paste('Variable: ',varCom[x]))
        varA <- varB <- varCom[x]
        #relative values, with plot
        c<- compareVar(varA=varA, varB=varB,fileA=fileA,fileB=fileB,wA=wA, wB=wB,plot=TRUE,measures=TRUE,type="rel")
        print(c)
})
## [1] "Variable:  ED"
## ymax not defined: adjusting position using y instead

plot of chunk concordancia

## $`table for file #1`
## x1
## (15,24] (24,34] (34,44] (44,54] (54,64]     65+     Sum 
##    9.59   17.68   19.18   17.57   14.73   21.25  100.00 
## 
## $`table for file #2`
## x2
## (15,24] (24,34] (34,44] (44,54] (54,64]     65+     Sum 
##    9.20   17.09   19.01   17.56   14.46   22.68  100.00 
## 
## $measures
##     tvd overlap   Bhatt    Hell 
## 0.01429 0.98571 0.99983 0.01302 
## 
## [1] "Variable:  S"
## ymax not defined: adjusting position using y instead

plot of chunk concordancia

## $`table for file #1`
## x1
##      H      M    Sum 
##  49.19  50.81 100.00 
## 
## $`table for file #2`
## x2
##      H      M    Sum 
##  48.35  51.65 100.00 
## 
## $measures
##      tvd  overlap    Bhatt     Hell 
## 0.008396 0.991604 0.999965 0.005939
##                   [,1]      [,2]     
## table for file #1 Numeric,7 Numeric,3
## table for file #2 Numeric,7 Numeric,3
## measures          Numeric,4 Numeric,4
#To analyze all variables in varCom list, use sapply(X=1:length(varCom), FUN=...)

Coherence of variables by strata (facets) in separate sources

Función: compareMultivar

Elegimos una variable de muestra: el indicador de Estudiante S/N. Para estudiar todas a la vez se emplearía, como antes, un sapply.

#consultar documentacion ?compareMultivar
var1A <- var1B <- varCom[2] #S estrato
var2A <- var2B <- varCom[4] #variable a estudiar (dependiente)
var3A <- var3B <- varCom[1] #ED  estrato
#absolute values, no measures, no plotting
compareMultivar(var1A=var1A,var1B=var1B,var2A=var2A,var2B=var2B,var3A=var3A,var3B=var3B,fileA=fileA, fileB=fileB, type="abs",measures=FALSE,wA=wA, wB=wB, plot=FALSE)
## $`table for file #1`
##          z1 (15,24] (24,34] (34,44] (44,54] (54,64]    65+
## x1 y1                                                     
## H  FALSE      40566  153508  179269  155768  133761 173095
##    TRUE       51073   13199    2340    7283     896    660
## M  FALSE      36277  149322  164325  158928  136252 218578
##    TRUE       49768   11552    9409    3680    2099   1382
## 
## $`table for file #2`
##          z2 (15,24] (24,34] (34,44] (44,54] (54,64]    65+
## x2 y2                                                     
## H  FALSE      33498  159737  181957  162409  130940 173849
##    TRUE       55454    5136     408     151     143    139
## M  FALSE      24872  144742  170772  164745  138275 249323
##    TRUE       58098    9914    2246     997    1033    617
## 
## $measures
## NULL
#relative values, with measures, plotting
compareMultivar(var1A=var1A,var1B=var1B,var2A=var2A,var2B=var2B,var3A=var3A,var3B=var3B,fileA=fileA, fileB=fileB, type="rel",measures=TRUE,wA=wA, wB=wB, plot=TRUE)

plot of chunk concordancia2var plot of chunk concordancia2var

## $`table for file #1`
##          z1 (15,24] (24,34] (34,44] (44,54] (54,64]   65+
## x1 y1                                                    
## H  FALSE       2.19    8.28    9.67    8.41    7.22  9.34
##    TRUE        2.76    0.71    0.13    0.39    0.05  0.04
## M  FALSE       1.96    8.06    8.87    8.58    7.35 11.80
##    TRUE        2.69    0.62    0.51    0.20    0.11  0.07
## 
## $`table for file #2`
##          z2 (15,24] (24,34] (34,44] (44,54] (54,64]   65+
## x2 y2                                                    
## H  FALSE       1.79    8.54    9.73    8.69    7.00  9.30
##    TRUE        2.97    0.27    0.02    0.01    0.01  0.01
## M  FALSE       1.33    7.74    9.13    8.81    7.40 13.34
##    TRUE        3.11    0.53    0.12    0.05    0.06  0.03
## 
## $measures
##     tvd overlap   Bhatt    Hell 
## 0.03319 0.96681 0.99588 0.06418

Assess predictive value of concordant variables

Function: predictvalue

For each common variable found to be concordant between the files, we assess its predictive value with respect to the chosen specific variables. One variable at a time.

Note that in each assessment, only one of the files is used, i.e. the one that contains the specific variable to predict.

#choose values
varA <- "EST" #variable to assess, name in file B
data <- pra
varw <- "calELE"
#
#table with absolute values, no measures, no plotting
predictvalue(varx=varA, vary=vary, data=data,varw=varw,plot=FALSE,measures=FALSE,type="abs")
## $Table
##        y
## x       Occupied Unemployed (unpaid work) Unemployed (strict)
##   FALSE   934865                    48577               38391
##   TRUE     12506                     3285                   0
##   Sum     947371                    51862               38391
##        y
## x       Non-active (unpaid work, students) Inactive or retired     Sum
##   FALSE                             508594              204692 1735119
##   TRUE                              117930                 615  134336
##   Sum                               626524              205307 1869455
## 
## $Measures
## [1] "measured not requested"
#table with relative values, measures, plotting
predictvalue(varx=varA, vary=vary, data=data,varw=varw,plot=TRUE,measures=TRUE,type="rel")

plot of chunk predictvalue

## $Table
##        y
## x       Occupied Unemployed (unpaid work) Unemployed (strict)
##   FALSE    53.88                     2.80                2.21
##   TRUE      9.31                     2.45                0.00
##   Sum      63.19                     5.25                2.21
##        y
## x       Non-active (unpaid work, students) Inactive or retired    Sum
##   FALSE                              29.31               11.80 100.00
##   TRUE                               87.79                0.46 100.01
##   Sum                               117.10               12.26 200.01
## 
## $Measures
## $Measures$V
##    y.x 
## 0.3219 
## 
## $Measures$lambda
##    y.x 
## 0.1143 
## 
## $Measures$tau
##     y.x 
## 0.05982 
## 
## $Measures$U
##     y.x 
## 0.04586

Function: uncertvarxvary

Otra opción interesante para la selección de variables es el cálculo de bandas de incertidumbre. La idea es seleccionar las variables comunes que más reduzcan la incertidumbre (anchura de las bandas) el relacionar las variables específicas de las encuestas.

data1 <- ecv
data2 <- pra
basedata <- pra
varw1 <- "calELE"
varw2 <- "calELE"
varx <- varEsp[1]
vary <- "PRA22"
varlist <- varCom[1:7] #restricted list of common variables, just to check
varlist
## [1] "ED"  "S"   "TF2" "EST" "OCP" "PAR" "INA"
uncertvarxvary(varx=varx,vary=vary,data1=data1,data2=data2,basedata=basedata,varw1=varw1,varw2=varw2,varlist=varlist)
## $Best
## [1] "|ED+S+TF2+EST+OCP+PAR"
## 
## $NumberVariables
## [1] 6
## 
## $NumberCells
## [1] 288
## 
## $OvUncert
## [1] 0.05638

NOTE Note that warnings are generated because observed distributions are not concordant between the files when crossing so many variables. Uncertainty evaluation might make more sense in specific stratum and with less variables, controlling that a minimum level of coherence between the sources is satisfied.

Etapa 3: Aplicar un método de enlace

Function: nnhdbystrata

Generalmente, el enlace se hará por estratos. En el caso de ECV-PRA hemos usado 12 grupos de edad y sexo, y dentro de cada uno se ha realizado un hot-deck independiente, usando variables distintas cada vez.

Mostramos un ejemplo.

#check documentation ?nnhdbystrata
#select stratum
i <- 1 ##select here: values between 1-12
strata <- as.factor(levels(ecv$EDS))
strata.sel <- strata[i]
strata.sel #this is selected stratum
## [1] H.(15,24]
## 12 Levels: H.(15,24] H.(24,34] H.(34,44] H.(44,54] H.(54,64] ... M.65+
#
#Select variables for this stratum 
# The decision must be grounded on the previous phase
matchvars <- c("EST", "BUSQ")
matchvars
## [1] "EST"  "BUSQ"
#
#donante y receptor
don <- pra[which(pra$EDS  == strata.sel ), ]
rec <- ecv[which(ecv$EDS  == strata.sel ), ]
wA <- wB <- "calELE"
vary <- "PRA22"
#
fused.1 <- nnhdbystrata(rec=rec,don=don,stratalevel=strata.sel,stratavar="EDS",matchvars=matchvars,vary=vary, checkdiffs=TRUE)
##          fused$EST.don
## fused$EST FALSE TRUE
##     FALSE    69    0
##     TRUE      0   88
##           fused$BUSQ.don
## fused$BUSQ FALSE TRUE
##      FALSE   133    0
##      TRUE      2   22
#
#check output (a data frame with fused file)
#str(fused.1)
#names(fused.1) ## ECV con la variable PRA22 adicional
dim(fused.1) #only for strata #1
## [1] 157 418

NOTE #1 Note that the by setting the logical parameter checkdiff to TRUE, a table is generated for each selected matching variable to check for differences between the values in that variable for the recipient-donor files. If all values are placed in the diagonal then all recipient-donor pairs will be equal in that variable.

NOTE #2 In this example, the fused file was only obtained for the first stratum. The usage is to perform hot-deck for all the strata, store results in 12 data frames and then appy rbind to get a fused file with all the rows of the original recipient file.

Etapa 4: Validar resultados

The same functions used for variable selection (concordance assessment) can be used to compare distributions of observed/imputed variables. Thus, these function belong to two families in the package.

Function: compareVar

#compare observed/imputed distributions
varA <- varB <- "PRA22"
wA <- wB <- "calELE"
fileA <- fused.1
fileB <- don
#
compareVar(varA=varA,varB=varB,fileA=fileA,fileB=don,wA=wA,wB=wB,plot=TRUE,type="rel",measures=TRUE)
## ymax not defined: adjusting position using y instead

plot of chunk validateResults

## $`table for file #1`
## x1
##                           Occupied           Unemployed (unpaid work) 
##                              30.47                               3.72 
##                Unemployed (strict) Non-active (unpaid work, students) 
##                               8.42                              52.41 
##                Inactive or retired                                Sum 
##                               4.98                             100.00 
## 
## $`table for file #2`
## x2
##                           Occupied           Unemployed (unpaid work) 
##                              25.22                               1.98 
##                Unemployed (strict) Non-active (unpaid work, students) 
##                               7.63                              60.45 
##                Inactive or retired                                Sum 
##                               4.72                             100.00 
## 
## $measures
##     tvd overlap   Bhatt    Hell 
## 0.08042 0.91958 0.99585 0.06443

Function: compareMultivar

#compare observed/imputed distributions by levels of a second variable
var1A <- var1B <- "PRA22"
var2A <- var2B <- "EST"
wA <- wB <- "calELE"
fileA <- fused.1
fileB <- don
compareMultivar(var1A=var1A,var1B=var1B,var2A=var2A,var2B=var2B,fileA=fileA,fileB=fileB,wA=wA,wB=wB,plot=TRUE,type="rel",measures=TRUE)

plot of chunk unnamed-chunk-1 plot of chunk unnamed-chunk-1

## $`table for file #1`
##                                    y1 FALSE  TRUE
## x1                                               
## Occupied                              27.28  3.19
## Unemployed (unpaid work)               1.45  2.27
## Unemployed (strict)                    8.42  0.00
## Non-active (unpaid work, students)     2.13 50.28
## Inactive or retired                    4.98  0.00
## 
## $`table for file #2`
##                                    y2 FALSE  TRUE
## x2                                               
## Occupied                              22.35  2.87
## Unemployed (unpaid work)               1.32  0.66
## Unemployed (strict)                    7.63  0.00
## Non-active (unpaid work, students)     1.87 58.58
## Inactive or retired                    4.48  0.24
## 
## $measures
##     tvd overlap   Bhatt    Hell 
## 0.08537 0.91463 0.99334 0.08163

Extensions: What if we create a new, specific class for matching?

Class: matchdesign

Define a matchdesign class that encapsulates the type of matching that we want. To this class, we will pass:

rec <- ecv
don <- pra
matchvars <- varCom[-c(1,2)]#Eliminamos ED, S que son de estrato
donvars <- c("PRA22")
recvars <- "SAL"
stratavar <- "ED"
#d1 <- new("matchdesign")
d1 <- new("matchdesign",rec=rec, 
                don=don, 
                matchvars=matchvars, 
                donvars=donvars,
                recvars=recvars,
                stratavar=stratavar
)
class(d1)
## [1] "matchdesign"
## attr(,"package")
## [1] "micromatch"

Dos métodos sencillos

Method: Describe

Method: Compare1

describe nos devuelve los parámetros especificados en matchdesign compare1 realiza un cálculo rápido de distancias de Hellinger para cada variable común especificada en matchdesign, por separado (sin considerar estratos)

describe(d1)
## $`Number of receptor rows:`
## [1] 4749
## 
## $`Number of donor rows:`
## [1] 10865
## 
## $`Common matching variables:`
## [1] "TF2"      "EST"      "OCP"      "PAR"      "INA"      "BUSQ"    
## [7] "DOM.com2"
## 
## $`Specific vars receptor file:`
## [1] "SAL"
## 
## $`Specific vars donor file:`
## [1] "PRA22"
## 
## $`Strata variable:`
## [1] "ED"
compare1(x=d1)
##     varCom    tvd overlap  Bhatt   Hell
## 1      TF2 0.0739  0.9261 0.9943 0.0752
## 2      EST 0.0086  0.9914 0.9999 0.0122
## 3      OCP 0.0081  0.9919      1 0.0057
## 4      PAR  0.006   0.994 0.9999 0.0099
## 5      INA 0.0021  0.9979      1 0.0015
## 6     BUSQ 0.0121  0.9879 0.9997 0.0173
## 7 DOM.com2 0.0648  0.9352 0.9978 0.0466

Probar con otros datos externos

Datos del INE

  1. EPA: Encuesta de poblacion activa Filtramos los mayores de 16 y datos de Euskadi, para poder comparar con PRA
  2. EES: Encuesta de estructura salarial
#cargar datos del INE
#microdatos ya descargados desde la web
library(MicroDatosEs)
setwd("~/Documents/micromatch-ejemplosUso/datosINE")
epa4T2009 <- epa2005( epa.file = "datos_t409" )
#
#Filtrar: 
## datos de Eusadi, mayores de 16
nrow(epa4T2009[which(epa4T2009$ccaa==16 & epa4T2009$nivel==1 ),]) #con 16 anyos o mas
epa <- epa4T2009[which(epa4T2009$ccaa==16 & epa4T2009$nivel==1 ),]
#
setwd("~/Documents/micromatch")