En este documento pretendemos mostrar unos cuantos usos de las funciones ahora mismo programadas en el paquete micromatch. Estas funciones han sido utilizadas para efectuar un ejercicio concreto entre las encuestas ECV y PRA de Eustat; ahora lo que interesa es analizar si estas funciones podrían generalizarse a otros casos.
Los datos de dos encuestas de Eustat, ECV y PRA, se han cargado en el propio paquete micromatch. Está pendiente documentarlas.
Manera (potencial) de usar micromatch con clases y metodos
Ademas de lo anterior, se analiza la posibilidad de generalizar las funciones de micromatch mediante el uso de clases y métodos. Se hace una prueba, ya incorporada en micromatch. La idea es que el usuario pudiera trabajar de la manera siguiente:
La opcion de emplear clases y metodos parece la mejor, no obstante requiere más conocimientos de programación y hay que valorar si merece la pena hacerlo.
En este documento, por tanto:
NOTA
Otros datos similares del INE se pueden obtener a traves del paquete MicroDatosEs. Queda pendiente analizar micromatch con estos datos 'nuevos'.
Las funciones de micromatch se articulan de acuerdo con el esquema de las etapas de un enlace de encuestas (tomado de M. d'Orazio, M. Scanu, etc: proyecto ESSnet sobre Integración)
En micromatch distinguiremos 4 etapas, asociadas a familias de funciones
Especificar objetivos del enlace: variables comunes y específicas, y resultado deseado (fichero sintético, tabla de contingencia cruzando variables específicas…)
Seleccion de variables comunes de enlace (matching variables). De entre todas las variables comunes entre los ficheros nos quedaremos con aquellas que sean concordantes y que tengan utilidad para el matching. Familia: 'Select matching variables'.
Enlace en si: por ejemplo, por un método hot-deck. Familia: 'Apply matching method'
Validación de los resultados. Familia: 'Validate matching results'
Encuestas de Eustat Datos en el mismo micromatch: 1. PRA: Población en relación con la actividad 2. ECV: Encuesta de condiciones de vida
library(micromatch)
## Loading required package: StatMatch
## Loading required package: proxy
##
## Attaching package: 'proxy'
##
## The following objects are masked from 'package:stats':
##
## as.dist, dist
##
## Loading required package: clue
## Loading required package: survey
##
## Attaching package: 'survey'
##
## The following object is masked from 'package:graphics':
##
## dotchart
##
## Loading required package: RANN
## Loading required package: ggplot2
## Loading required package: gridExtra
## Loading required package: grid
## Loading required package: mice
## Loading required package: Rcpp
## Loading required package: lattice
## mice 2.22 2014-06-10
## Loading required package: vcd
data(ecv)
data(pra)
dim(ecv)
## [1] 4749 417
dim(pra)
## [1] 10865 73
Primero definir lo que se quiere hacer:
Listas de variables
#variables comunes candidatas
varCom <- c("ED", #Edad: estrato
"S", #Sexo: estrato
"TF2", #Tamanyo familiar
"EST", #Estudiante si (1) o no (0)
"OCP", #Ocupado si/no
"PAR", #Parado si/no
"INA", #Inactivo si/no
"BUSQ", #Buscando empleo si/no
"DOM.com2") #Dedicacion a las tareas domesticas
#
#variables especificas ECV
varEsp <- c("SAL", #condiciones de salud
"IDM", #conocimiento idiomas
#"DOM", #(la ponemos como variable comun) dedicacion tareas domesticas
"NIN", #cuidado de ninyos
"VAC", #lugar de vacaciones
"LIB",
"RELI",
"RELF",
"EQP",
"VIV",
"VEH",
"SRV",
"AMB",
"DOM2",
"ECO",
"ING",
"FIN" )
#variable especifica de PRA
vary <- "PRA22"
Coherencia de fuentes
Función: compareVar
Univariate comparisons: Compares one variable at a time
Only check first 2 variables.
#consultar documentacion: ?compareVar
#dar valores a los parametros fijos
fileA <- ecv
fileB <- pra
wA <- wB <- "calELE" #variable de peso calibrada
#
# all comparisons for a list of common vars at once
#absolute values, no plotting, no empirical measures
sapply(X=1:2, FUN=function(x){
print(paste('Variable: ',varCom[x]))
varA <- varB <- varCom[x]
c <- compareVar(varA=varA, varB=varB,fileA=fileA,fileB=fileB,wA=wA, wB=wB,plot=FALSE,measures=FALSE,type="abs")
print(c)
})
## [1] "Variable: ED"
## $`table for file #1`
## x1
## (15,24] (24,34] (34,44] (44,54] (54,64] 65+ Sum
## 177684 327581 355344 325660 273009 393715 1852993
##
## $`table for file #2`
## x2
## (15,24] (24,34] (34,44] (44,54] (54,64] 65+ Sum
## 171921 319528 355384 328303 270392 423928 1869456
##
## $measures
## NULL
##
## [1] "Variable: S"
## $`table for file #1`
## x1
## H M Sum
## 911419 941573 1852992
##
## $`table for file #2`
## x2
## H M Sum
## 903821 965634 1869455
##
## $measures
## NULL
## [,1] [,2]
## table for file #1 Numeric,7 Numeric,3
## table for file #2 Numeric,7 Numeric,3
## measures NULL NULL
#relative values, plotting, with empirical measures (Hellinger's Distance, etc)
sapply(X=1:2, FUN=function(x){
print(paste('Variable: ',varCom[x]))
varA <- varB <- varCom[x]
#relative values, with plot
c<- compareVar(varA=varA, varB=varB,fileA=fileA,fileB=fileB,wA=wA, wB=wB,plot=TRUE,measures=TRUE,type="rel")
print(c)
})
## [1] "Variable: ED"
## ymax not defined: adjusting position using y instead
## $`table for file #1`
## x1
## (15,24] (24,34] (34,44] (44,54] (54,64] 65+ Sum
## 9.59 17.68 19.18 17.57 14.73 21.25 100.00
##
## $`table for file #2`
## x2
## (15,24] (24,34] (34,44] (44,54] (54,64] 65+ Sum
## 9.20 17.09 19.01 17.56 14.46 22.68 100.00
##
## $measures
## tvd overlap Bhatt Hell
## 0.01429 0.98571 0.99983 0.01302
##
## [1] "Variable: S"
## ymax not defined: adjusting position using y instead
## $`table for file #1`
## x1
## H M Sum
## 49.19 50.81 100.00
##
## $`table for file #2`
## x2
## H M Sum
## 48.35 51.65 100.00
##
## $measures
## tvd overlap Bhatt Hell
## 0.008396 0.991604 0.999965 0.005939
## [,1] [,2]
## table for file #1 Numeric,7 Numeric,3
## table for file #2 Numeric,7 Numeric,3
## measures Numeric,4 Numeric,4
#To analyze all variables in varCom list, use sapply(X=1:length(varCom), FUN=...)
Coherence of variables by strata (facets) in separate sources
Función: compareMultivar
Elegimos una variable de muestra: el indicador de Estudiante S/N. Para estudiar todas a la vez se emplearía, como antes, un sapply.
#consultar documentacion ?compareMultivar
var1A <- var1B <- varCom[2] #S estrato
var2A <- var2B <- varCom[4] #variable a estudiar (dependiente)
var3A <- var3B <- varCom[1] #ED estrato
#absolute values, no measures, no plotting
compareMultivar(var1A=var1A,var1B=var1B,var2A=var2A,var2B=var2B,var3A=var3A,var3B=var3B,fileA=fileA, fileB=fileB, type="abs",measures=FALSE,wA=wA, wB=wB, plot=FALSE)
## $`table for file #1`
## z1 (15,24] (24,34] (34,44] (44,54] (54,64] 65+
## x1 y1
## H FALSE 40566 153508 179269 155768 133761 173095
## TRUE 51073 13199 2340 7283 896 660
## M FALSE 36277 149322 164325 158928 136252 218578
## TRUE 49768 11552 9409 3680 2099 1382
##
## $`table for file #2`
## z2 (15,24] (24,34] (34,44] (44,54] (54,64] 65+
## x2 y2
## H FALSE 33498 159737 181957 162409 130940 173849
## TRUE 55454 5136 408 151 143 139
## M FALSE 24872 144742 170772 164745 138275 249323
## TRUE 58098 9914 2246 997 1033 617
##
## $measures
## NULL
#relative values, with measures, plotting
compareMultivar(var1A=var1A,var1B=var1B,var2A=var2A,var2B=var2B,var3A=var3A,var3B=var3B,fileA=fileA, fileB=fileB, type="rel",measures=TRUE,wA=wA, wB=wB, plot=TRUE)
## $`table for file #1`
## z1 (15,24] (24,34] (34,44] (44,54] (54,64] 65+
## x1 y1
## H FALSE 2.19 8.28 9.67 8.41 7.22 9.34
## TRUE 2.76 0.71 0.13 0.39 0.05 0.04
## M FALSE 1.96 8.06 8.87 8.58 7.35 11.80
## TRUE 2.69 0.62 0.51 0.20 0.11 0.07
##
## $`table for file #2`
## z2 (15,24] (24,34] (34,44] (44,54] (54,64] 65+
## x2 y2
## H FALSE 1.79 8.54 9.73 8.69 7.00 9.30
## TRUE 2.97 0.27 0.02 0.01 0.01 0.01
## M FALSE 1.33 7.74 9.13 8.81 7.40 13.34
## TRUE 3.11 0.53 0.12 0.05 0.06 0.03
##
## $measures
## tvd overlap Bhatt Hell
## 0.03319 0.96681 0.99588 0.06418
Assess predictive value of concordant variables
Function: predictvalue
For each common variable found to be concordant between the files, we assess its predictive value with respect to the chosen specific variables. One variable at a time.
Note that in each assessment, only one of the files is used, i.e. the one that contains the specific variable to predict.
#choose values
varA <- "EST" #variable to assess, name in file B
data <- pra
varw <- "calELE"
#
#table with absolute values, no measures, no plotting
predictvalue(varx=varA, vary=vary, data=data,varw=varw,plot=FALSE,measures=FALSE,type="abs")
## $Table
## y
## x Occupied Unemployed (unpaid work) Unemployed (strict)
## FALSE 934865 48577 38391
## TRUE 12506 3285 0
## Sum 947371 51862 38391
## y
## x Non-active (unpaid work, students) Inactive or retired Sum
## FALSE 508594 204692 1735119
## TRUE 117930 615 134336
## Sum 626524 205307 1869455
##
## $Measures
## [1] "measured not requested"
#table with relative values, measures, plotting
predictvalue(varx=varA, vary=vary, data=data,varw=varw,plot=TRUE,measures=TRUE,type="rel")
## $Table
## y
## x Occupied Unemployed (unpaid work) Unemployed (strict)
## FALSE 53.88 2.80 2.21
## TRUE 9.31 2.45 0.00
## Sum 63.19 5.25 2.21
## y
## x Non-active (unpaid work, students) Inactive or retired Sum
## FALSE 29.31 11.80 100.00
## TRUE 87.79 0.46 100.01
## Sum 117.10 12.26 200.01
##
## $Measures
## $Measures$V
## y.x
## 0.3219
##
## $Measures$lambda
## y.x
## 0.1143
##
## $Measures$tau
## y.x
## 0.05982
##
## $Measures$U
## y.x
## 0.04586
Function: uncertvarxvary
Otra opción interesante para la selección de variables es el cálculo de bandas de incertidumbre. La idea es seleccionar las variables comunes que más reduzcan la incertidumbre (anchura de las bandas) el relacionar las variables específicas de las encuestas.
data1 <- ecv
data2 <- pra
basedata <- pra
varw1 <- "calELE"
varw2 <- "calELE"
varx <- varEsp[1]
vary <- "PRA22"
varlist <- varCom[1:7] #restricted list of common variables, just to check
varlist
## [1] "ED" "S" "TF2" "EST" "OCP" "PAR" "INA"
uncertvarxvary(varx=varx,vary=vary,data1=data1,data2=data2,basedata=basedata,varw1=varw1,varw2=varw2,varlist=varlist)
## $Best
## [1] "|ED+S+TF2+EST+OCP+PAR"
##
## $NumberVariables
## [1] 6
##
## $NumberCells
## [1] 288
##
## $OvUncert
## [1] 0.05638
NOTE Note that warnings are generated because observed distributions are not concordant between the files when crossing so many variables. Uncertainty evaluation might make more sense in specific stratum and with less variables, controlling that a minimum level of coherence between the sources is satisfied.
Function: nnhdbystrata
Generalmente, el enlace se hará por estratos. En el caso de ECV-PRA hemos usado 12 grupos de edad y sexo, y dentro de cada uno se ha realizado un hot-deck independiente, usando variables distintas cada vez.
Mostramos un ejemplo.
#check documentation ?nnhdbystrata
#select stratum
i <- 1 ##select here: values between 1-12
strata <- as.factor(levels(ecv$EDS))
strata.sel <- strata[i]
strata.sel #this is selected stratum
## [1] H.(15,24]
## 12 Levels: H.(15,24] H.(24,34] H.(34,44] H.(44,54] H.(54,64] ... M.65+
#
#Select variables for this stratum
# The decision must be grounded on the previous phase
matchvars <- c("EST", "BUSQ")
matchvars
## [1] "EST" "BUSQ"
#
#donante y receptor
don <- pra[which(pra$EDS == strata.sel ), ]
rec <- ecv[which(ecv$EDS == strata.sel ), ]
wA <- wB <- "calELE"
vary <- "PRA22"
#
fused.1 <- nnhdbystrata(rec=rec,don=don,stratalevel=strata.sel,stratavar="EDS",matchvars=matchvars,vary=vary, checkdiffs=TRUE)
## fused$EST.don
## fused$EST FALSE TRUE
## FALSE 69 0
## TRUE 0 88
## fused$BUSQ.don
## fused$BUSQ FALSE TRUE
## FALSE 133 0
## TRUE 2 22
#
#check output (a data frame with fused file)
#str(fused.1)
#names(fused.1) ## ECV con la variable PRA22 adicional
dim(fused.1) #only for strata #1
## [1] 157 418
NOTE #1 Note that the by setting the logical parameter checkdiff to TRUE, a table is generated for each selected matching variable to check for differences between the values in that variable for the recipient-donor files. If all values are placed in the diagonal then all recipient-donor pairs will be equal in that variable.
NOTE #2 In this example, the fused file was only obtained for the first stratum. The usage is to perform hot-deck for all the strata, store results in 12 data frames and then appy rbind to get a fused file with all the rows of the original recipient file.
The same functions used for variable selection (concordance assessment) can be used to compare distributions of observed/imputed variables. Thus, these function belong to two families in the package.
Function: compareVar
#compare observed/imputed distributions
varA <- varB <- "PRA22"
wA <- wB <- "calELE"
fileA <- fused.1
fileB <- don
#
compareVar(varA=varA,varB=varB,fileA=fileA,fileB=don,wA=wA,wB=wB,plot=TRUE,type="rel",measures=TRUE)
## ymax not defined: adjusting position using y instead
## $`table for file #1`
## x1
## Occupied Unemployed (unpaid work)
## 30.47 3.72
## Unemployed (strict) Non-active (unpaid work, students)
## 8.42 52.41
## Inactive or retired Sum
## 4.98 100.00
##
## $`table for file #2`
## x2
## Occupied Unemployed (unpaid work)
## 25.22 1.98
## Unemployed (strict) Non-active (unpaid work, students)
## 7.63 60.45
## Inactive or retired Sum
## 4.72 100.00
##
## $measures
## tvd overlap Bhatt Hell
## 0.08042 0.91958 0.99585 0.06443
Function: compareMultivar
#compare observed/imputed distributions by levels of a second variable
var1A <- var1B <- "PRA22"
var2A <- var2B <- "EST"
wA <- wB <- "calELE"
fileA <- fused.1
fileB <- don
compareMultivar(var1A=var1A,var1B=var1B,var2A=var2A,var2B=var2B,fileA=fileA,fileB=fileB,wA=wA,wB=wB,plot=TRUE,type="rel",measures=TRUE)
## $`table for file #1`
## y1 FALSE TRUE
## x1
## Occupied 27.28 3.19
## Unemployed (unpaid work) 1.45 2.27
## Unemployed (strict) 8.42 0.00
## Non-active (unpaid work, students) 2.13 50.28
## Inactive or retired 4.98 0.00
##
## $`table for file #2`
## y2 FALSE TRUE
## x2
## Occupied 22.35 2.87
## Unemployed (unpaid work) 1.32 0.66
## Unemployed (strict) 7.63 0.00
## Non-active (unpaid work, students) 1.87 58.58
## Inactive or retired 4.48 0.24
##
## $measures
## tvd overlap Bhatt Hell
## 0.08537 0.91463 0.99334 0.08163
Class: matchdesign
Define a matchdesign class that encapsulates the type of matching that we want. To this class, we will pass:
rec <- ecv
don <- pra
matchvars <- varCom[-c(1,2)]#Eliminamos ED, S que son de estrato
donvars <- c("PRA22")
recvars <- "SAL"
stratavar <- "ED"
#d1 <- new("matchdesign")
d1 <- new("matchdesign",rec=rec,
don=don,
matchvars=matchvars,
donvars=donvars,
recvars=recvars,
stratavar=stratavar
)
class(d1)
## [1] "matchdesign"
## attr(,"package")
## [1] "micromatch"
Dos métodos sencillos
Method: Describe
Method: Compare1
describe nos devuelve los parámetros especificados en matchdesign compare1 realiza un cálculo rápido de distancias de Hellinger para cada variable común especificada en matchdesign, por separado (sin considerar estratos)
describe(d1)
## $`Number of receptor rows:`
## [1] 4749
##
## $`Number of donor rows:`
## [1] 10865
##
## $`Common matching variables:`
## [1] "TF2" "EST" "OCP" "PAR" "INA" "BUSQ"
## [7] "DOM.com2"
##
## $`Specific vars receptor file:`
## [1] "SAL"
##
## $`Specific vars donor file:`
## [1] "PRA22"
##
## $`Strata variable:`
## [1] "ED"
compare1(x=d1)
## varCom tvd overlap Bhatt Hell
## 1 TF2 0.0739 0.9261 0.9943 0.0752
## 2 EST 0.0086 0.9914 0.9999 0.0122
## 3 OCP 0.0081 0.9919 1 0.0057
## 4 PAR 0.006 0.994 0.9999 0.0099
## 5 INA 0.0021 0.9979 1 0.0015
## 6 BUSQ 0.0121 0.9879 0.9997 0.0173
## 7 DOM.com2 0.0648 0.9352 0.9978 0.0466
Datos del INE
#cargar datos del INE
#microdatos ya descargados desde la web
library(MicroDatosEs)
setwd("~/Documents/micromatch-ejemplosUso/datosINE")
epa4T2009 <- epa2005( epa.file = "datos_t409" )
#
#Filtrar:
## datos de Eusadi, mayores de 16
nrow(epa4T2009[which(epa4T2009$ccaa==16 & epa4T2009$nivel==1 ),]) #con 16 anyos o mas
epa <- epa4T2009[which(epa4T2009$ccaa==16 & epa4T2009$nivel==1 ),]
#
setwd("~/Documents/micromatch")