COMPLEX SAMPLES WITH STATA

Julio César Martínez Sánchez

jcms2665@gmail.com


"Tests of statistical significance are important because data from a probability sample survey are often analyzed as if they came from a census."

Contents

  0. Load and filter the data

  1. Define the sampling design
    • 1.1. Totals
    • 1.2. Means
    • 1.3. Proportions
  2. Subpopulations (problems)

  3. Hypothesis tests
    • 3.1. Estimation 215
    • 3.2. Estimation 214
  4. Regression models
    • 4.1. Simple random sampling
    • 4.2. Stratified and cluster sampling
    • 4.3. Comparison between models



0. Load and filter the data


To begin the analysis, filter the valid cases, that is, usual residents with a completed interview who are within the age range. For more detail, see Conociendo la base de datos de la ENOE.
use "C:\Users\JC\Desktop\Estadística\BUAP\sdemt215.dta", clear
gen filtro=((c_res==1 | c_res==3) & r_def==0 & (eda>=15 & eda<=98))
tab filtro [fw=fac], m
     filtro |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 | 34,673,635       28.22       28.22
          1 | 88,192,253       71.78      100.00
------------+-----------------------------------
      Total |122,865,888      100.00


1. Define the sampling design


To analyze survey data, the sampling design must first be declared. This is done with svyset; estimation commands are then run with the svy prefix, whose full documentation is in the STATA SURVEY DATA REFERENCE MANUAL. To identify the design variables, consult the ENOE file description.
svyset upm [pw=fac], strata(est_d) vce(linearized)
      pweight: fac
          VCE: linearized
  Single unit: missing
     Strata 1: est_d
         SU 1: upm
        FPC 1: 
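
Before estimating anything, it can help to confirm what was declared. A minimal sketch, assuming the svyset call above has already been run on the data in memory; both commands are built into Stata and their output is not reproduced here:

```stata
* Show the survey settings currently declared for the data in memory
svyset

* Summarize the strata and sampling units under those settings
svydescribe
```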

1.1. Totals


svy, subpop(filtro): tab clase2, format(%11.3g) count se cv ci level(90)

Number of strata   =       446                 Number of obs      =     403865
Number of PSUs     =     18440                 Population size    =  122865888
                                               Subpop. no. of obs =     291231
                                               Subpop. size       =   88192253
                                               Design df          =      17994

----------------------------------------------------------------------
   clase2 |      count          se          cv          lb          ub
----------+-----------------------------------------------------------
        0 |          0           0                       0           0
        1 |   50336088      212129        .421    49987150    50685026
        2 |    2287633       43791        1.91     2215599     2359667
        3 |    5884296       88324         1.5     5739009     6029583
        4 |   29684236      175752        .592    29395134    29973338
          | 
    Total |   88192253                                                
----------------------------------------------------------------------
  Key:  count     =  weighted counts
        se        =  linearized standard errors of weighted counts
        cv        =  coefficients of variation of weighted counts
        lb        =  lower 90% confidence bounds for weighted counts
        ub        =  upper 90% confidence bounds for weighted counts

  Table contains a zero in the marginals.
  Statistics cannot be computed.
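
The cv column in the table above is the standard error expressed as a percentage of the estimate. As a worked check against the first two rows (the notation is ours, not Stata's):

```latex
\[
\mathrm{CV}(\hat{T}) = 100 \times \frac{\mathrm{se}(\hat{T})}{\hat{T}},
\qquad
100 \times \frac{212129}{50336088} \approx 0.421,
\qquad
100 \times \frac{43791}{2287633} \approx 1.91
\]
```

Both values match the table, so a small cv here simply means that the standard error is small relative to the weighted count.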

1.2. Means


gen int f2=((c_res==1 | c_res==3) & r_def==0 & (eda>=15 & eda<=97))
svy, subpop (f2): mean eda if (clase1==1), level(90)
estat cv
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =     446        Number of obs    =    178640
Number of PSUs   =   18416        Population size  =  53111321
                                  Subpop. no. obs  =    177130
                                  Subpop. size     =  52592728
                                  Design df        =     17970

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [90% Conf. Interval]
-------------+------------------------------------------------
         eda |   38.87578   .0576619      38.78093    38.97063
--------------------------------------------------------------


------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     CV (%)
-------------+----------------------------------
         eda |   38.87578   .0576619     .148324
------------------------------------------------

1.3. Proportions


svy, subpop (filtro):prop clase1, level(90)
estat cv

Survey: Proportion estimation

Number of strata =     446       Number of obs    =     403865
Number of PSUs   =   18440       Population size  =  122865888
                                 Subpop. no. obs  =     291231
                                 Subpop. size     =   88192253
                                 Design df        =      17994

--------------------------------------------------------------
             |             Linearized
             | Proportion   Std. Err.     [90% Conf. Interval]
-------------+------------------------------------------------
clase1       |
           0 |          .  (no observations)
           1 |   .5966932   .0014983      .5942287    .5991578
           2 |   .4033068   .0014983      .4008422    .4057713
--------------------------------------------------------------


------------------------------------------------
             |             Linearized
             | Proportion   Std. Err.     CV (%)
-------------+----------------------------------
clase1       |
           0 |  (omitted)                                                     
           1 |   .5966932   .0014983     .251098
           2 |   .4033068   .0014983       .3715
------------------------------------------------


2. Subpopulations (problems)


Problems can arise when very small populations are analyzed. In particular, when the coefficient of variation is computed, Stata may display the following message: Note: missing standard errors because of stratum with single sampling unit

gen ti=((c_res==1 | c_res==3) & r_def==0 & (eda>=12 & eda<15) & clase2==1)
tab ti [fw=fac]
svy, subpop (ti): tab rama if (sex==2 & eda==14), format(%11.3g) count se cv ci level(90)
         ti |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |122,390,514       99.61       99.61
          1 |    475,374        0.39      100.00
------------+-----------------------------------
      Total |122,865,888      100.00

(running tabulate on estimation sample)

Number of strata   =       120                  Number of obs      =      1823
Number of PSUs     =      1551                  Population size    =    539903
                                                Subpop. no. of obs =       175
                                                Subpop. size       =     54411
                                                Design df          =      1431

----------------------
     rama |      count
----------+-----------
        0 |          0
        2 |       8916
        3 |      21301
        4 |      17608
        6 |       5109
        7 |       1477
          | 
    Total |      54411
----------------------
  Key:  count     =  weighted counts

  Table contains a zero in the marginals.
  Statistics cannot be computed.

Note: 284 strata omitted because they contain no subpopulation members.
Note: missing standard errors because of stratum with single sampling unit.
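
To see which strata trigger the note, svydescribe can list only those with a single sampling unit. A minimal sketch, assuming the design declared in section 1 is still in effect; the if qualifier mirrors the one used in the tabulation above, and the output is not reproduced here:

```stata
* List only the strata left with a single sampling unit
* once the sample is restricted to women aged 14
svydescribe if (sex==2 & eda==14), single
```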


The solution is to create "pseudo-strata". For this, the singleunit() option of svyset offers several methods: missing (the default), certainty, scaled, and centered.

svyset, clear
svyset upm [pw=fac], strata(est_d) vce(linearized) single(sca)
svy, subpop (ti): tab rama if (sex==2 & eda==14), format(%11.3g) count se cv ci level(90)
      pweight: fac
          VCE: linearized
  Single unit: scaled
     Strata 1: est_d
         SU 1: upm
        FPC 1: 

(running tabulate on estimation sample)

Number of strata   =       120                  Number of obs      =      1823
Number of PSUs     =      1551                  Population size    =    539903
                                                Subpop. no. of obs =       175
                                                Subpop. size       =     54411
                                                Design df          =      1431

----------------------------------------------------------------------
     rama |      count          se          cv          lb          ub
----------+-----------------------------------------------------------
        0 |          0           0                       0           0
        2 |       8916        2178        24.4        5331       12501
        3 |      21301        3759        17.6       15114       27488
        4 |      17608        3503        19.9       11842       23374
        6 |       5109        1653        32.3        2389        7829
        7 |       1477         976        66.1        -130        3084
          | 
    Total |      54411                                                
----------------------------------------------------------------------
  Key:  count     =  weighted counts
        se        =  linearized standard errors of weighted counts
        cv        =  coefficients of variation of weighted counts
        lb        =  lower 90% confidence bounds for weighted counts
        ub        =  upper 90% confidence bounds for weighted counts

  Table contains a zero in the marginals.
  Statistics cannot be computed.

Note: 284 strata omitted because they contain no subpopulation members.
Note: variance scaled to handle strata with a single sampling unit.
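
An alternative worth trying is to drop the if qualifier and put every condition in the subpopulation indicator. With if, observations outside the condition are removed before the variance is computed, which can leave strata with a single unit. With subpop() alone, the full design is kept. The full-sample runs in section 1 produced standard errors under the default setting, which suggests the design itself has no single-unit strata. A minimal sketch; ti_f14 is a hypothetical variable name introduced here, not part of the original:

```stata
* Hypothetical indicator: employed 12-14 year olds (ti) who are women aged 14
gen byte ti_f14 = (ti==1 & sex==2 & eda==14)

* Same tabulation, with all conditions inside subpop() and no -if-
svy, subpop(ti_f14): tab rama, format(%11.3g) count se cv ci level(90)
```

Even if standard errors are obtained this way, the subpopulation still rests on only 175 observations, so large coefficients of variation should be expected.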


3. Hypothesis tests


When survey estimates are compared, it is necessary to check whether a change in the estimates reflects a change in the phenomenon itself or is due only to the samples used to measure it. This procedure generally requires independent samples; however, the ENOE is a panel survey, which means that 20% of the sample is shared from one year to the next. For this reason, the expansion factors must be adjusted; the estimates below use the adjusted weight peso instead of fac.

3.1. Estimation 215


svyset, clear
svyset upm [pw=peso], strata(est_d) vce(linearized) single(sca)
svy, subpop(filtro): tab clase2, format(%11.3g) count se cv ci level(90)
      pweight: peso
          VCE: linearized
  Single unit: scaled
     Strata 1: est_d
         SU 1: upm
        FPC 1: 

(running tabulate on estimation sample)

Number of strata   =       446                 Number of obs      =     403865
Number of PSUs     =     18440                 Population size    =  122742516
                                               Subpop. no. of obs =     233328
                                               Subpop. size       =   88214725
                                               Design df          =      17994

----------------------------------------------------------------------
   clase2 |      count          se          cv          lb          ub
----------+-----------------------------------------------------------
        0 |          0           0                       0           0
        1 |   50473590      437163        .866    49754484    51192696
        2 |    2323904       52983        2.28     2236751     2411057
        3 |    5809150      109485        1.88     5629053     5989247
        4 |   29608081      300994        1.02    29112964    30103198
          | 
    Total |   88214725                                                
----------------------------------------------------------------------
  Key:  count     =  weighted counts
        se        =  linearized standard errors of weighted counts
        cv        =  coefficients of variation of weighted counts
        lb        =  lower 90% confidence bounds for weighted counts
        ub        =  upper 90% confidence bounds for weighted counts

  Table contains a zero in the marginals.
  Statistics cannot be computed.

3.2. Estimation 214


svyset, clear
use "C:\Users\JC\Desktop\Estadística\BUAP\sdemt214.dta", clear
svyset upm [pw=peso], strata(est_d) vce(linearized) single(sca)
gen filtro=((c_res==1 | c_res==3) & r_def==0 & (eda>=15 & eda<=98))
svy, subpop(filtro): tab clase2, format(%11.3g) count se cv ci level(90)
      pweight: peso
          VCE: linearized
  Single unit: scaled
     Strata 1: est_d
         SU 1: upm
        FPC 1: 


(running tabulate on estimation sample)

Number of strata   =       446                 Number of obs      =     406088
Number of PSUs     =     18438                 Population size    =  122211594
                                               Subpop. no. of obs =     231470
                                               Subpop. size       =   86670766
                                               Design df          =      17992

----------------------------------------------------------------------
   clase2 |      count          se          cv          lb          ub
----------+-----------------------------------------------------------
        0 |          0           0                       0           0
        1 |   49178167      434500        .884    48463441    49892893
        2 |    2502742       54858        2.19     2412504     2592980
        3 |    5798316      109008        1.88     5619005     5977627
        4 |   29191541      300429        1.03    28697355    29685727
          | 
    Total |   86670766                                                
----------------------------------------------------------------------
  Key:  count     =  weighted counts
        se        =  linearized standard errors of weighted counts
        cv        =  coefficients of variation of weighted counts
        lb        =  lower 90% confidence bounds for weighted counts
        ub        =  upper 90% confidence bounds for weighted counts

  Table contains a zero in the marginals.
  Statistics cannot be computed.
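
With both tables in hand, the comparison itself can be sketched as a z test on the difference between two totals. This is a minimal sketch, not the author's procedure: it assumes the two estimates can be treated as independent once the adjusted weights are used, and it takes the counts and standard errors for clase2 == 1 from the 215 and 214 tables. The scalar names are hypothetical.

```stata
* Weighted counts and linearized standard errors for clase2 == 1,
* copied from the 215 and 214 tables
scalar t_215  = 50473590
scalar se_215 = 437163
scalar t_214  = 49178167
scalar se_214 = 434500

* z statistic for the difference, ASSUMING independent estimates
scalar z_dif = (t_215 - t_214) / sqrt(se_215^2 + se_214^2)

* Two-sided p-value from the standard normal distribution
scalar p_dif = 2 * (1 - normal(abs(z_dif)))

display "z = " %6.3f z_dif "   p = " %6.4f p_dif
```

With these figures z is roughly 2.1, above the 1.645 critical value for a two-sided test at the 90% level used throughout. Under the independence assumption, the increase in clase2 == 1 would therefore be judged significant.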


4. Regression models


svyset, clear
use "C:\Users\JC\Desktop\Estadística\BUAP\sdemt215.dta", clear
svyset upm [pw=fac], strata(est_d) vce(linearized) single(sca)
      pweight: fac
          VCE: linearized
  Single unit: scaled
     Strata 1: est_d
         SU 1: upm
        FPC 1: 


Categorical variables

gen int ocupado=(clase2==1)
gen int sexo = round(sex)
gen int nivel = round(niv_ins)
gen int econ = round(e_con)
gen int edad7 = round(eda7c)
(7072 missing values generated)


(89944 missing values generated)

4.1. Simple random sampling


logit ocupado anios_esc i.edad7 i.sexo i.econ    if ((c_res==1 | c_res==3) & r_def==0 & (eda>=15 & eda<=97))
logit, or
estimates store modelo_1

Iteration 0:   log likelihood = -197836.53  
Iteration 1:   log likelihood = -160326.33  
Iteration 2:   log likelihood = -159937.61  
Iteration 3:   log likelihood = -159936.41  
Iteration 4:   log likelihood = -159936.41  

Logistic regression                               Number of obs   =     291072
                                                  LR chi2(13)     =   75800.24
                                                  Prob > chi2     =     0.0000
Log likelihood = -159936.41                       Pseudo R2       =     0.1916

------------------------------------------------------------------------------
     ocupado |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   anios_esc |   .0319044   .0009674    32.98   0.000     .0300084    .0338004
             |
       edad7 |
          2  |   1.681314   .0163315   102.95   0.000     1.649305    1.713323
          3  |   2.373402   .0186586   127.20   0.000     2.336832    2.409973
          4  |   2.448655   .0193802   126.35   0.000     2.410671     2.48664
          5  |   1.964312   .0199039    98.69   0.000     1.925301    2.003323
          6  |   .5258125    .020226    26.00   0.000     .4861703    .5654547
             |
      2.sexo |  -1.616895   .0093264  -173.37   0.000    -1.635175   -1.598616
             |
        econ |
          2  |    .527701   .0258567    20.41   0.000     .4770228    .5783793
          3  |   .5518173   .0371811    14.84   0.000     .4789438    .6246908
          4  |  -.1593692   .0249566    -6.39   0.000    -.2082832   -.1104553
          5  |     -.2012   .0137958   -14.58   0.000    -.2282392   -.1741608
          6  |  -.0908275    .014951    -6.08   0.000    -.1201309   -.0615241
          9  |  -1.845994     .58981    -3.13   0.002    -3.002001   -.6899878
             |
       _cons |  -.5598089   .0201312   -27.81   0.000    -.5992653   -.5203525
------------------------------------------------------------------------------


Logistic regression                               Number of obs   =     291072
                                                  LR chi2(13)     =   75800.24
                                                  Prob > chi2     =     0.0000
Log likelihood = -159936.41                       Pseudo R2       =     0.1916

------------------------------------------------------------------------------
     ocupado | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   anios_esc |   1.032419   .0009987    32.98   0.000     1.030463    1.034378
             |
       edad7 |
          2  |   5.372611   .0877427   102.95   0.000     5.203362    5.547366
          3  |   10.73385   .2002787   127.20   0.000      10.3484    11.13366
          4  |   11.57277   .2242831   126.35   0.000     11.14143    12.02082
          5  |   7.130003   .1419151    98.69   0.000     6.857211    7.413649
          6  |   1.691833    .034219    26.00   0.000     1.626077    1.760248
             |
      2.sexo |   .1985141   .0018514  -173.37   0.000     .1949183    .2021762
             |
        econ |
          2  |   1.695031    .043828    20.41   0.000      1.61127    1.783146
          3  |   1.736406   .0645614    14.84   0.000     1.614368    1.867668
          4  |   .8526815     .02128    -6.39   0.000      .811977    .8954264
          5  |   .8177489   .0112815   -14.58   0.000     .7959338    .8401618
          6  |   .9131752   .0136529    -6.08   0.000     .8868043    .9403303
          9  |   .1578683   .0931123    -3.13   0.002     .0496876    .5015822
             |
       _cons |   .5713182   .0115013   -27.81   0.000      .549215     .594311
------------------------------------------------------------------------------
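
The second table is the first one on the exponential scale. As a worked check for anios_esc (the notation is ours): the odds ratio is the exponential of the coefficient, and its standard error is approximately the odds ratio times the standard error of the coefficient.

```latex
\[
\mathrm{OR} = e^{b} = e^{0.0319044} \approx 1.0324,
\qquad
\mathrm{se}(\mathrm{OR}) \approx e^{b}\,\mathrm{se}(b) = 1.032419 \times 0.0009674 \approx 0.000999
\]
```

The confidence limits are the exponentiated limits of the coefficient interval, which is why the z statistic and p-value are identical in both tables.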

4.2. Stratified and cluster sampling


gen f3=((c_res==1 | c_res==3) & r_def==0 & (eda>=15 & eda<=97))
svy, subpop(f3): logit ocupado anios_esc i.edad7 i.sexo i.econ 
logit, or
estimates store modelo_2

(running logit on estimation sample)

Survey: Logistic regression

Number of strata   =       446                 Number of obs      =     403865
Number of PSUs     =     18440                 Population size    =  122865888
                                               Subpop. no. of obs =     291072
                                               Subpop. size       =   88140066
                                               Design df          =      17994
                                               F(  13,  17982)    =    1695.87
                                               Prob > F           =     0.0000

------------------------------------------------------------------------------
             |             Linearized
     ocupado |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   anios_esc |    .028724    .002202    13.04   0.000     .0244079      .03304
             |
       edad7 |
          2  |   1.675739   .0275872    60.74   0.000     1.621666    1.729813
          3  |   2.347418    .030226    77.66   0.000     2.288172    2.406663
          4  |   2.424468   .0311733    77.77   0.000     2.363365     2.48557
          5  |   2.007155   .0329516    60.91   0.000     1.942567    2.071743
          6  |   .6043582   .0346322    17.45   0.000     .5364758    .6722406
             |
      2.sexo |  -1.752686   .0152963  -114.58   0.000    -1.782668   -1.722704
             |
        econ |
          2  |   .5827273   .0384924    15.14   0.000     .5072785     .658176
          3  |   .5755434   .0588421     9.78   0.000     .4602072    .6908797
          4  |  -.1326115   .0362316    -3.66   0.000     -.203629   -.0615941
          5  |  -.1801903   .0196649    -9.16   0.000    -.2187353   -.1416453
          6  |  -.0166378    .023152    -0.72   0.472     -.062018    .0287424
          9  |  -1.815822   1.500444    -1.21   0.226    -4.756835    1.125192
             |
       _cons |    -.52711   .0347248   -15.18   0.000    -.5951739   -.4590461
------------------------------------------------------------------------------


Survey: Logistic regression

Number of strata   =       446                 Number of obs      =     403865
Number of PSUs     =     18440                 Population size    =  122865888
                                               Subpop. no. of obs =     291072
                                               Subpop. size       =   88140066
                                               Design df          =      17994
                                               F(  13,  17982)    =    1695.87
                                               Prob > F           =     0.0000

------------------------------------------------------------------------------
             |             Linearized
     ocupado | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   anios_esc |    1.02914   .0022661    13.04   0.000     1.024708    1.033592
             |
       edad7 |
          2  |   5.342743   .1473915    60.74   0.000     5.061514    5.639598
          3  |   10.45853   .3161192    77.66   0.000     9.856901    11.09687
          4  |   11.29622   .3521401    77.77   0.000     10.62665    12.00797
          5  |   7.442114   .2452298    60.91   0.000     6.976634    7.938651
          6  |   1.830077   .0633795    17.45   0.000      1.70997    1.958621
             |
      2.sexo |   .1733078    .002651  -114.58   0.000     .1681888    .1785826
             |
        econ |
          2  |   1.790916   .0689366    15.14   0.000     1.660765    1.931267
          3  |   1.778097    .104627     9.78   0.000     1.584402     1.99547
          4  |   .8758052   .0317318    -3.66   0.000      .815765    .9402645
          5  |   .8351112   .0164223    -9.16   0.000     .8035344     .867929
          6  |   .9834998     .02277    -0.72   0.472     .9398659    1.029159
          9  |   .1627041   .2441284    -1.21   0.226     .0085928    3.080807
             |
       _cons |   .5903085   .0204983   -15.18   0.000     .5514667    .6318861
------------------------------------------------------------------------------
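
One way to quantify how much the design inflates the variance of each coefficient is the design effect. A minimal sketch, assuming modelo_2 was stored as above and the svyset declaration is unchanged; the output is not reproduced here:

```stata
* Make the design-based model the active estimation result
estimates restore modelo_2

* Design effects (DEFF) and their square roots (DEFT) for each coefficient
estat effects, deff deft
```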

4.3. Comparison between models


estimates table modelo_1 modelo_2, b(%6.2f) star(0.05 0.01 .001) eform
    Variable |  modelo_1     modelo_2   
-------------+--------------------------
   anios_esc |    1.03***      1.03***  
             |
       edad7 |
          2  |    5.37***      5.34***  
          3  |   10.73***     10.46***  
          4  |   11.57***     11.30***  
          5  |    7.13***      7.44***  
          6  |    1.69***      1.83***  
             |
        sexo |
          2  |    0.20***      0.17***  
             |
        econ |
          2  |    1.70***      1.79***  
          3  |    1.74***      1.78***  
          4  |    0.85***      0.88***  
          5  |    0.82***      0.84***  
          6  |    0.91***      0.98     
          9  |    0.16**       0.16     
             |
       _cons |    0.57***      0.59***  
----------------------------------------
   legend: * p<.05; ** p<.01; *** p<.001
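
The odds ratios are close in both columns, but econ 6 and 9 lose significance in modelo_2. That change comes mainly from the standard errors: for anios_esc, for example, the standard error goes from .0009674 in modelo_1 to .002202 in modelo_2. Note that modelo_1 is also unweighted, so the two columns differ in weighting as well as in variance estimation. A minimal sketch to put coefficients and standard errors side by side; the stars are left out because estimates table does not combine star() with se:

```stata
* Coefficients (log-odds) with standard errors underneath, one column per model
estimates table modelo_1 modelo_2, b(%9.4f) se(%9.4f)
```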