In this session, we will learn how to use the ENOE (Encuesta Nacional de Ocupación y Empleo), INEGI's main survey on employment and occupations.
This survey is the one you will have to use for your final project.
setwd("~/Downloads/enoe_n_2021_trim1_csv")
sdem<-read.csv("ENOEN_SDEMT121.csv")
Remember, my advice is to always start with a basic analysis of the dataset. We can use the names() function to explore the variables in our dataset:
names(sdem)
## [1] "r_def" "loc" "mun" "est" "est_d_tri"
## [6] "est_d_men" "ageb" "t_loc_tri" "t_loc_men" "cd_a"
## [11] "ent" "con" "upm" "d_sem" "n_pro_viv"
## [16] "v_sel" "n_hog" "h_mud" "n_ent" "per"
## [21] "n_ren" "c_res" "par_c" "sex" "eda"
## [26] "nac_dia" "nac_mes" "nac_anio" "l_nac_c" "cs_p12"
## [31] "cs_p13_1" "cs_p13_2" "cs_p14_c" "cs_p15" "cs_p16"
## [36] "cs_p17" "n_hij" "e_con" "cs_ad_mot" "cs_p20_des"
## [41] "cs_ad_des" "cs_nr_mot" "cs_p22_des" "cs_nr_ori" "ur"
## [46] "zona" "salario" "fac_tri" "fac_men" "clase1"
## [51] "clase2" "clase3" "pos_ocu" "seg_soc" "rama"
## [56] "c_ocu11c" "ing7c" "dur9c" "emple7c" "medica5c"
## [61] "buscar5c" "rama_est1" "rama_est2" "dur_est" "ambito1"
## [66] "ambito2" "tue1" "tue2" "tue3" "busqueda"
## [71] "d_ant_lab" "d_cexp_est" "dur_des" "sub_o" "s_clasifi"
## [76] "remune2c" "pre_asa" "tip_con" "dispo" "nodispo"
## [81] "c_inac5c" "pnea_est" "niv_ins" "eda5c" "eda7c"
## [86] "eda12c" "eda19c" "hij5c" "domestico" "anios_esc"
## [91] "hrsocup" "ingocup" "ing_x_hrs" "tpg_p8a" "tcco"
## [96] "cp_anoc" "imssissste" "ma48me1sm" "p14apoyos" "scian"
## [101] "t_tra" "emp_ppal" "tue_ppal" "trans_ppal" "mh_fil2"
## [106] "mh_col" "sec_ins" "tipo" "mes_cal" "ca"
But what is the meaning of all these variables? INEGI is really good at providing documentation. We have several options, but the main things we should look at are:
- File descriptor
- Questionnaire
- Additional resources
If we go to the file descriptor, we will find that the dataset is split into different files. The one I opened is the sociodemographic file (SDEM). It contains many of the basic variables of interest.
The first variables we see are related to the methodology and are used to control for the survey design (we will talk about this later), but we will also find the variables that allow us to merge the different files, as in the sketch below.
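For example, if later in the project you need variables from the questionnaire files (the COE tables), a merge along these lines should work. This is only a sketch: the COE file name and the key columns are my reading of the file descriptor, so verify them there before using it.
# Hypothetical sketch: merge the sociodemographic table with the first
# questionnaire table (COE1); check the file descriptor for the exact keys
coe1 <- read.csv("ENOEN_COE1T121.csv")
keys <- c("cd_a", "ent", "con", "v_sel", "n_hog", "h_mud", "n_ren")
sdem_coe1 <- merge(sdem, coe1, by = keys)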
We can start with some basic procedures:
summary(sdem$eda)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 16.00 31.00 33.57 49.00 99.00 6876
table(sdem$sex)
##
## 1 2
## 165625 178227
table(sdem$eda, sdem$sex)
##
## 1 2
## 0 1658 1702
## 1 2053 1994
## 2 2402 2211
## 3 2444 2415
## 4 2577 2544
## 5 2676 2532
## 6 2744 2683
## 7 2814 2671
## 8 2873 2807
## 9 2964 2765
## 10 2994 2880
## 11 2991 2906
## 12 3125 2850
## 13 3043 2840
## 14 3008 2858
## 15 3113 2919
## 16 3028 2998
## 17 2982 2901
## 18 3102 3057
## 19 3088 2935
## 20 3069 3054
## 21 2916 2920
## 22 2851 2841
## 23 2729 2782
## 24 2685 2699
## 25 2673 2783
## 26 2548 2832
## 27 2497 2638
## 28 2551 2666
## 29 2288 2436
## 30 2679 2867
## 31 2273 2533
## 32 2206 2544
## 33 2285 2530
## 34 2160 2425
## 35 2215 2615
## 36 2275 2522
## 37 2039 2391
## 38 2212 2447
## 39 2133 2461
## 40 2413 2766
## 41 2265 2413
## 42 2068 2445
## 43 2172 2538
## 44 1975 2331
## 45 2211 2509
## 46 2275 2507
## 47 2080 2443
## 48 2111 2469
## 49 2036 2451
## 50 2280 2545
## 51 2033 2267
## 52 1849 2248
## 53 1823 2198
## 54 1807 2043
## 55 1809 2102
## 56 1738 2096
## 57 1586 1927
## 58 1506 1767
## 59 1391 1728
## 60 1678 1992
## 61 1482 1736
## 62 1255 1432
## 63 1330 1596
## 64 1257 1439
## 65 1230 1443
## 66 1144 1324
## 67 936 1132
## 68 969 1199
## 69 834 1038
## 70 931 1107
## 71 766 953
## 72 714 864
## 73 675 906
## 74 595 715
## 75 661 818
## 76 594 707
## 77 454 581
## 78 505 590
## 79 399 474
## 80 470 527
## 81 337 436
## 82 243 337
## 83 249 341
## 84 225 337
## 85 240 309
## 86 179 262
## 87 146 234
## 88 126 184
## 89 85 160
## 90 97 166
## 91 62 105
## 92 50 72
## 93 62 82
## 94 42 59
## 95 26 55
## 96 20 48
## 97 56 109
## 98 90 97
## 99 20 14
Now, as I told you yesterday, this is the survey we use to measure labor force participation. This information is in the “clase1” variable:
table(sdem$clase1)
##
## 0 1 2
## 68567 155794 126367
Go to the file descriptor and find out the meaning of 0, 1, 2…
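Once you have checked the descriptor, you could attach labels the same way we will do later with other variables. As a sketch (my reading: 1 is the economically active population, PEA, and 2 the non-economically active, PNEA; look up the meaning of 0 yourself):
# Sketch: label clase1 in a new column, so the numeric version stays available
sdem$clase1_lab <- factor(sdem$clase1, levels = c(1, 2),
                          labels = c("PEA", "PNEA"))
table(sdem$clase1_lab, useNA = "ifany")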
We can then restrict our analysis to those between 15 and 97 years old. We will also drop the observations with missing information, which correspond to the household members who moved out.
# keep the working-age population used in the rest of the session
sdem<-sdem[which(sdem$eda>=15 & sdem$eda<=97),]
# drop former household members who moved out
sdem<-sdem[which(sdem$c_res!=2),]
table(sdem$clase1,sdem$sex)
##
## 1 2
## 0 173 172
## 1 92483 62143
## 2 32493 77143
Finally, there are some problems with the data: those zeroes shouldn't be there, so let's remove them from the dataset:
sdem<-sdem[which(sdem$clase1>0),]
table(sdem$clase1,sdem$sex)
##
## 1 2
## 1 92483 62143
## 2 32493 77143
Now, the table we just created represents the surveyed population by gender and labor force status. But wait… shouldn't we have more people?
This is a survey, so we only have a sample of the population. We could estimate proportions, assuming the sample was drawn with simple random sampling (SRS, or MAS in Spanish):
prop.table(table(sdem$clase1,sdem$sex))
##
## 1 2
## 1 0.3499671 0.2351568
## 2 0.1229575 0.2919186
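Under that SRS assumption, we could even attach a quick standard error to any of these proportions with the usual formula. A rough sketch (keep in mind this is exactly the assumption we will question later):
# Rough sketch: standard error of the PEA share under the SRS assumption
p <- mean(sdem$clase1 == 1)
n <- nrow(sdem)
sqrt(p * (1 - p) / n)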
This command gives the proportions over the grand total, but what if we just want proportions by row or by column? We can fix that with the “margin” argument.
prop.table(table(sdem$clase1,sdem$sex),margin=1) # by row
##
## 1 2
## 1 0.5981077 0.4018923
## 2 0.2963716 0.7036284
prop.table(table(sdem$clase1,sdem$sex),margin=2) # by column
##
## 1 2
## 1 0.7400061 0.4461540
## 2 0.2599939 0.5538460
And if we want percentages?
prop.table(table(sdem$clase1,sdem$sex),margin=1)*100 # by row
##
## 1 2
## 1 59.81077 40.18923
## 2 29.63716 70.36284
prop.table(table(sdem$clase1,sdem$sex),margin=2)*100 # by column
##
## 1 2
## 1 74.00061 44.61540
## 2 25.99939 55.38460
Now, is there a difference in this distribution? Here, we probably won't need any statistical test to see whether the differences are significant, but what if we really want to check?
chisq.test(table(sdem$clase1,sdem$sex))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(sdem$clase1, sdem$sex)
## X-squared = 23430, df = 1, p-value < 2.2e-16
We can explore more variables; now let's go with clase2:
prop.table(table(sdem$clase2,sdem$sex),margin=2)*100
##
## 1 2
## 1 70.644764 42.751605
## 2 3.355844 1.863791
## 3 6.937332 9.114340
## 4 19.062060 46.270264
This distribution is misleading. We usually want the PEA (economically active population) and the PNEA (non-economically active population) separately and, within each, identify the percentages who are employed or not, as well as those who are available to work and those who aren't.
The best way to do this is to create subsets and then get the proportions:
pea<- sdem[which(sdem$clase1==1),]
prop.table(table(pea$clase2,pea$sex),margin=2)*100
##
## 1 2
## 1 95.465113 95.822538
## 2 4.534887 4.177462
chisq.test(table(pea$clase2,pea$sex))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(pea$clase2, pea$sex)
## X-squared = 11.225, df = 1, p-value = 0.0008071
remove(pea)
pnea<- sdem[which(sdem$clase1==2),]
prop.table(table(pnea$clase2,pnea$sex),margin=2)*100
##
## 1 2
## 3 26.68267 16.45645
## 4 73.31733 83.54355
chisq.test(table(pnea$clase2,pnea$sex))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(pnea$clase2, pnea$sex)
## X-squared = 1523.2, df = 1, p-value < 2.2e-16
remove(pnea)
Now, let's explore the employed population by position in the occupation (pos_ocu). We have some basic groups:
- Trabajadores subordinados y remunerados (paid employees)
- Empleadores (employers)
- Trabajadores por cuenta propia (independent workers)
- Trabajadores sin pago (unpaid workers)
peao<-sdem[which(sdem$clase2==1),]
table(peao$pos_ocu)
##
## 1 2 3 4
## 104674 7284 30775 5103
But what if we want the names of the categories instead of numbers?
peao$pos_ocu<-factor(peao$pos_ocu,
levels=c(1,2,3,4),
labels=c("Employees","Employers","Independent","Unpaid"))
table(peao$pos_ocu)
##
## Employees Employers Independent Unpaid
## 104674 7284 30775 5103
table(peao$pos_ocu,peao$sex)
##
## 1 2
## Employees 62752 41922
## Employers 5666 1618
## Independent 17798 12977
## Unpaid 2073 3030
Now, we can do the same with sex:
peao$sex<-factor(peao$sex,
levels=c(1,2),
labels=c("Men","Women"))
table(peao$pos_ocu,peao$sex)
##
## Men Women
## Employees 62752 41922
## Employers 5666 1618
## Independent 17798 12977
## Unpaid 2073 3030
prop.table(table(peao$pos_ocu,peao$sex),margin=2)*100
##
## Men Women
## Employees 71.075672 70.401532
## Employers 6.417561 2.717181
## Independent 20.158797 21.792869
## Unpaid 2.347971 5.088418
chisq.test(table(peao$pos_ocu,peao$sex))
##
## Pearson's Chi-squared test
##
## data: table(peao$pos_ocu, peao$sex)
## X-squared = 1809.9, df = 3, p-value < 2.2e-16
Now, we can explore access to Social Security:
peao$seg_soc<-factor(peao$seg_soc,
levels=c(1,2,3),
labels=c("With","Without","Unsp"))
prop.table(table(peao$seg_soc,peao$sex),margin=2)*100
##
## Men Women
## With 42.6825539 44.5614389
## Without 56.5902887 54.8222413
## Unsp 0.7271574 0.6163199
Now, explore income level:
peao$ing7c<-factor(peao$ing7c,
levels=c(1,2,3,4,5,6,7),
labels=c("1SM","1to2SM","2to3SM","3to5SM",
"More5SM","NoIncome","Unsp"))
prop.table(table(peao$ing7c,peao$sex),margin=2)*100
##
## Men Women
## 1SM 18.579891 30.669891
## 1to2SM 37.986612 34.512234
## 2to3SM 15.803781 10.862008
## 3to5SM 7.655540 5.362151
## More5SM 2.923354 1.649118
## NoIncome 3.480615 5.180782
## Unsp 13.570207 11.763817
The survey also has some of this information as continuous variables:
summary(peao$hrsocup)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 30.00 44.00 40.28 50.00 168.00
summary(peao$ingocup)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 4500 5425 7740 245100
summary(peao$ing_x_hrs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 24.92 34.00 41.67 6201.55
If we want to obtain indicators by gender, we have to use a library… well… we don't have to… but it is the easiest way:
library(doBy)
## Registered S3 methods overwritten by 'tibble':
## method from
## format.tbl pillar
## print.tbl pillar
summaryBy(hrsocup~sex,data=peao,FUN=mean)
## sex hrsocup.mean
## 1 Men 43.33586
## 2 Women 35.75433
summaryBy(ingocup~sex,data=peao,FUN=mean)
## sex ingocup.mean
## 1 Men 5901.897
## 2 Women 4718.569
summaryBy(ing_x_hrs~sex,data=peao,FUN=mean)
## sex ing_x_hrs.mean
## 1 Men 33.79477
## 2 Women 34.30119
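A nice thing about summaryBy is that we can ask for several variables and statistics at once. Something along these lines should also give us the medians (a sketch on the same data):
# Sketch: means and medians of hours and income in one call
summaryBy(hrsocup + ingocup ~ sex, data = peao, FUN = c(mean, median))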
And if we want to see whether these values are, in fact, different:
t.test(hrsocup~sex,data=peao)
##
## Welch Two Sample t-test
##
## data: hrsocup by sex
## t = 76.248, df = 127292, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 7.386649 7.776420
## sample estimates:
## mean in group Men mean in group Women
## 43.33586 35.75433
t.test(ingocup~sex,data=peao)
##
## Welch Two Sample t-test
##
## data: ingocup by sex
## t = 35.636, df = 144053, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1118.245 1248.411
## sample estimates:
## mean in group Men mean in group Women
## 5901.897 4718.569
t.test(ing_x_hrs~sex,data=peao)
##
## Welch Two Sample t-test
##
## data: ing_x_hrs by sex
## t = -1.4243, df = 129399, p-value = 0.1544
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.2033296 0.1904894
## sample estimates:
## mean in group Men mean in group Women
## 33.79477 34.30119
Now, this is a survey, a representative survey… which means that we should be able to estimate figures for the whole country.
If we just go with a table:
table(peao$pos_ocu,peao$sex)
##
## Men Women
## Employees 62752 41922
## Employers 5666 1618
## Independent 17798 12977
## Unpaid 2073 3030
We are obtaining information for the sample; we can't assume it is the same for the country. For instance, if we estimated the proportions, could we say they hold at the national level?
If we were in an SRS setting, we could use these percentages directly. However, INEGI uses a more complex sampling design.
The first key variable is the expansion factor (fac_tri). But we need a new library:
library(questionr)
wtd.table(peao$pos_ocu,peao$sex,weights=peao$fac_tri)
## Men Women
## Employees 22421744 13905588
## Employers 2046667 544798
## Independent 7264378 4680928
## Unpaid 917523 1157894
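Since the result of wtd.table behaves like an ordinary table, we can reuse prop.table on it to turn these expanded counts into national-level percentages. A quick sketch:
# Sketch: weighted (expanded) distribution of pos_ocu by sex, in percentages
wt <- wtd.table(peao$pos_ocu, peao$sex, weights = peao$fac_tri)
round(prop.table(wt, margin = 2) * 100, 1)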
We run into some problems when trying to perform basic t-tests in weighted mode. We can see this better with a more interesting approach.
Let's fit a linear regression for income per hour, using sex, eda, and pos_ocu as predictors… The first model is easy:
model1<-lm(ing_x_hrs~sex+eda+pos_ocu,data=peao)
summary(model1)
##
## Call:
## lm(formula = ing_x_hrs ~ sex + eda + pos_ocu, data = peao)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.1 -32.9 -8.5 7.0 6166.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.17645 0.55258 56.420 < 2e-16 ***
## sexWomen 2.01327 0.35700 5.639 1.71e-08 ***
## eda 0.05898 0.01304 4.524 6.07e-06 ***
## pos_ocuEmployers 14.81179 0.82195 18.020 < 2e-16 ***
## pos_ocuIndependent 0.57878 0.44827 1.291 0.197
## pos_ocuUnpaid -34.35715 0.96309 -35.674 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.91 on 147830 degrees of freedom
## Multiple R-squared: 0.01176, Adjusted R-squared: 0.01172
## F-statistic: 351.8 on 5 and 147830 DF, p-value: < 2.2e-16
Now, we will use the weighted approach:
model1wt<-lm(ing_x_hrs~sex+eda+pos_ocu,data=peao,
weights=fac_tri)
summary(model1wt)
##
## Call:
## lm(formula = ing_x_hrs ~ sex + eda + pos_ocu, data = peao, weights = fac_tri)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -6125 -339 -72 137 174854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.56213 0.50534 58.499 < 2e-16 ***
## sexWomen 3.15277 0.33008 9.552 < 2e-16 ***
## eda 0.02211 0.01193 1.853 0.063845 .
## pos_ocuEmployers 12.92165 0.75804 17.046 < 2e-16 ***
## pos_ocuIndependent -1.37278 0.40113 -3.422 0.000621 ***
## pos_ocuUnpaid -32.04295 0.83270 -38.481 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1161 on 147830 degrees of freedom
## Multiple R-squared: 0.01281, Adjusted R-squared: 0.01278
## F-statistic: 383.7 on 5 and 147830 DF, p-value: < 2.2e-16
Now, the most appropriate way to do this:
library(survey)
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
## dotchart
# id: primary sampling units (upm); strata: design strata (est_d_tri);
# weights: quarterly expansion factor (fac_tri)
design<-svydesign(id=~upm,weights=~fac_tri,strata=~est_d_tri,data=peao,
lonely.psu="certainty")
# how strata with a single PSU should be handled
options(survey.lonely.psu="adjust")
model1d<-svyglm(ing_x_hrs~sex+eda+pos_ocu,design=design,family=gaussian)
summary(model1d)
##
## Call:
## svyglm(formula = ing_x_hrs ~ sex + eda + pos_ocu, design = design,
## family = gaussian)
##
## Survey design:
## svydesign(id = ~upm, weights = ~fac_tri, strata = ~est_d_tri,
## data = peao, lonely.psu = "certainty")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.56213 0.71644 41.262 < 2e-16 ***
## sexWomen 3.15277 0.58766 5.365 8.2e-08 ***
## eda 0.02211 0.01885 1.173 0.2408
## pos_ocuEmployers 12.92165 1.44152 8.964 < 2e-16 ***
## pos_ocuIndependent -1.37278 0.81272 -1.689 0.0912 .
## pos_ocuUnpaid -32.04295 0.38757 -82.676 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 3763.835)
##
## Number of Fisher Scoring iterations: 2
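Notice that the coefficients are the same as in the weighted lm, but the standard errors change, because the estimation now accounts for the clustering (upm) and the stratification (est_d_tri). The same design object lets us redo the earlier descriptives in a design-based way; a quick sketch using two other functions from the survey package:
# Sketch: design-based mean hours worked by sex, and a design-based test
# of the sex gap in hourly income
svyby(~hrsocup, ~sex, design, svymean)
svyttest(ing_x_hrs ~ sex, design)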