In this session, we will learn how to use the ENOE (Encuesta Nacional de Ocupación y Empleo), INEGI's main survey on employment and occupations.
This survey is the one you will have to use for your final project.
setwd("~/Downloads/enoe_n_2021_trim1_csv")
sdem<-read.csv("ENOEN_SDEMT121.csv")
Remember, my advice is to always start with a basic analysis of the dataset. We can use the names() function to explore the variables in our dataset:
names(sdem)
## [1] "r_def" "loc" "mun" "est" "est_d_tri"
## [6] "est_d_men" "ageb" "t_loc_tri" "t_loc_men" "cd_a"
## [11] "ent" "con" "upm" "d_sem" "n_pro_viv"
## [16] "v_sel" "n_hog" "h_mud" "n_ent" "per"
## [21] "n_ren" "c_res" "par_c" "sex" "eda"
## [26] "nac_dia" "nac_mes" "nac_anio" "l_nac_c" "cs_p12"
## [31] "cs_p13_1" "cs_p13_2" "cs_p14_c" "cs_p15" "cs_p16"
## [36] "cs_p17" "n_hij" "e_con" "cs_ad_mot" "cs_p20_des"
## [41] "cs_ad_des" "cs_nr_mot" "cs_p22_des" "cs_nr_ori" "ur"
## [46] "zona" "salario" "fac_tri" "fac_men" "clase1"
## [51] "clase2" "clase3" "pos_ocu" "seg_soc" "rama"
## [56] "c_ocu11c" "ing7c" "dur9c" "emple7c" "medica5c"
## [61] "buscar5c" "rama_est1" "rama_est2" "dur_est" "ambito1"
## [66] "ambito2" "tue1" "tue2" "tue3" "busqueda"
## [71] "d_ant_lab" "d_cexp_est" "dur_des" "sub_o" "s_clasifi"
## [76] "remune2c" "pre_asa" "tip_con" "dispo" "nodispo"
## [81] "c_inac5c" "pnea_est" "niv_ins" "eda5c" "eda7c"
## [86] "eda12c" "eda19c" "hij5c" "domestico" "anios_esc"
## [91] "hrsocup" "ingocup" "ing_x_hrs" "tpg_p8a" "tcco"
## [96] "cp_anoc" "imssissste" "ma48me1sm" "p14apoyos" "scian"
## [101] "t_tra" "emp_ppal" "tue_ppal" "trans_ppal" "mh_fil2"
## [106] "mh_col" "sec_ins" "tipo" "mes_cal" "ca"
But what is the meaning of all these variables? INEGI is really good at providing documentation. We have several options, but the main things we should look at are:
- File descriptor
- Questionnaire
- Additional resources
If we go to the file descriptor, we will find that the dataset is split into different files. The one I opened is the sociodemographic file (SDEM). It contains many of the basic variables of interest.
The first variables we see are related to the methodology and are used to control for the survey design (we will talk about this later), but we will also find the variables that allow us to merge the different files, as in the sketch below.
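For example, if later in the project you need variables from the questionnaire files (the COE tables), a merge along these lines should work. This is only a sketch: the COE file name and the key columns are my reading of the file descriptor, so verify them there before using it.
# Hypothetical sketch: merge the sociodemographic table with the first
# questionnaire table (COE1); check the file descriptor for the exact keys
coe1 <- read.csv("ENOEN_COE1T121.csv")
keys <- c("cd_a", "ent", "con", "v_sel", "n_hog", "h_mud", "n_ren")
sdem_coe1 <- merge(sdem, coe1, by = keys)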
We can start with some basic procedures:
summary(sdem$eda)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 16.00 31.00 33.57 49.00 99.00 6876
table(sdem$sex)
##
## 1 2
## 165625 178227
table(sdem$eda, sdem$sex)
##
## 1 2
## 0 1658 1702
## 1 2053 1994
## 2 2402 2211
## 3 2444 2415
## 4 2577 2544
## 5 2676 2532
## 6 2744 2683
## 7 2814 2671
## 8 2873 2807
## 9 2964 2765
## 10 2994 2880
## 11 2991 2906
## 12 3125 2850
## 13 3043 2840
## 14 3008 2858
## 15 3113 2919
## 16 3028 2998
## 17 2982 2901
## 18 3102 3057
## 19 3088 2935
## 20 3069 3054
## 21 2916 2920
## 22 2851 2841
## 23 2729 2782
## 24 2685 2699
## 25 2673 2783
## 26 2548 2832
## 27 2497 2638
## 28 2551 2666
## 29 2288 2436
## 30 2679 2867
## 31 2273 2533
## 32 2206 2544
## 33 2285 2530
## 34 2160 2425
## 35 2215 2615
## 36 2275 2522
## 37 2039 2391
## 38 2212 2447
## 39 2133 2461
## 40 2413 2766
## 41 2265 2413
## 42 2068 2445
## 43 2172 2538
## 44 1975 2331
## 45 2211 2509
## 46 2275 2507
## 47 2080 2443
## 48 2111 2469
## 49 2036 2451
## 50 2280 2545
## 51 2033 2267
## 52 1849 2248
## 53 1823 2198
## 54 1807 2043
## 55 1809 2102
## 56 1738 2096
## 57 1586 1927
## 58 1506 1767
## 59 1391 1728
## 60 1678 1992
## 61 1482 1736
## 62 1255 1432
## 63 1330 1596
## 64 1257 1439
## 65 1230 1443
## 66 1144 1324
## 67 936 1132
## 68 969 1199
## 69 834 1038
## 70 931 1107
## 71 766 953
## 72 714 864
## 73 675 906
## 74 595 715
## 75 661 818
## 76 594 707
## 77 454 581
## 78 505 590
## 79 399 474
## 80 470 527
## 81 337 436
## 82 243 337
## 83 249 341
## 84 225 337
## 85 240 309
## 86 179 262
## 87 146 234
## 88 126 184
## 89 85 160
## 90 97 166
## 91 62 105
## 92 50 72
## 93 62 82
## 94 42 59
## 95 26 55
## 96 20 48
## 97 56 109
## 98 90 97
## 99 20 14
Now, as I told you yesterday, this is the survey we use to measure labor force participation. This information is in the “clase1” variable:
table(sdem$clase1)
##
## 0 1 2
## 68567 155794 126367
Go to the file descriptor and find out the meaning of 0, 1, 2…
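Once you have checked the descriptor, you could attach labels the same way we will do later with other variables. As a sketch (my reading: 1 is the economically active population, PEA, and 2 the non-economically active, PNEA; look up the meaning of 0 yourself):
# Sketch: label clase1 in a new column, so the numeric version stays available
sdem$clase1_lab <- factor(sdem$clase1, levels = c(1, 2),
                          labels = c("PEA", "PNEA"))
table(sdem$clase1_lab, useNA = "ifany")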
We can then restrict our analysis to those between 15 and 97 years old. We will also drop the observations with missing information, which correspond to the household members who moved out.
# keep the working-age population used in the rest of the session
sdem<-sdem[which(sdem$eda>=15 & sdem$eda<=97),]
# drop former household members who moved out
sdem<-sdem[which(sdem$c_res!=2),]
table(sdem$clase1,sdem$sex)
##
## 1 2
## 0 173 172
## 1 92483 62143
## 2 32493 77143
Finally, there are some problems with the data: those zeroes shouldn't be there, so let's remove them from the dataset:
sdem<-sdem[which(sdem$clase1>0),]
table(sdem$clase1,sdem$sex)
##
## 1 2
## 1 92483 62143
## 2 32493 77143
Now, the table we just created represents the surveyed population by gender and labor force status. But wait… shouldn't we have more people?
This is a survey, so we only have a sample of the population. We could estimate proportions, assuming the sample was drawn with simple random sampling (SRS, or MAS in Spanish):
prop.table(table(sdem$clase1,sdem$sex))
##
## 1 2
## 1 0.3499671 0.2351568
## 2 0.1229575 0.2919186
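Under that SRS assumption, we could even attach a quick standard error to any of these proportions with the usual formula. A rough sketch (keep in mind this is exactly the assumption we will question later):
# Rough sketch: standard error of the PEA share under the SRS assumption
p <- mean(sdem$clase1 == 1)
n <- nrow(sdem)
sqrt(p * (1 - p) / n)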
This command gives the proportions over the grand total, but what if we just want proportions by row or by column? We can fix that with the “margin” argument.
prop.table(table(sdem$clase1,sdem$sex),margin=1) # by row
##
## 1 2
## 1 0.5981077 0.4018923
## 2 0.2963716 0.7036284
prop.table(table(sdem$clase1,sdem$sex),margin=2) # by column
##
## 1 2
## 1 0.7400061 0.4461540
## 2 0.2599939 0.5538460
And if we want percentages?
prop.table(table(sdem$clase1,sdem$sex),margin=1)*100 # by row
##
## 1 2
## 1 59.81077 40.18923
## 2 29.63716 70.36284
prop.table(table(sdem$clase1,sdem$sex),margin=2)*100 # by column
##
## 1 2
## 1 74.00061 44.61540
## 2 25.99939 55.38460
Now, is there a difference in this distribution? Here, we probably won't need any statistical test to see whether the differences are significant, but what if we really want to check?
chisq.test(table(sdem$clase1,sdem$sex))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(sdem$clase1, sdem$sex)
## X-squared = 23430, df = 1, p-value < 2.2e-16
We can explore more variables; now let's go with clase2:
prop.table(table(sdem$clase2,sdem$sex),margin=2)*100
##
## 1 2
## 1 70.644764 42.751605
## 2 3.355844 1.863791
## 3 6.937332 9.114340
## 4 19.062060 46.270264
This distribution is misleading. We usually want the PEA (economically active population) and the PNEA (non-economically active population) separately and, within each, identify the percentages who are employed or not, as well as those who are available to work and those who aren't.
The best way to do this is to create subsets and then get the proportions:
pea<- sdem[which(sdem$clase1==1),]
prop.table(table(pea$clase2,pea$sex),margin=2)*100
##
## 1 2
## 1 95.465113 95.822538
## 2 4.534887 4.177462
chisq.test(table(pea$clase2,pea$sex))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(pea$clase2, pea$sex)
## X-squared = 11.225, df = 1, p-value = 0.0008071
remove(pea)
pnea<- sdem[which(sdem$clase1==2),]
prop.table(table(pnea$clase2,pnea$sex),margin=2)*100
##
## 1 2
## 3 26.68267 16.45645
## 4 73.31733 83.54355
chisq.test(table(pnea$clase2,pnea$sex))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(pnea$clase2, pnea$sex)
## X-squared = 1523.2, df = 1, p-value < 2.2e-16
remove(pnea)
Now, let's explore the employed population by position in the occupation (pos_ocu). We have some basic groups:
- Trabajadores subordinados y remunerados (paid employees)
- Empleadores (employers)
- Trabajadores por cuenta propia (independent workers)
- Trabajadores sin pago (unpaid workers)
peao<-sdem[which(sdem$clase2==1),]
table(peao$pos_ocu)
##
## 1 2 3 4
## 104674 7284 30775 5103
But what if we want the names of the categories instead of numbers?
peao$pos_ocu<-factor(peao$pos_ocu,
levels=c(1,2,3,4),
labels=c("Employees","Employers","Independent","Unpaid"))
table(peao$pos_ocu)
##
## Employees Employers Independent Unpaid
## 104674 7284 30775 5103
table(peao$pos_ocu,peao$sex)
##
## 1 2
## Employees 62752 41922
## Employers 5666 1618
## Independent 17798 12977
## Unpaid 2073 3030
Now, we can do the same with sex:
peao$sex<-factor(peao$sex,
levels=c(1,2),
labels=c("Men","Women"))
table(peao$pos_ocu,peao$sex)
##
## Men Women
## Employees 62752 41922
## Employers 5666 1618
## Independent 17798 12977
## Unpaid 2073 3030
prop.table(table(peao$pos_ocu,peao$sex),margin=2)*100
##
## Men Women
## Employees 71.075672 70.401532
## Employers 6.417561 2.717181
## Independent 20.158797 21.792869
## Unpaid 2.347971 5.088418
chisq.test(table(peao$pos_ocu,peao$sex))
##
## Pearson's Chi-squared test
##
## data: table(peao$pos_ocu, peao$sex)
## X-squared = 1809.9, df = 3, p-value < 2.2e-16
Now, we can explore access to Social Security:
peao$seg_soc<-factor(peao$seg_soc,
levels=c(1,2,3),
labels=c("With","Without","Unsp"))
prop.table(table(peao$seg_soc,peao$sex),margin=2)*100
##
## Men Women
## With 42.6825539 44.5614389
## Without 56.5902887 54.8222413
## Unsp 0.7271574 0.6163199
Now, explore income level:
peao$ing7c<-factor(peao$ing7c,
levels=c(1,2,3,4,5,6,7),
labels=c("1SM","1to2SM","2to3SM","3to5SM",
"More5SM","NoIncome","Unsp"))
prop.table(table(peao$ing7c,peao$sex),margin=2)*100
##
## Men Women
## 1SM 18.579891 30.669891
## 1to2SM 37.986612 34.512234
## 2to3SM 15.803781 10.862008
## 3to5SM 7.655540 5.362151
## More5SM 2.923354 1.649118
## NoIncome 3.480615 5.180782
## Unsp 13.570207 11.763817
The survey also has some of this information as continuous variables:
summary(peao$hrsocup)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 30.00 44.00 40.28 50.00 168.00
summary(peao$ingocup)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 4500 5425 7740 245100
summary(peao$ing_x_hrs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 24.92 34.00 41.67 6201.55
If we want to obtain indicators by gender, we have to use a library… well… we don't have to… but it is the easiest way:
library(doBy)
## Registered S3 methods overwritten by 'tibble':
## method from
## format.tbl pillar
## print.tbl pillar
summaryBy(hrsocup~sex,data=peao,FUN=mean)
## sex hrsocup.mean
## 1 Men 43.33586
## 2 Women 35.75433
summaryBy(ingocup~sex,data=peao,FUN=mean)
## sex ingocup.mean
## 1 Men 5901.897
## 2 Women 4718.569
summaryBy(ing_x_hrs~sex,data=peao,FUN=mean)
## sex ing_x_hrs.mean
## 1 Men 33.79477
## 2 Women 34.30119
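A nice thing about summaryBy is that we can ask for several variables and statistics at once. Something along these lines should also give us the medians (a sketch on the same data):
# Sketch: means and medians of hours and income in one call
summaryBy(hrsocup + ingocup ~ sex, data = peao, FUN = c(mean, median))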
And if we want to see whether these values are, in fact, different:
t.test(hrsocup~sex,data=peao)
##
## Welch Two Sample t-test
##
## data: hrsocup by sex
## t = 76.248, df = 127292, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 7.386649 7.776420
## sample estimates:
## mean in group Men mean in group Women
## 43.33586 35.75433
t.test(ingocup~sex,data=peao)
##
## Welch Two Sample t-test
##
## data: ingocup by sex
## t = 35.636, df = 144053, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1118.245 1248.411
## sample estimates:
## mean in group Men mean in group Women
## 5901.897 4718.569
t.test(ing_x_hrs~sex,data=peao)
##
## Welch Two Sample t-test
##
## data: ing_x_hrs by sex
## t = -1.4243, df = 129399, p-value = 0.1544
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.2033296 0.1904894
## sample estimates:
## mean in group Men mean in group Women
## 33.79477 34.30119
Now, this is a survey, a representative survey… which means that we should be able to estimate figures for the whole country.
If we just go with a table:
table(peao$pos_ocu,peao$sex)
##
## Men Women
## Employees 62752 41922
## Employers 5666 1618
## Independent 17798 12977
## Unpaid 2073 3030
We are obtaining information for the sample; we can't assume it is the same for the country. For instance, if we estimated the proportions, could we say they hold at the national level?
If we were in an SRS setting, we could use these percentages directly. However, INEGI uses a more complex sampling design.
The first key variable is the expansion factor (fac_tri). But we need a new library:
library(questionr)
wtd.table(peao$pos_ocu,peao$sex,weights=peao$fac_tri)
## Men Women
## Employees 22421744 13905588
## Employers 2046667 544798
## Independent 7264378 4680928
## Unpaid 917523 1157894
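Since the result of wtd.table behaves like an ordinary table, we can reuse prop.table on it to turn these expanded counts into national-level percentages. A quick sketch:
# Sketch: weighted (expanded) distribution of pos_ocu by sex, in percentages
wt <- wtd.table(peao$pos_ocu, peao$sex, weights = peao$fac_tri)
round(prop.table(wt, margin = 2) * 100, 1)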
We run into some problems when trying to perform basic t-tests in weighted mode. We can see this better with a more interesting approach.
Let's fit a linear regression for income per hour, using sex, eda, and pos_ocu as predictors… The first model is easy:
model1<-lm(ing_x_hrs~sex+eda+pos_ocu,data=peao)
summary(model1)
##
## Call:
## lm(formula = ing_x_hrs ~ sex + eda + pos_ocu, data = peao)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.1 -32.9 -8.5 7.0 6166.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.17645 0.55258 56.420 < 2e-16 ***
## sexWomen 2.01327 0.35700 5.639 1.71e-08 ***
## eda 0.05898 0.01304 4.524 6.07e-06 ***
## pos_ocuEmployers 14.81179 0.82195 18.020 < 2e-16 ***
## pos_ocuIndependent 0.57878 0.44827 1.291 0.197
## pos_ocuUnpaid -34.35715 0.96309 -35.674 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.91 on 147830 degrees of freedom
## Multiple R-squared: 0.01176, Adjusted R-squared: 0.01172
## F-statistic: 351.8 on 5 and 147830 DF, p-value: < 2.2e-16
Now, we will use the weighted approach:
model1wt<-lm(ing_x_hrs~sex+eda+pos_ocu,data=peao,
weights=fac_tri)
summary(model1wt)
##
## Call:
## lm(formula = ing_x_hrs ~ sex + eda + pos_ocu, data = peao, weights = fac_tri)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -6125 -339 -72 137 174854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.56213 0.50534 58.499 < 2e-16 ***
## sexWomen 3.15277 0.33008 9.552 < 2e-16 ***
## eda 0.02211 0.01193 1.853 0.063845 .
## pos_ocuEmployers 12.92165 0.75804 17.046 < 2e-16 ***
## pos_ocuIndependent -1.37278 0.40113 -3.422 0.000621 ***
## pos_ocuUnpaid -32.04295 0.83270 -38.481 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1161 on 147830 degrees of freedom
## Multiple R-squared: 0.01281, Adjusted R-squared: 0.01278
## F-statistic: 383.7 on 5 and 147830 DF, p-value: < 2.2e-16
Now, the most appropriate way to do this:
library(survey)
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
## dotchart
# id: primary sampling units (upm); strata: design strata (est_d_tri);
# weights: quarterly expansion factor (fac_tri)
design<-svydesign(id=~upm,weights=~fac_tri,strata=~est_d_tri,data=peao,
lonely.psu="certainty")
# how strata with a single PSU should be handled
options(survey.lonely.psu="adjust")
model1d<-svyglm(ing_x_hrs~sex+eda+pos_ocu,design=design,family=gaussian)
summary(model1d)
##
## Call:
## svyglm(formula = ing_x_hrs ~ sex + eda + pos_ocu, design = design,
## family = gaussian)
##
## Survey design:
## svydesign(id = ~upm, weights = ~fac_tri, strata = ~est_d_tri,
## data = peao, lonely.psu = "certainty")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.56213 0.71644 41.262 < 2e-16 ***
## sexWomen 3.15277 0.58766 5.365 8.2e-08 ***
## eda 0.02211 0.01885 1.173 0.2408
## pos_ocuEmployers 12.92165 1.44152 8.964 < 2e-16 ***
## pos_ocuIndependent -1.37278 0.81272 -1.689 0.0912 .
## pos_ocuUnpaid -32.04295 0.38757 -82.676 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 3763.835)
##
## Number of Fisher Scoring iterations: 2
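Notice that the coefficients are the same as in the weighted lm, but the standard errors change, because the estimation now accounts for the clustering (upm) and the stratification (est_d_tri). The same design object lets us redo the earlier descriptives in a design-based way; a quick sketch using two other functions from the survey package:
# Sketch: design-based mean hours worked by sex, and a design-based test
# of the sex gap in hourly income
svyby(~hrsocup, ~sex, design, svymean)
svyttest(ing_x_hrs ~ sex, design)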