In this session, we will learn how to use the ENOE, a really important survey collecting information on employment and occupations.
This survey is the one you will have to use for your final project.
setwd("~/Downloads/enoe_n_2021_trim1_csv")
sdem<-read.csv("ENOEN_SDEMT121.csv")
Remember, my advice is to always start with a basic analisis of the dataset. We can use the “names” command to explore the variables in our dataset
names(sdem)
## [1] "r_def" "loc" "mun" "est" "est_d_tri"
## [6] "est_d_men" "ageb" "t_loc_tri" "t_loc_men" "cd_a"
## [11] "ent" "con" "upm" "d_sem" "n_pro_viv"
## [16] "v_sel" "n_hog" "h_mud" "n_ent" "per"
## [21] "n_ren" "c_res" "par_c" "sex" "eda"
## [26] "nac_dia" "nac_mes" "nac_anio" "l_nac_c" "cs_p12"
## [31] "cs_p13_1" "cs_p13_2" "cs_p14_c" "cs_p15" "cs_p16"
## [36] "cs_p17" "n_hij" "e_con" "cs_ad_mot" "cs_p20_des"
## [41] "cs_ad_des" "cs_nr_mot" "cs_p22_des" "cs_nr_ori" "ur"
## [46] "zona" "salario" "fac_tri" "fac_men" "clase1"
## [51] "clase2" "clase3" "pos_ocu" "seg_soc" "rama"
## [56] "c_ocu11c" "ing7c" "dur9c" "emple7c" "medica5c"
## [61] "buscar5c" "rama_est1" "rama_est2" "dur_est" "ambito1"
## [66] "ambito2" "tue1" "tue2" "tue3" "busqueda"
## [71] "d_ant_lab" "d_cexp_est" "dur_des" "sub_o" "s_clasifi"
## [76] "remune2c" "pre_asa" "tip_con" "dispo" "nodispo"
## [81] "c_inac5c" "pnea_est" "niv_ins" "eda5c" "eda7c"
## [86] "eda12c" "eda19c" "hij5c" "domestico" "anios_esc"
## [91] "hrsocup" "ingocup" "ing_x_hrs" "tpg_p8a" "tcco"
## [96] "cp_anoc" "imssissste" "ma48me1sm" "p14apoyos" "scian"
## [101] "t_tra" "emp_ppal" "tue_ppal" "trans_ppal" "mh_fil2"
## [106] "mh_col" "sec_ins" "tipo" "mes_cal" "ca"
But, what is the meaning of all these variables. INEGI is really good at providing information. We have several options, but the things we should look at are: - File descriptor - Questionnaire - Additional resources
If we go to the file discriptor, we will find that the dataset is split into different files. The one I openned was the sociodemographic file. It cointains many of the basic variables of interest.
The first variables we see are related to the methodology. And are used to control for the survey design (we will talk about this later), but we will also find variables that will allow us to merge different datasets.
We can start with some basic procedures:
summary(sdem$eda)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 16.00 31.00 33.57 49.00 99.00 6876
table(sdem$sex)
##
## 1 2
## 165625 178227
table(sdem$eda, sdem$sex)
##
## 1 2
## 0 1658 1702
## 1 2053 1994
## 2 2402 2211
## 3 2444 2415
## 4 2577 2544
## 5 2676 2532
## 6 2744 2683
## 7 2814 2671
## 8 2873 2807
## 9 2964 2765
## 10 2994 2880
## 11 2991 2906
## 12 3125 2850
## 13 3043 2840
## 14 3008 2858
## 15 3113 2919
## 16 3028 2998
## 17 2982 2901
## 18 3102 3057
## 19 3088 2935
## 20 3069 3054
## 21 2916 2920
## 22 2851 2841
## 23 2729 2782
## 24 2685 2699
## 25 2673 2783
## 26 2548 2832
## 27 2497 2638
## 28 2551 2666
## 29 2288 2436
## 30 2679 2867
## 31 2273 2533
## 32 2206 2544
## 33 2285 2530
## 34 2160 2425
## 35 2215 2615
## 36 2275 2522
## 37 2039 2391
## 38 2212 2447
## 39 2133 2461
## 40 2413 2766
## 41 2265 2413
## 42 2068 2445
## 43 2172 2538
## 44 1975 2331
## 45 2211 2509
## 46 2275 2507
## 47 2080 2443
## 48 2111 2469
## 49 2036 2451
## 50 2280 2545
## 51 2033 2267
## 52 1849 2248
## 53 1823 2198
## 54 1807 2043
## 55 1809 2102
## 56 1738 2096
## 57 1586 1927
## 58 1506 1767
## 59 1391 1728
## 60 1678 1992
## 61 1482 1736
## 62 1255 1432
## 63 1330 1596
## 64 1257 1439
## 65 1230 1443
## 66 1144 1324
## 67 936 1132
## 68 969 1199
## 69 834 1038
## 70 931 1107
## 71 766 953
## 72 714 864
## 73 675 906
## 74 595 715
## 75 661 818
## 76 594 707
## 77 454 581
## 78 505 590
## 79 399 474
## 80 470 527
## 81 337 436
## 82 243 337
## 83 249 341
## 84 225 337
## 85 240 309
## 86 179 262
## 87 146 234
## 88 126 184
## 89 85 160
## 90 97 166
## 91 62 105
## 92 50 72
## 93 62 82
## 94 42 59
## 95 26 55
## 96 20 48
## 97 56 109
## 98 90 97
## 99 20 14
Now, as I told you yesterday, this is the survey that we use to measure labor force participation. This information in in the “clase1” variable:
table(sdem$clase1)
##
## 0 1 2
## 68567 155794 126367
Go to the file descriptor and find out the meaning for 0,1,2…
We can then, restrict our analysis to those 12 years and older. We will also drop all observations which missing information, which are the members of the household that moved out.
sdem<-sdem[which(sdem$eda>=15 & sdem$eda<=97),]
sdem<-sdem[which(sdem$c_res!=2),]
table(sdem$clase1,sdem$sex)
##
## 1 2
## 0 173 172
## 1 92483 62143
## 2 32493 77143
Finally, there are some problems with the data, those zeroes shoulndt be there, lets remove them from the dataset:
sdem<-sdem[which(sdem$clase1>0),]
table(sdem$clase1,sdem$sex)
##
## 1 2
## 1 92483 62143
## 2 32493 77143
Now, the table we just created represent the surveyed population by gender and labor force status. But wait… shouldnt we have more people?
This is a survey, so we only have a sample of our population, we could estimate proportions, assuming that sampling was made following Single Random Sample (MAS in Spanish)
prop.table(table(sdem$clase1,sdem$sex))
##
## 1 2
## 1 0.3499671 0.2351568
## 2 0.1229575 0.2919186
This command is providing the proportions from the total, but what if we just want propirtions by row or column? We can fix that by adding the “margins” option.
prop.table(table(sdem$clase1,sdem$sex),margin=1) # by row
##
## 1 2
## 1 0.5981077 0.4018923
## 2 0.2963716 0.7036284
prop.table(table(sdem$clase1,sdem$sex),margin=2) # by column
##
## 1 2
## 1 0.7400061 0.4461540
## 2 0.2599939 0.5538460
And if we want percentages?
prop.table(table(sdem$clase1,sdem$sex),margin=1)*100 # by row
##
## 1 2
## 1 59.81077 40.18923
## 2 29.63716 70.36284
prop.table(table(sdem$clase1,sdem$sex),margin=2)*100 # by column
##
## 1 2
## 1 74.00061 44.61540
## 2 25.99939 55.38460
Now, is there a difference in this distribution? Here, we probably wont need any statistical test to see if the differences are significant, but what if we really want to see that?
chisq.test(table(sdem$clase1,sdem$sex))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(sdem$clase1, sdem$sex)
## X-squared = 23430, df = 1, p-value < 2.2e-16
We can explore more variables, now we can go with clase 2:
prop.table(table(sdem$clase2,sdem$sex),margin=2)*100
##
## 1 2
## 1 70.644764 42.751605
## 2 3.355844 1.863791
## 3 6.937332 9.114340
## 4 19.062060 46.270264
This distribution is missleading, we usually want the PEA and PNEA, and from those, indentify the percentages in occupation or not, as well as those who are availale to work and thos who arent.
The best way to do this is to create subsets and then get the proportions:
pea<- sdem[which(sdem$clase1==1),]
prop.table(table(pea$clase2,pea$sex),margin=2)*100
##
## 1 2
## 1 95.465113 95.822538
## 2 4.534887 4.177462
chisq.test(table(pea$clase2,pea$sex))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(pea$clase2, pea$sex)
## X-squared = 11.225, df = 1, p-value = 0.0008071
remove(pea)
pnea<- sdem[which(sdem$clase1==2),]
prop.table(table(pnea$clase2,pnea$sex),margin=2)*100
##
## 1 2
## 3 26.68267 16.45645
## 4 73.31733 83.54355
chisq.test(table(pnea$clase2,pnea$sex))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(pnea$clase2, pnea$sex)
## X-squared = 1523.2, df = 1, p-value < 2.2e-16
remove(pnea)
Now, lets explore the population by condition of occupation. We have some basic groups: - Trabajadores subordinados y remunerados (paid employees) - Empleadore (employers) - Trabajadores por cuenta propia (independent workers) - Trabajadores sin pago (unpaid workers)
peao<-sdem[which(sdem$clase2==1),]
table(peao$pos_ocu)
##
## 1 2 3 4
## 104674 7284 30775 5103
But, what if we want the names of the categories instead if numbers?
peao$pos_ocu<-factor(peao$pos_ocu,
levels=c(1,2,3,4),
labels=c("Employees","Employers","Independent","Unpaid"))
table(peao$pos_ocu)
##
## Employees Employers Independent Unpaid
## 104674 7284 30775 5103
table(peao$pos_ocu,peao$sex)
##
## 1 2
## Employees 62752 41922
## Employers 5666 1618
## Independent 17798 12977
## Unpaid 2073 3030
Now, we can do the same with sex:
peao$sex<-factor(peao$sex,
levels=c(1,2),
labels=c("Men","Women"))
table(peao$pos_ocu,peao$sex)
##
## Men Women
## Employees 62752 41922
## Employers 5666 1618
## Independent 17798 12977
## Unpaid 2073 3030
prop.table(table(peao$pos_ocu,peao$sex),margin=2)*100
##
## Men Women
## Employees 71.075672 70.401532
## Employers 6.417561 2.717181
## Independent 20.158797 21.792869
## Unpaid 2.347971 5.088418
chisq.test(table(peao$pos_ocu,peao$sex))
##
## Pearson's Chi-squared test
##
## data: table(peao$pos_ocu, peao$sex)
## X-squared = 1809.9, df = 3, p-value < 2.2e-16
Now, we can explore access to Social Security:
peao$seg_soc<-factor(peao$seg_soc,
levels=c(1,2,3),
labels=c("With","Without","Unsp"))
prop.table(table(peao$seg_soc,peao$sex),margin=2)*100
##
## Men Women
## With 42.6825539 44.5614389
## Without 56.5902887 54.8222413
## Unsp 0.7271574 0.6163199
Now, explore income level:
peao$ing7c<-factor(peao$ing7c,
levels=c(1,2,3,4,5,6,7),
labels=c("1SM","1to2SM","2to3SM","3to5SM",
"More5SM","NoIncome","Unsp"))
prop.table(table(peao$ing7c,peao$sex),margin=2)*100
##
## Men Women
## 1SM 18.579891 30.669891
## 1to2SM 37.986612 34.512234
## 2to3SM 15.803781 10.862008
## 3to5SM 7.655540 5.362151
## More5SM 2.923354 1.649118
## NoIncome 3.480615 5.180782
## Unsp 13.570207 11.763817
The survey also has information in continuous version:
summary(peao$hrsocup)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 30.00 44.00 40.28 50.00 168.00
summary(peao$ingocup)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 4500 5425 7740 245100
summary(peao$ing_x_hrs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 24.92 34.00 41.67 6201.55
If we want to obtain indicators by gender, we have to use a library… well… we don’t have to… but it is the easier way:
library(doBy)
## Registered S3 methods overwritten by 'tibble':
## method from
## format.tbl pillar
## print.tbl pillar
summaryBy(hrsocup~sex,data=peao,FUN=mean)
## sex hrsocup.mean
## 1 Men 43.33586
## 2 Women 35.75433
summaryBy(ingocup~sex,data=peao,FUN=mean)
## sex ingocup.mean
## 1 Men 5901.897
## 2 Women 4718.569
summaryBy(ing_x_hrs~sex,data=peao,FUN=mean)
## sex ing_x_hrs.mean
## 1 Men 33.79477
## 2 Women 34.30119
And if we want to see if these values are in fact, different:
t.test(hrsocup~sex,data=peao)
##
## Welch Two Sample t-test
##
## data: hrsocup by sex
## t = 76.248, df = 127292, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 7.386649 7.776420
## sample estimates:
## mean in group Men mean in group Women
## 43.33586 35.75433
t.test(ingocup~sex,data=peao)
##
## Welch Two Sample t-test
##
## data: ingocup by sex
## t = 35.636, df = 144053, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1118.245 1248.411
## sample estimates:
## mean in group Men mean in group Women
## 5901.897 4718.569
t.test(ing_x_hrs~sex,data=peao)
##
## Welch Two Sample t-test
##
## data: ing_x_hrs by sex
## t = -1.4243, df = 129399, p-value = 0.1544
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.2033296 0.1904894
## sample estimates:
## mean in group Men mean in group Women
## 33.79477 34.30119
Now, this is a survey, a representative survey… which means that we should be capable to estimate information for the country.
If we just go with a table:
table(peao$pos_ocu,peao$sex)
##
## Men Women
## Employees 62752 41922
## Employers 5666 1618
## Independent 17798 12977
## Unpaid 2073 3030
We are obtaining information for the sample, we can’t assume this is the same for the country. For instance, if we estimated the proportions, can we say these proportions are the same at the national level?
If were on the context of SRS, we could use these percentages. H However, INEGI uses a more complex sampling approach.
The first key varible is factor.But we need a new library:
library(questionr)
wtd.table(peao$pos_ocu,peao$sex,weights=peao$fac_tri)
## Men Women
## Employees 22421744 13905588
## Employers 2046667 544798
## Independent 7264378 4680928
## Unpaid 917523 1157894
We have some problems to perform, basi t-tests in weighted mode.
We see this better if we use some more interesting approach.
Lets fit a linear regression to income per hour, we will use as predictors the variables sex, eda, pos_ocu… The first model is easy:
model1<-lm(ing_x_hrs~sex+eda+pos_ocu,data=peao)
summary(model1)
##
## Call:
## lm(formula = ing_x_hrs ~ sex + eda + pos_ocu, data = peao)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.1 -32.9 -8.5 7.0 6166.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.17645 0.55258 56.420 < 2e-16 ***
## sexWomen 2.01327 0.35700 5.639 1.71e-08 ***
## eda 0.05898 0.01304 4.524 6.07e-06 ***
## pos_ocuEmployers 14.81179 0.82195 18.020 < 2e-16 ***
## pos_ocuIndependent 0.57878 0.44827 1.291 0.197
## pos_ocuUnpaid -34.35715 0.96309 -35.674 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.91 on 147830 degrees of freedom
## Multiple R-squared: 0.01176, Adjusted R-squared: 0.01172
## F-statistic: 351.8 on 5 and 147830 DF, p-value: < 2.2e-16
Now, we will use the weighted approach:
model1wt<-lm(ing_x_hrs~sex+eda+pos_ocu,data=peao,
weights=fac_tri)
summary(model1wt)
##
## Call:
## lm(formula = ing_x_hrs ~ sex + eda + pos_ocu, data = peao, weights = fac_tri)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -6125 -339 -72 137 174854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.56213 0.50534 58.499 < 2e-16 ***
## sexWomen 3.15277 0.33008 9.552 < 2e-16 ***
## eda 0.02211 0.01193 1.853 0.063845 .
## pos_ocuEmployers 12.92165 0.75804 17.046 < 2e-16 ***
## pos_ocuIndependent -1.37278 0.40113 -3.422 0.000621 ***
## pos_ocuUnpaid -32.04295 0.83270 -38.481 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1161 on 147830 degrees of freedom
## Multiple R-squared: 0.01281, Adjusted R-squared: 0.01278
## F-statistic: 383.7 on 5 and 147830 DF, p-value: < 2.2e-16
Now, the most appopriate way to do this:
library(survey)
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
## dotchart
design<-svydesign(id=~upm,weights=~fac_tri,strata=~est_d_tri,data=peao,
lonely.psu="certainty")
options(survey.lonely.psu="adjust")
model1d<-svyglm(ing_x_hrs~sex+eda+pos_ocu,design=design,family=gaussian)
summary(model1d)
##
## Call:
## svyglm(formula = ing_x_hrs ~ sex + eda + pos_ocu, design = design,
## family = gaussian)
##
## Survey design:
## svydesign(id = ~upm, weights = ~fac_tri, strata = ~est_d_tri,
## data = peao, lonely.psu = "certainty")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.56213 0.71644 41.262 < 2e-16 ***
## sexWomen 3.15277 0.58766 5.365 8.2e-08 ***
## eda 0.02211 0.01885 1.173 0.2408
## pos_ocuEmployers 12.92165 1.44152 8.964 < 2e-16 ***
## pos_ocuIndependent -1.37278 0.81272 -1.689 0.0912 .
## pos_ocuUnpaid -32.04295 0.38757 -82.676 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 3763.835)
##
## Number of Fisher Scoring iterations: 2
Now, we are going to merge the coe1 with the peao… For this, we need to indicate R the variables that identify each observation:
coe1<-read.csv("ENOEN_COE1T121.csv")
peao<-merge(peao,coe1,by=c("cd_a","ent","con","v_sel","tipo","mes_cal",
"n_hog","h_mud","n_ren"))
remove(sdem,coe1)
names(peao)
## [1] "cd_a" "ent" "con" "v_sel" "tipo"
## [6] "mes_cal" "n_hog" "h_mud" "n_ren" "r_def.x"
## [11] "loc" "mun" "est" "est_d_tri" "est_d_men"
## [16] "ageb" "t_loc_tri" "t_loc_men" "upm.x" "d_sem.x"
## [21] "n_pro_viv.x" "n_ent.x" "per.x" "c_res" "par_c"
## [26] "sex" "eda.x" "nac_dia" "nac_mes" "nac_anio"
## [31] "l_nac_c" "cs_p12" "cs_p13_1" "cs_p13_2" "cs_p14_c"
## [36] "cs_p15" "cs_p16" "cs_p17" "n_hij" "e_con"
## [41] "cs_ad_mot" "cs_p20_des" "cs_ad_des" "cs_nr_mot" "cs_p22_des"
## [46] "cs_nr_ori" "ur.x" "zona" "salario" "fac_tri.x"
## [51] "fac_men.x" "clase1" "clase2" "clase3" "pos_ocu"
## [56] "seg_soc" "rama" "c_ocu11c" "ing7c" "dur9c"
## [61] "emple7c" "medica5c" "buscar5c" "rama_est1" "rama_est2"
## [66] "dur_est" "ambito1" "ambito2" "tue1" "tue2"
## [71] "tue3" "busqueda" "d_ant_lab" "d_cexp_est" "dur_des"
## [76] "sub_o" "s_clasifi" "remune2c" "pre_asa" "tip_con"
## [81] "dispo" "nodispo" "c_inac5c" "pnea_est" "niv_ins"
## [86] "eda5c" "eda7c" "eda12c" "eda19c" "hij5c"
## [91] "domestico" "anios_esc" "hrsocup" "ingocup" "ing_x_hrs"
## [96] "tpg_p8a" "tcco" "cp_anoc" "imssissste" "ma48me1sm"
## [101] "p14apoyos" "scian" "t_tra" "emp_ppal" "tue_ppal"
## [106] "trans_ppal" "mh_fil2" "mh_col" "sec_ins" "ca.x"
## [111] "r_def.y" "upm.y" "d_sem.y" "n_pro_viv.y" "n_ent.y"
## [116] "per.y" "eda.y" "n_inf" "p1" "p1a1"
## [121] "p1a2" "p1a3" "p1b" "p1c" "p1d"
## [126] "p1e" "p2_1" "p2_2" "p2_3" "p2_4"
## [131] "p2_9" "p2a_dia" "p2a_sem" "p2a_mes" "p2a_anio"
## [136] "p2b_dia" "p2b_sem" "p2b_mes" "p2b_anio" "p2b"
## [141] "p2c" "p2d1" "p2d2" "p2d3" "p2d4"
## [146] "p2d5" "p2d6" "p2d7" "p2d8" "p2d9"
## [151] "p2d10" "p2d11" "p2d99" "p2e" "p2f"
## [156] "p2g1" "p2g2" "p2h1" "p2h2" "p2h3"
## [161] "p2h4" "p2h9" "p3" "p3a" "p3b"
## [166] "p3c1" "p3c2" "p3c3" "p3c4" "p3c9"
## [171] "p3d" "p3e" "p3f1" "p3f2" "p3g1_1"
## [176] "p3g1_2" "p3g2_1" "p3g2_2" "p3g3_1" "p3g3_2"
## [181] "p3g4_1" "p3g4_2" "p3g9" "p3g_tot" "p3h"
## [186] "p3i" "p3j" "p3k1" "p3k2" "p3l1"
## [191] "p3l2" "p3l3" "p3l4" "p3l5" "p3l9"
## [196] "p3m1" "p3m2" "p3m3" "p3m4" "p3m5"
## [201] "p3m6" "p3m7" "p3m8" "p3m9" "p3n"
## [206] "p3o" "p3p1" "p3p2" "p3q" "p3r_anio"
## [211] "p3r_mes" "p3r" "p3s" "p3t_anio" "p3t_mes"
## [216] "p4" "p4_1" "p4_2" "p4_3" "p4a"
## [221] "p4a_1" "p4b" "p4c" "p4d1" "p4d2"
## [226] "p4d3" "p4e" "p4f" "p4g" "p4h"
## [231] "p4i" "p4i_1" "p5" "p5a" "p5b"
## [236] "p5c_hlu" "p5c_mlu" "p5c_hma" "p5c_mma" "p5c_hmi"
## [241] "p5c_mmi" "p5c_hju" "p5c_mju" "p5c_hvi" "p5c_mvi"
## [246] "p5c_hsa" "p5c_msa" "p5c_hdo" "p5c_mdo" "p5c_thrs"
## [251] "p5c_tdia" "p5d" "p5e1" "p5e_hlu" "p5e_mlu"
## [256] "p5e_hma" "p5e_mma" "p5e_hmi" "p5e_mmi" "p5e_hju"
## [261] "p5e_mju" "p5e_hvi" "p5e_mvi" "p5e_hsa" "p5e_msa"
## [266] "p5e_hdo" "p5e_mdo" "p5e_thrs" "p5e_tdia" "p5f"
## [271] "p5g1" "p5g2" "p5g3" "p5g4" "p5g5"
## [276] "p5g6" "p5g7" "p5g8" "p5g9" "p5g10"
## [281] "p5g11" "p5g12" "p5g13" "p5g14" "p5g15"
## [286] "p5g99" "p5h" "ur.y" "ca.y" "fac_tri.y"
## [291] "fac_men.y"
Now, I will select the following occupations: - 22: Investigadores y profesionistas en ciencias exactas… - 221: Investigadores y profesionistas en fisica, matematicas, estadistica y actuaria - 222: Investigadores y profesionistas en ciencias biologicas, quimicas y del medio ambiente
table(peao$p3)
##
## 1111 1112 1113 1121 1122 1131 1132 1133 1134 1135 1211 1212 1221 1222 1223 1224
## 3 3 39 30 143 16 26 6 2 7 259 139 57 40 311 50
## 1225 1226 1311 1312 1313 1314 1315 1321 1322 1323 1324 1411 1412 1421 1422 1423
## 2 26 43 1 10 294 147 16 14 81 2 689 325 39 60 66
## 1511 1512 1521 1522 1523 1524 1525 1526 1611 1612 1613 1614 1615 1619 1621 1622
## 538 124 120 95 178 113 2 61 14 14 29 248 239 1 67 51
## 1623 1624 1629 1711 1712 1721 1722 1723 1999 2111 2112 2113 2121 2122 2123 2131
## 120 12 8 152 39 9 74 31 8 255 179 5 1282 71 18 7
## 2132 2133 2134 2135 2141 2142 2143 2144 2145 2151 2152 2153 2161 2162 2163 2164
## 21 33 2 1173 3 349 138 21 85 8 92 35 35 188 9 4
## 2171 2172 2173 2174 2175 2211 2212 2221 2222 2223 2231 2232 2233 2241 2242 2251
## 3 225 27 8 5 2 14 27 165 36 81 134 4 65 157 37
## 2252 2253 2254 2261 2262 2263 2271 2272 2281 2311 2312 2321 2322 2331 2332 2333
## 91 235 30 328 40 395 751 38 21 105 83 844 590 1074 1960 27
## 2334 2335 2339 2341 2342 2343 2391 2411 2412 2413 2421 2422 2423 2424 2425 2426
## 1 861 3 13 22 83 3 658 316 342 6 24 81 12 15 1045
## 2427 2428 2511 2512 2513 2514 2521 2522 2523 2524 2531 2532 2533 2541 2542 2543
## 109 47 635 1202 67 63 102 13 37 37 317 117 26 25 24 203
## 2544 2551 2552 2553 2561 2562 2563 2611 2612 2613 2614 2621 2622 2623 2624 2625
## 82 31 44 4 11 40 8 20 93 25 49 64 70 13 55 100
## 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645
## 163 68 1639 424 512 13 36 270 88 5 105 151 833 333 365 45
## 2646 2651 2652 2653 2654 2655 2661 2662 2711 2712 2713 2714 2715 2716 2811 2812
## 198 529 25 63 64 128 5 32 262 107 11 81 270 197 337 150
## 2813 2814 2815 2816 2817 2821 2822 2823 2824 2825 2826 2827 2991 2992 3101 3111
## 69 16 49 10 9 348 86 156 2 95 24 97 4 3 349 1780
## 3112 3113 3114 3115 3121 3122 3131 3132 3141 3142 3201 3211 3212 3213 3221 3222
## 12 464 35 3826 1992 382 233 1558 46 69 56 994 232 128 63 53
## 3231 3232 3999 4111 4201 4211 4212 4213 4214 4221 4222 4223 4224 4231 4232 4233
## 82 14 4 7067 762 8514 339 100 564 997 621 286 1365 347 30 2
## 4311 4312 4999 5101 5111 5112 5113 5114 5115 5116 5201 5211 5212 5213 5221 5222
## 47 105 5 326 1144 977 38 2425 151 1253 41 1144 420 66 242 589
## 5231 5241 5242 5251 5252 5253 5254 5301 5311 5312 5313 5314 5401 5411 5412 5413
## 11 354 328 10 1 24 75 99 57 803 2391 113 10 3 53 170
## 5999 6101 6111 6112 6113 6114 6115 6116 6117 6121 6122 6123 6124 6125 6126 6127
## 2 143 2733 531 181 569 22 314 133 628 106 131 214 39 23 11
## 6128 6129 6131 6201 6211 6212 6213 6221 6222 6223 6224 6225 6226 6227 6231 6311
## 12 3 337 11 312 30 26 15 2 136 14 29 8 29 5 159
## 6999 7101 7111 7112 7113 7121 7122 7123 7131 7132 7133 7134 7135 7201 7211 7212
## 2 452 32 41 13 4078 9 37 114 114 256 594 535 23 127 360
## 7213 7214 7221 7222 7223 7301 7311 7312 7313 7321 7322 7323 7331 7332 7341 7342
## 493 32 955 86 47 25 945 46 43 82 182 4 42 188 874 119
## 7343 7344 7351 7352 7353 7401 7411 7412 7501 7511 7512 7513 7514 7515 7516 7517
## 171 5 114 49 255 7 233 67 59 682 111 2253 66 366 7 78
## 7601 7611 7612 7613 7614 7999 8101 8111 8112 8113 8114 8121 8122 8123 8131 8132
## 5 72 285 63 13 29 887 52 32 67 17 104 124 818 108 37
## 8133 8134 8135 8141 8142 8143 8144 8145 8151 8152 8153 8154 8155 8161 8162 8163
## 1052 19 22 57 53 36 10 114 69 60 761 341 101 251 4 110
## 8171 8172 8173 8181 8199 8201 8211 8212 8301 8311 8321 8322 8323 8324 8331 8341
## 36 73 52 20 35 692 1091 1663 65 16 17 12 4 13 11 2901
## 8342 8343 8344 8349 8351 8352 8999 9111 9112 9113 9121 9122 9124 9211 9212 9221
## 3135 138 792 1 224 630 55 3418 473 76 54 29 30 35 22 3689
## 9222 9231 9232 9233 9234 9235 9236 9237 9239 9311 9312 9321 9322 9331 9332 9411
## 95 1477 390 374 545 400 1028 121 6 73 6 20 2 767 2 1360
## 9511 9512 9521 9601 9611 9621 9622 9623 9624 9631 9632 9633 9641 9642 9643 9651
## 30 1560 2075 164 4656 2893 195 235 23 359 82 73 207 52 136 38
## 9661 9662 9663 9711 9712 9713 9721 9722 9723 9731 9899 9999
## 389 107 5 43 116 59 21 243 100 37 10 162
peao$p3<-as.numeric(as.character(peao$p3))
peao$ciencias<-ifelse(peao$p3>=2200 & peao$p3<2300,1,0)
table(peao$ciencias)
##
## 0 1
## 145185 2651
Seleccionamis solo los que pertenecen a ciencias, y tambien a los menores de 59 anios, edad de retiro…
ciencias<-peao[which(peao$ciencias==1),]
remove(peao)
ciencias<-ciencias[which(ciencias$eda.x<=59),]
Now, we are going to select some columns (we dont them all)
vector<-c("cd_a","ent","con","v_sel","tipo","mes_cal","n_hog","h_mud",
"n_ren","upm.x","fac_tri.x","est_d_tri","eda.x","sex"
,"pos_ocu","e_con","salario","seg_soc","hrsocup","ingocup",
"ing_x_hrs","emp_ppal","p3","p3a","p3m4","p3t_anio",
"p3t_mes")
ciencias<-ciencias[vector]
remove(vector)
And now, lets merge coe2
coe2<-read.csv("ENOEN_COE2T121.csv")
ciencias<-merge(ciencias,coe2,by=c("cd_a","ent","con","v_sel","tipo",
"mes_cal", "n_hog","h_mud","n_ren"))
remove(coe2)
We are going to construct some variables we will need to our private pension plan:
First: Population by salary in times the minimum wage:
ciencias$salariom<-0
ciencias$salariom<-ifelse(ciencias$p6b1==1,ciencias$p6b2,ciencias$salariom)
ciencias$salariom<-ifelse(ciencias$p6b1==2,ciencias$p6b2*2,ciencias$salariom)
ciencias$salariom<-ifelse(ciencias$p6b1==3,(ciencias$p6b2/7)*30,ciencias$salariom)
ciencias$salariom<-ifelse(ciencias$p6b1==4,ciencias$p6b2*30,ciencias$salariom)
ciencias$salariom<-ifelse(ciencias$p6b1>=5,-1,ciencias$salariom)
ciencias$salariom<-ifelse(ciencias$p6b2==999999,-1,ciencias$salariom)
ciencias<-ciencias[which(ciencias$salariom>0),]
ciencias$salariom<-ciencias$salariom / ciencias$salario
ciencias$tsm<--1
ciencias$tsm<-ifelse(ciencias$salariom>=0 & ciencias$salariom<1,0,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=1 & ciencias$salariom<2,1,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=2 & ciencias$salariom<3,2,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=3 & ciencias$salariom<4,3,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=4 & ciencias$salariom<5,4,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=5 & ciencias$salariom<6,5,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=6 & ciencias$salariom<7,6,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=7 & ciencias$salariom<8,7,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=8 & ciencias$salariom<9,8,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=9 & ciencias$salariom<10,9,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=10 & ciencias$salariom<15,10,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=15 & ciencias$salariom<20,15,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=20 & ciencias$salariom<25,20,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=25,25,ciencias$tsm)
table(ciencias$tsm)
##
## 0 1 2 3 4 5 6 7 8 9 10 15 20 25
## 25 73 132 166 201 132 104 100 89 68 168 62 27 42
And now, for age groups:
ciencias$gedad<-0
ciencias$gedad<-ifelse(ciencias$eda<=20,20,ciencias$gedad)
ciencias$gedad<-ifelse(ciencias$eda>=21 & ciencias$eda<=25,25,ciencias$gedad)
ciencias$gedad<-ifelse(ciencias$eda>=26 & ciencias$eda<=30,30,ciencias$gedad)
ciencias$gedad<-ifelse(ciencias$eda>=31 & ciencias$eda<=35,35,ciencias$gedad)
ciencias$gedad<-ifelse(ciencias$eda>=36 & ciencias$eda<=40,40,ciencias$gedad)
ciencias$gedad<-ifelse(ciencias$eda>=41 & ciencias$eda<=45,45,ciencias$gedad)
ciencias$gedad<-ifelse(ciencias$eda>=46 & ciencias$eda<=50,50,ciencias$gedad)
ciencias$gedad<-ifelse(ciencias$eda>=51 & ciencias$eda<=55,55,ciencias$gedad)
ciencias$gedad<-ifelse(ciencias$eda>=56 & ciencias$eda<=60,60,ciencias$gedad)
ciencias$gedad<-ifelse(ciencias$eda>=61 & ciencias$eda<=65,65,ciencias$gedad)
ciencias$gedad<-ifelse(ciencias$eda>=66,66,ciencias$gedad)
table(ciencias$gedad)
##
## 25 30 35 40 45 50 55 60
## 199 294 301 178 159 117 86 55
Since we created new variables and now we have a new dataset we have to define the design of the survey again.
We can explore some option:
design<-svydesign(id=~upm.x,weights=~fac_tri.x,strata=~est_d_tri,
data=ciencias,lonely.psu="certainty")
options(survey.lonely.psu="adjust")
svytable(~sex, design=design)
## sex
## Men Women
## 301020 80984
svytable(~gedad+sex, design=design)
## sex
## gedad Men Women
## 25 44140 15099
## 30 54392 23865
## 35 56760 21266
## 40 42205 6912
## 45 37565 5477
## 50 30863 5818
## 55 19118 2328
## 60 15977 219
svychisq(~gedad+sex, design=design)
##
## Pearson's X^2: Rao & Scott adjustment
##
## data: svychisq(~gedad + sex, design = design)
## F = 3.6408, ndf = 5.6789, ddf = 5213.2474, p-value = 0.001659
svychisq(~tsm+sex, design=design)
##
## Pearson's X^2: Rao & Scott adjustment
##
## data: svychisq(~tsm + sex, design = design)
## F = 2.065, ndf = 9.4485, ddf = 8673.7399, p-value = 0.02657
svymean(~eda,design=design)
## mean SE
## eda 35.935 0.4902
svymean(~salariom,design=design)
## mean SE
## salariom 8.3299 0.4479
model<-svyglm(eda~sex,family="gaussian",design=design)
summary(model)
##
## Call:
## svyglm(formula = eda ~ sex, design = design, family = "gaussian")
##
## Survey design:
## svydesign(id = ~upm.x, weights = ~fac_tri.x, strata = ~est_d_tri,
## data = ciencias, lonely.psu = "certainty")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.8510 0.5889 62.579 < 2e-16 ***
## sexWomen -4.3198 0.9154 -4.719 2.74e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 91.93228)
##
## Number of Fisher Scoring iterations: 2
model<-svyglm(salariom~sex,family="gaussian",design=design)
summary(model)
##
## Call:
## svyglm(formula = salariom ~ sex, design = design, family = "gaussian")
##
## Survey design:
## svydesign(id = ~upm.x, weights = ~fac_tri.x, strata = ~est_d_tri,
## data = ciencias, lonely.psu = "certainty")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9408 0.5404 16.544 < 2e-16 ***
## sexWomen -2.8816 0.6805 -4.234 2.52e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 125.8619)
##
## Number of Fisher Scoring iterations: 2