LAT4072_ENOE

R Markdown

In this session, we will learn how to use the ENOE, a really important survey collecting information on employment and occupations.

This survey is the one you will have to use for your final project.

setwd("~/Downloads/enoe_n_2021_trim1_csv")
sdem<-read.csv("ENOEN_SDEMT121.csv")

Remember, my advice is to always start with a basic analisis of the dataset. We can use the “names” command to explore the variables in our dataset

names(sdem)

##   [1] "r_def"      "loc"        "mun"        "est"        "est_d_tri" 
##   [6] "est_d_men"  "ageb"       "t_loc_tri"  "t_loc_men"  "cd_a"      
##  [11] "ent"        "con"        "upm"        "d_sem"      "n_pro_viv" 
##  [16] "v_sel"      "n_hog"      "h_mud"      "n_ent"      "per"       
##  [21] "n_ren"      "c_res"      "par_c"      "sex"        "eda"       
##  [26] "nac_dia"    "nac_mes"    "nac_anio"   "l_nac_c"    "cs_p12"    
##  [31] "cs_p13_1"   "cs_p13_2"   "cs_p14_c"   "cs_p15"     "cs_p16"    
##  [36] "cs_p17"     "n_hij"      "e_con"      "cs_ad_mot"  "cs_p20_des"
##  [41] "cs_ad_des"  "cs_nr_mot"  "cs_p22_des" "cs_nr_ori"  "ur"        
##  [46] "zona"       "salario"    "fac_tri"    "fac_men"    "clase1"    
##  [51] "clase2"     "clase3"     "pos_ocu"    "seg_soc"    "rama"      
##  [56] "c_ocu11c"   "ing7c"      "dur9c"      "emple7c"    "medica5c"  
##  [61] "buscar5c"   "rama_est1"  "rama_est2"  "dur_est"    "ambito1"   
##  [66] "ambito2"    "tue1"       "tue2"       "tue3"       "busqueda"  
##  [71] "d_ant_lab"  "d_cexp_est" "dur_des"    "sub_o"      "s_clasifi" 
##  [76] "remune2c"   "pre_asa"    "tip_con"    "dispo"      "nodispo"   
##  [81] "c_inac5c"   "pnea_est"   "niv_ins"    "eda5c"      "eda7c"     
##  [86] "eda12c"     "eda19c"     "hij5c"      "domestico"  "anios_esc" 
##  [91] "hrsocup"    "ingocup"    "ing_x_hrs"  "tpg_p8a"    "tcco"      
##  [96] "cp_anoc"    "imssissste" "ma48me1sm"  "p14apoyos"  "scian"     
## [101] "t_tra"      "emp_ppal"   "tue_ppal"   "trans_ppal" "mh_fil2"   
## [106] "mh_col"     "sec_ins"    "tipo"       "mes_cal"    "ca"

But, what is the meaning of all these variables. INEGI is really good at providing information. We have several options, but the things we should look at are: - File descriptor - Questionnaire - Additional resources

If we go to the file discriptor, we will find that the dataset is split into different files. The one I openned was the sociodemographic file. It cointains many of the basic variables of interest.

The first variables we see are related to the methodology. And are used to control for the survey design (we will talk about this later), but we will also find variables that will allow us to merge different datasets.

We can start with some basic procedures:

Summary of age
Tabulate sex
Cross tab age and sex

summary(sdem$eda)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   16.00   31.00   33.57   49.00   99.00    6876

table(sdem$sex)

## 
##      1      2 
## 165625 178227

table(sdem$eda, sdem$sex)

##     
##         1    2
##   0  1658 1702
##   1  2053 1994
##   2  2402 2211
##   3  2444 2415
##   4  2577 2544
##   5  2676 2532
##   6  2744 2683
##   7  2814 2671
##   8  2873 2807
##   9  2964 2765
##   10 2994 2880
##   11 2991 2906
##   12 3125 2850
##   13 3043 2840
##   14 3008 2858
##   15 3113 2919
##   16 3028 2998
##   17 2982 2901
##   18 3102 3057
##   19 3088 2935
##   20 3069 3054
##   21 2916 2920
##   22 2851 2841
##   23 2729 2782
##   24 2685 2699
##   25 2673 2783
##   26 2548 2832
##   27 2497 2638
##   28 2551 2666
##   29 2288 2436
##   30 2679 2867
##   31 2273 2533
##   32 2206 2544
##   33 2285 2530
##   34 2160 2425
##   35 2215 2615
##   36 2275 2522
##   37 2039 2391
##   38 2212 2447
##   39 2133 2461
##   40 2413 2766
##   41 2265 2413
##   42 2068 2445
##   43 2172 2538
##   44 1975 2331
##   45 2211 2509
##   46 2275 2507
##   47 2080 2443
##   48 2111 2469
##   49 2036 2451
##   50 2280 2545
##   51 2033 2267
##   52 1849 2248
##   53 1823 2198
##   54 1807 2043
##   55 1809 2102
##   56 1738 2096
##   57 1586 1927
##   58 1506 1767
##   59 1391 1728
##   60 1678 1992
##   61 1482 1736
##   62 1255 1432
##   63 1330 1596
##   64 1257 1439
##   65 1230 1443
##   66 1144 1324
##   67  936 1132
##   68  969 1199
##   69  834 1038
##   70  931 1107
##   71  766  953
##   72  714  864
##   73  675  906
##   74  595  715
##   75  661  818
##   76  594  707
##   77  454  581
##   78  505  590
##   79  399  474
##   80  470  527
##   81  337  436
##   82  243  337
##   83  249  341
##   84  225  337
##   85  240  309
##   86  179  262
##   87  146  234
##   88  126  184
##   89   85  160
##   90   97  166
##   91   62  105
##   92   50   72
##   93   62   82
##   94   42   59
##   95   26   55
##   96   20   48
##   97   56  109
##   98   90   97
##   99   20   14

Now, as I told you yesterday, this is the survey that we use to measure labor force participation. This information in in the “clase1” variable:

table(sdem$clase1)

## 
##      0      1      2 
##  68567 155794 126367

Go to the file descriptor and find out the meaning for 0,1,2…

We can then, restrict our analysis to those 12 years and older. We will also drop all observations which missing information, which are the members of the household that moved out.

sdem<-sdem[which(sdem$eda>=15 & sdem$eda<=97),]
sdem<-sdem[which(sdem$c_res!=2),]
table(sdem$clase1,sdem$sex)

##    
##         1     2
##   0   173   172
##   1 92483 62143
##   2 32493 77143

Finally, there are some problems with the data, those zeroes shoulndt be there, lets remove them from the dataset:

sdem<-sdem[which(sdem$clase1>0),]
table(sdem$clase1,sdem$sex)

##    
##         1     2
##   1 92483 62143
##   2 32493 77143

Now, the table we just created represent the surveyed population by gender and labor force status. But wait… shouldnt we have more people?

This is a survey, so we only have a sample of our population, we could estimate proportions, assuming that sampling was made following Single Random Sample (MAS in Spanish)

prop.table(table(sdem$clase1,sdem$sex))

##    
##             1         2
##   1 0.3499671 0.2351568
##   2 0.1229575 0.2919186

This command is providing the proportions from the total, but what if we just want propirtions by row or column? We can fix that by adding the “margins” option.

prop.table(table(sdem$clase1,sdem$sex),margin=1) # by row

##    
##             1         2
##   1 0.5981077 0.4018923
##   2 0.2963716 0.7036284

prop.table(table(sdem$clase1,sdem$sex),margin=2) # by column

##    
##             1         2
##   1 0.7400061 0.4461540
##   2 0.2599939 0.5538460

And if we want percentages?

prop.table(table(sdem$clase1,sdem$sex),margin=1)*100 # by row

##    
##            1        2
##   1 59.81077 40.18923
##   2 29.63716 70.36284

prop.table(table(sdem$clase1,sdem$sex),margin=2)*100 # by column

##    
##            1        2
##   1 74.00061 44.61540
##   2 25.99939 55.38460

Now, is there a difference in this distribution? Here, we probably wont need any statistical test to see if the differences are significant, but what if we really want to see that?

chisq.test(table(sdem$clase1,sdem$sex))

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(sdem$clase1, sdem$sex)
## X-squared = 23430, df = 1, p-value < 2.2e-16

We can explore more variables, now we can go with clase 2:

prop.table(table(sdem$clase2,sdem$sex),margin=2)*100

##    
##             1         2
##   1 70.644764 42.751605
##   2  3.355844  1.863791
##   3  6.937332  9.114340
##   4 19.062060 46.270264

This distribution is missleading, we usually want the PEA and PNEA, and from those, indentify the percentages in occupation or not, as well as those who are availale to work and thos who arent.

The best way to do this is to create subsets and then get the proportions:

pea<- sdem[which(sdem$clase1==1),]
prop.table(table(pea$clase2,pea$sex),margin=2)*100

##    
##             1         2
##   1 95.465113 95.822538
##   2  4.534887  4.177462

chisq.test(table(pea$clase2,pea$sex))

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(pea$clase2, pea$sex)
## X-squared = 11.225, df = 1, p-value = 0.0008071

remove(pea)

pnea<- sdem[which(sdem$clase1==2),]
prop.table(table(pnea$clase2,pnea$sex),margin=2)*100

##    
##            1        2
##   3 26.68267 16.45645
##   4 73.31733 83.54355

chisq.test(table(pnea$clase2,pnea$sex))

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(pnea$clase2, pnea$sex)
## X-squared = 1523.2, df = 1, p-value < 2.2e-16

remove(pnea)

Now, lets explore the population by condition of occupation. We have some basic groups: - Trabajadores subordinados y remunerados (paid employees) - Empleadore (employers) - Trabajadores por cuenta propia (independent workers) - Trabajadores sin pago (unpaid workers)

peao<-sdem[which(sdem$clase2==1),]
table(peao$pos_ocu)

## 
##      1      2      3      4 
## 104674   7284  30775   5103

But, what if we want the names of the categories instead if numbers?

peao$pos_ocu<-factor(peao$pos_ocu, 
            levels=c(1,2,3,4),
            labels=c("Employees","Employers","Independent","Unpaid"))
table(peao$pos_ocu)

## 
##   Employees   Employers Independent      Unpaid 
##      104674        7284       30775        5103

table(peao$pos_ocu,peao$sex)

##              
##                   1     2
##   Employees   62752 41922
##   Employers    5666  1618
##   Independent 17798 12977
##   Unpaid       2073  3030

Now, we can do the same with sex:

peao$sex<-factor(peao$sex,
                 levels=c(1,2),
                 labels=c("Men","Women"))
table(peao$pos_ocu,peao$sex)

##              
##                 Men Women
##   Employees   62752 41922
##   Employers    5666  1618
##   Independent 17798 12977
##   Unpaid       2073  3030

prop.table(table(peao$pos_ocu,peao$sex),margin=2)*100

##              
##                     Men     Women
##   Employees   71.075672 70.401532
##   Employers    6.417561  2.717181
##   Independent 20.158797 21.792869
##   Unpaid       2.347971  5.088418

chisq.test(table(peao$pos_ocu,peao$sex))

## 
##  Pearson's Chi-squared test
## 
## data:  table(peao$pos_ocu, peao$sex)
## X-squared = 1809.9, df = 3, p-value < 2.2e-16

Now, we can explore access to Social Security:

peao$seg_soc<-factor(peao$seg_soc,
                     levels=c(1,2,3),
                     labels=c("With","Without","Unsp"))
prop.table(table(peao$seg_soc,peao$sex),margin=2)*100

##          
##                  Men      Women
##   With    42.6825539 44.5614389
##   Without 56.5902887 54.8222413
##   Unsp     0.7271574  0.6163199

Now, explore income level:

peao$ing7c<-factor(peao$ing7c,
                   levels=c(1,2,3,4,5,6,7),
                   labels=c("1SM","1to2SM","2to3SM","3to5SM",
                            "More5SM","NoIncome","Unsp"))
prop.table(table(peao$ing7c,peao$sex),margin=2)*100

##           
##                  Men     Women
##   1SM      18.579891 30.669891
##   1to2SM   37.986612 34.512234
##   2to3SM   15.803781 10.862008
##   3to5SM    7.655540  5.362151
##   More5SM   2.923354  1.649118
##   NoIncome  3.480615  5.180782
##   Unsp     13.570207 11.763817

The survey also has information in continuous version:

summary(peao$hrsocup)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   30.00   44.00   40.28   50.00  168.00

summary(peao$ingocup)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0    4500    5425    7740  245100

summary(peao$ing_x_hrs)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00   24.92   34.00   41.67 6201.55

If we want to obtain indicators by gender, we have to use a library… well… we don’t have to… but it is the easier way:

library(doBy)

## Registered S3 methods overwritten by 'tibble':
##   method     from  
##   format.tbl pillar
##   print.tbl  pillar

summaryBy(hrsocup~sex,data=peao,FUN=mean)

##     sex hrsocup.mean
## 1   Men     43.33586
## 2 Women     35.75433

summaryBy(ingocup~sex,data=peao,FUN=mean)

##     sex ingocup.mean
## 1   Men     5901.897
## 2 Women     4718.569

summaryBy(ing_x_hrs~sex,data=peao,FUN=mean)

##     sex ing_x_hrs.mean
## 1   Men       33.79477
## 2 Women       34.30119

And if we want to see if these values are in fact, different:

t.test(hrsocup~sex,data=peao)

## 
##  Welch Two Sample t-test
## 
## data:  hrsocup by sex
## t = 76.248, df = 127292, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  7.386649 7.776420
## sample estimates:
##   mean in group Men mean in group Women 
##            43.33586            35.75433

t.test(ingocup~sex,data=peao)

## 
##  Welch Two Sample t-test
## 
## data:  ingocup by sex
## t = 35.636, df = 144053, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1118.245 1248.411
## sample estimates:
##   mean in group Men mean in group Women 
##            5901.897            4718.569

t.test(ing_x_hrs~sex,data=peao)

## 
##  Welch Two Sample t-test
## 
## data:  ing_x_hrs by sex
## t = -1.4243, df = 129399, p-value = 0.1544
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.2033296  0.1904894
## sample estimates:
##   mean in group Men mean in group Women 
##            33.79477            34.30119

Now, this is a survey, a representative survey… which means that we should be capable to estimate information for the country.

If we just go with a table:

table(peao$pos_ocu,peao$sex)

##              
##                 Men Women
##   Employees   62752 41922
##   Employers    5666  1618
##   Independent 17798 12977
##   Unpaid       2073  3030

We are obtaining information for the sample, we can’t assume this is the same for the country. For instance, if we estimated the proportions, can we say these proportions are the same at the national level?

If were on the context of SRS, we could use these percentages. H However, INEGI uses a more complex sampling approach.

The first key varible is factor.But we need a new library:

library(questionr)
wtd.table(peao$pos_ocu,peao$sex,weights=peao$fac_tri)

##                  Men    Women
## Employees   22421744 13905588
## Employers    2046667   544798
## Independent  7264378  4680928
## Unpaid        917523  1157894

We have some problems to perform, basi t-tests in weighted mode.

We see this better if we use some more interesting approach.

Lets fit a linear regression to income per hour, we will use as predictors the variables sex, eda, pos_ocu… The first model is easy:

model1<-lm(ing_x_hrs~sex+eda+pos_ocu,data=peao)
summary(model1)

## 
## Call:
## lm(formula = ing_x_hrs ~ sex + eda + pos_ocu, data = peao)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -53.1  -32.9   -8.5    7.0 6166.5 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         31.17645    0.55258  56.420  < 2e-16 ***
## sexWomen             2.01327    0.35700   5.639 1.71e-08 ***
## eda                  0.05898    0.01304   4.524 6.07e-06 ***
## pos_ocuEmployers    14.81179    0.82195  18.020  < 2e-16 ***
## pos_ocuIndependent   0.57878    0.44827   1.291    0.197    
## pos_ocuUnpaid      -34.35715    0.96309 -35.674  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.91 on 147830 degrees of freedom
## Multiple R-squared:  0.01176,    Adjusted R-squared:  0.01172 
## F-statistic: 351.8 on 5 and 147830 DF,  p-value: < 2.2e-16

Now, we will use the weighted approach:

model1wt<-lm(ing_x_hrs~sex+eda+pos_ocu,data=peao,
             weights=fac_tri)
summary(model1wt)

## 
## Call:
## lm(formula = ing_x_hrs ~ sex + eda + pos_ocu, data = peao, weights = fac_tri)
## 
## Weighted Residuals:
##    Min     1Q Median     3Q    Max 
##  -6125   -339    -72    137 174854 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         29.56213    0.50534  58.499  < 2e-16 ***
## sexWomen             3.15277    0.33008   9.552  < 2e-16 ***
## eda                  0.02211    0.01193   1.853 0.063845 .  
## pos_ocuEmployers    12.92165    0.75804  17.046  < 2e-16 ***
## pos_ocuIndependent  -1.37278    0.40113  -3.422 0.000621 ***
## pos_ocuUnpaid      -32.04295    0.83270 -38.481  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1161 on 147830 degrees of freedom
## Multiple R-squared:  0.01281,    Adjusted R-squared:  0.01278 
## F-statistic: 383.7 on 5 and 147830 DF,  p-value: < 2.2e-16

Now, the most appopriate way to do this:

library(survey)

## Loading required package: grid

## Loading required package: Matrix

## Loading required package: survival

## 
## Attaching package: 'survey'

## The following object is masked from 'package:graphics':
## 
##     dotchart

design<-svydesign(id=~upm,weights=~fac_tri,strata=~est_d_tri,data=peao,
                  lonely.psu="certainty")
options(survey.lonely.psu="adjust")
model1d<-svyglm(ing_x_hrs~sex+eda+pos_ocu,design=design,family=gaussian)
summary(model1d)

## 
## Call:
## svyglm(formula = ing_x_hrs ~ sex + eda + pos_ocu, design = design, 
##     family = gaussian)
## 
## Survey design:
## svydesign(id = ~upm, weights = ~fac_tri, strata = ~est_d_tri, 
##     data = peao, lonely.psu = "certainty")
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         29.56213    0.71644  41.262  < 2e-16 ***
## sexWomen             3.15277    0.58766   5.365  8.2e-08 ***
## eda                  0.02211    0.01885   1.173   0.2408    
## pos_ocuEmployers    12.92165    1.44152   8.964  < 2e-16 ***
## pos_ocuIndependent  -1.37278    0.81272  -1.689   0.0912 .  
## pos_ocuUnpaid      -32.04295    0.38757 -82.676  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 3763.835)
## 
## Number of Fisher Scoring iterations: 2

Now, we are going to merge the coe1 with the peao… For this, we need to indicate R the variables that identify each observation:

coe1<-read.csv("ENOEN_COE1T121.csv")
peao<-merge(peao,coe1,by=c("cd_a","ent","con","v_sel","tipo","mes_cal",
                           "n_hog","h_mud","n_ren"))
remove(sdem,coe1)
names(peao)

##   [1] "cd_a"        "ent"         "con"         "v_sel"       "tipo"       
##   [6] "mes_cal"     "n_hog"       "h_mud"       "n_ren"       "r_def.x"    
##  [11] "loc"         "mun"         "est"         "est_d_tri"   "est_d_men"  
##  [16] "ageb"        "t_loc_tri"   "t_loc_men"   "upm.x"       "d_sem.x"    
##  [21] "n_pro_viv.x" "n_ent.x"     "per.x"       "c_res"       "par_c"      
##  [26] "sex"         "eda.x"       "nac_dia"     "nac_mes"     "nac_anio"   
##  [31] "l_nac_c"     "cs_p12"      "cs_p13_1"    "cs_p13_2"    "cs_p14_c"   
##  [36] "cs_p15"      "cs_p16"      "cs_p17"      "n_hij"       "e_con"      
##  [41] "cs_ad_mot"   "cs_p20_des"  "cs_ad_des"   "cs_nr_mot"   "cs_p22_des" 
##  [46] "cs_nr_ori"   "ur.x"        "zona"        "salario"     "fac_tri.x"  
##  [51] "fac_men.x"   "clase1"      "clase2"      "clase3"      "pos_ocu"    
##  [56] "seg_soc"     "rama"        "c_ocu11c"    "ing7c"       "dur9c"      
##  [61] "emple7c"     "medica5c"    "buscar5c"    "rama_est1"   "rama_est2"  
##  [66] "dur_est"     "ambito1"     "ambito2"     "tue1"        "tue2"       
##  [71] "tue3"        "busqueda"    "d_ant_lab"   "d_cexp_est"  "dur_des"    
##  [76] "sub_o"       "s_clasifi"   "remune2c"    "pre_asa"     "tip_con"    
##  [81] "dispo"       "nodispo"     "c_inac5c"    "pnea_est"    "niv_ins"    
##  [86] "eda5c"       "eda7c"       "eda12c"      "eda19c"      "hij5c"      
##  [91] "domestico"   "anios_esc"   "hrsocup"     "ingocup"     "ing_x_hrs"  
##  [96] "tpg_p8a"     "tcco"        "cp_anoc"     "imssissste"  "ma48me1sm"  
## [101] "p14apoyos"   "scian"       "t_tra"       "emp_ppal"    "tue_ppal"   
## [106] "trans_ppal"  "mh_fil2"     "mh_col"      "sec_ins"     "ca.x"       
## [111] "r_def.y"     "upm.y"       "d_sem.y"     "n_pro_viv.y" "n_ent.y"    
## [116] "per.y"       "eda.y"       "n_inf"       "p1"          "p1a1"       
## [121] "p1a2"        "p1a3"        "p1b"         "p1c"         "p1d"        
## [126] "p1e"         "p2_1"        "p2_2"        "p2_3"        "p2_4"       
## [131] "p2_9"        "p2a_dia"     "p2a_sem"     "p2a_mes"     "p2a_anio"   
## [136] "p2b_dia"     "p2b_sem"     "p2b_mes"     "p2b_anio"    "p2b"        
## [141] "p2c"         "p2d1"        "p2d2"        "p2d3"        "p2d4"       
## [146] "p2d5"        "p2d6"        "p2d7"        "p2d8"        "p2d9"       
## [151] "p2d10"       "p2d11"       "p2d99"       "p2e"         "p2f"        
## [156] "p2g1"        "p2g2"        "p2h1"        "p2h2"        "p2h3"       
## [161] "p2h4"        "p2h9"        "p3"          "p3a"         "p3b"        
## [166] "p3c1"        "p3c2"        "p3c3"        "p3c4"        "p3c9"       
## [171] "p3d"         "p3e"         "p3f1"        "p3f2"        "p3g1_1"     
## [176] "p3g1_2"      "p3g2_1"      "p3g2_2"      "p3g3_1"      "p3g3_2"     
## [181] "p3g4_1"      "p3g4_2"      "p3g9"        "p3g_tot"     "p3h"        
## [186] "p3i"         "p3j"         "p3k1"        "p3k2"        "p3l1"       
## [191] "p3l2"        "p3l3"        "p3l4"        "p3l5"        "p3l9"       
## [196] "p3m1"        "p3m2"        "p3m3"        "p3m4"        "p3m5"       
## [201] "p3m6"        "p3m7"        "p3m8"        "p3m9"        "p3n"        
## [206] "p3o"         "p3p1"        "p3p2"        "p3q"         "p3r_anio"   
## [211] "p3r_mes"     "p3r"         "p3s"         "p3t_anio"    "p3t_mes"    
## [216] "p4"          "p4_1"        "p4_2"        "p4_3"        "p4a"        
## [221] "p4a_1"       "p4b"         "p4c"         "p4d1"        "p4d2"       
## [226] "p4d3"        "p4e"         "p4f"         "p4g"         "p4h"        
## [231] "p4i"         "p4i_1"       "p5"          "p5a"         "p5b"        
## [236] "p5c_hlu"     "p5c_mlu"     "p5c_hma"     "p5c_mma"     "p5c_hmi"    
## [241] "p5c_mmi"     "p5c_hju"     "p5c_mju"     "p5c_hvi"     "p5c_mvi"    
## [246] "p5c_hsa"     "p5c_msa"     "p5c_hdo"     "p5c_mdo"     "p5c_thrs"   
## [251] "p5c_tdia"    "p5d"         "p5e1"        "p5e_hlu"     "p5e_mlu"    
## [256] "p5e_hma"     "p5e_mma"     "p5e_hmi"     "p5e_mmi"     "p5e_hju"    
## [261] "p5e_mju"     "p5e_hvi"     "p5e_mvi"     "p5e_hsa"     "p5e_msa"    
## [266] "p5e_hdo"     "p5e_mdo"     "p5e_thrs"    "p5e_tdia"    "p5f"        
## [271] "p5g1"        "p5g2"        "p5g3"        "p5g4"        "p5g5"       
## [276] "p5g6"        "p5g7"        "p5g8"        "p5g9"        "p5g10"      
## [281] "p5g11"       "p5g12"       "p5g13"       "p5g14"       "p5g15"      
## [286] "p5g99"       "p5h"         "ur.y"        "ca.y"        "fac_tri.y"  
## [291] "fac_men.y"

Now, I will select the following occupations: - 22: Investigadores y profesionistas en ciencias exactas… - 221: Investigadores y profesionistas en fisica, matematicas, estadistica y actuaria - 222: Investigadores y profesionistas en ciencias biologicas, quimicas y del medio ambiente

table(peao$p3)

## 
## 1111 1112 1113 1121 1122 1131 1132 1133 1134 1135 1211 1212 1221 1222 1223 1224 
##    3    3   39   30  143   16   26    6    2    7  259  139   57   40  311   50 
## 1225 1226 1311 1312 1313 1314 1315 1321 1322 1323 1324 1411 1412 1421 1422 1423 
##    2   26   43    1   10  294  147   16   14   81    2  689  325   39   60   66 
## 1511 1512 1521 1522 1523 1524 1525 1526 1611 1612 1613 1614 1615 1619 1621 1622 
##  538  124  120   95  178  113    2   61   14   14   29  248  239    1   67   51 
## 1623 1624 1629 1711 1712 1721 1722 1723 1999 2111 2112 2113 2121 2122 2123 2131 
##  120   12    8  152   39    9   74   31    8  255  179    5 1282   71   18    7 
## 2132 2133 2134 2135 2141 2142 2143 2144 2145 2151 2152 2153 2161 2162 2163 2164 
##   21   33    2 1173    3  349  138   21   85    8   92   35   35  188    9    4 
## 2171 2172 2173 2174 2175 2211 2212 2221 2222 2223 2231 2232 2233 2241 2242 2251 
##    3  225   27    8    5    2   14   27  165   36   81  134    4   65  157   37 
## 2252 2253 2254 2261 2262 2263 2271 2272 2281 2311 2312 2321 2322 2331 2332 2333 
##   91  235   30  328   40  395  751   38   21  105   83  844  590 1074 1960   27 
## 2334 2335 2339 2341 2342 2343 2391 2411 2412 2413 2421 2422 2423 2424 2425 2426 
##    1  861    3   13   22   83    3  658  316  342    6   24   81   12   15 1045 
## 2427 2428 2511 2512 2513 2514 2521 2522 2523 2524 2531 2532 2533 2541 2542 2543 
##  109   47  635 1202   67   63  102   13   37   37  317  117   26   25   24  203 
## 2544 2551 2552 2553 2561 2562 2563 2611 2612 2613 2614 2621 2622 2623 2624 2625 
##   82   31   44    4   11   40    8   20   93   25   49   64   70   13   55  100 
## 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 
##  163   68 1639  424  512   13   36  270   88    5  105  151  833  333  365   45 
## 2646 2651 2652 2653 2654 2655 2661 2662 2711 2712 2713 2714 2715 2716 2811 2812 
##  198  529   25   63   64  128    5   32  262  107   11   81  270  197  337  150 
## 2813 2814 2815 2816 2817 2821 2822 2823 2824 2825 2826 2827 2991 2992 3101 3111 
##   69   16   49   10    9  348   86  156    2   95   24   97    4    3  349 1780 
## 3112 3113 3114 3115 3121 3122 3131 3132 3141 3142 3201 3211 3212 3213 3221 3222 
##   12  464   35 3826 1992  382  233 1558   46   69   56  994  232  128   63   53 
## 3231 3232 3999 4111 4201 4211 4212 4213 4214 4221 4222 4223 4224 4231 4232 4233 
##   82   14    4 7067  762 8514  339  100  564  997  621  286 1365  347   30    2 
## 4311 4312 4999 5101 5111 5112 5113 5114 5115 5116 5201 5211 5212 5213 5221 5222 
##   47  105    5  326 1144  977   38 2425  151 1253   41 1144  420   66  242  589 
## 5231 5241 5242 5251 5252 5253 5254 5301 5311 5312 5313 5314 5401 5411 5412 5413 
##   11  354  328   10    1   24   75   99   57  803 2391  113   10    3   53  170 
## 5999 6101 6111 6112 6113 6114 6115 6116 6117 6121 6122 6123 6124 6125 6126 6127 
##    2  143 2733  531  181  569   22  314  133  628  106  131  214   39   23   11 
## 6128 6129 6131 6201 6211 6212 6213 6221 6222 6223 6224 6225 6226 6227 6231 6311 
##   12    3  337   11  312   30   26   15    2  136   14   29    8   29    5  159 
## 6999 7101 7111 7112 7113 7121 7122 7123 7131 7132 7133 7134 7135 7201 7211 7212 
##    2  452   32   41   13 4078    9   37  114  114  256  594  535   23  127  360 
## 7213 7214 7221 7222 7223 7301 7311 7312 7313 7321 7322 7323 7331 7332 7341 7342 
##  493   32  955   86   47   25  945   46   43   82  182    4   42  188  874  119 
## 7343 7344 7351 7352 7353 7401 7411 7412 7501 7511 7512 7513 7514 7515 7516 7517 
##  171    5  114   49  255    7  233   67   59  682  111 2253   66  366    7   78 
## 7601 7611 7612 7613 7614 7999 8101 8111 8112 8113 8114 8121 8122 8123 8131 8132 
##    5   72  285   63   13   29  887   52   32   67   17  104  124  818  108   37 
## 8133 8134 8135 8141 8142 8143 8144 8145 8151 8152 8153 8154 8155 8161 8162 8163 
## 1052   19   22   57   53   36   10  114   69   60  761  341  101  251    4  110 
## 8171 8172 8173 8181 8199 8201 8211 8212 8301 8311 8321 8322 8323 8324 8331 8341 
##   36   73   52   20   35  692 1091 1663   65   16   17   12    4   13   11 2901 
## 8342 8343 8344 8349 8351 8352 8999 9111 9112 9113 9121 9122 9124 9211 9212 9221 
## 3135  138  792    1  224  630   55 3418  473   76   54   29   30   35   22 3689 
## 9222 9231 9232 9233 9234 9235 9236 9237 9239 9311 9312 9321 9322 9331 9332 9411 
##   95 1477  390  374  545  400 1028  121    6   73    6   20    2  767    2 1360 
## 9511 9512 9521 9601 9611 9621 9622 9623 9624 9631 9632 9633 9641 9642 9643 9651 
##   30 1560 2075  164 4656 2893  195  235   23  359   82   73  207   52  136   38 
## 9661 9662 9663 9711 9712 9713 9721 9722 9723 9731 9899 9999 
##  389  107    5   43  116   59   21  243  100   37   10  162

peao$p3<-as.numeric(as.character(peao$p3))
peao$ciencias<-ifelse(peao$p3>=2200 & peao$p3<2300,1,0)
table(peao$ciencias)

## 
##      0      1 
## 145185   2651

Seleccionamis solo los que pertenecen a ciencias, y tambien a los menores de 59 anios, edad de retiro…

ciencias<-peao[which(peao$ciencias==1),]
remove(peao)
ciencias<-ciencias[which(ciencias$eda.x<=59),]

Now, we are going to select some columns (we dont them all)

vector<-c("cd_a","ent","con","v_sel","tipo","mes_cal","n_hog","h_mud",
          "n_ren","upm.x","fac_tri.x","est_d_tri","eda.x","sex"
          ,"pos_ocu","e_con","salario","seg_soc","hrsocup","ingocup",
          "ing_x_hrs","emp_ppal","p3","p3a","p3m4","p3t_anio",
          "p3t_mes")
ciencias<-ciencias[vector]
remove(vector)

And now, lets merge coe2

coe2<-read.csv("ENOEN_COE2T121.csv")
ciencias<-merge(ciencias,coe2,by=c("cd_a","ent","con","v_sel","tipo",
                                   "mes_cal", "n_hog","h_mud","n_ren"))
remove(coe2)

We are going to construct some variables we will need to our private pension plan:

First: Population by salary in times the minimum wage:

ciencias$salariom<-0
ciencias$salariom<-ifelse(ciencias$p6b1==1,ciencias$p6b2,ciencias$salariom)
ciencias$salariom<-ifelse(ciencias$p6b1==2,ciencias$p6b2*2,ciencias$salariom)
ciencias$salariom<-ifelse(ciencias$p6b1==3,(ciencias$p6b2/7)*30,ciencias$salariom)
ciencias$salariom<-ifelse(ciencias$p6b1==4,ciencias$p6b2*30,ciencias$salariom)
ciencias$salariom<-ifelse(ciencias$p6b1>=5,-1,ciencias$salariom)
ciencias$salariom<-ifelse(ciencias$p6b2==999999,-1,ciencias$salariom)
ciencias<-ciencias[which(ciencias$salariom>0),]

ciencias$salariom<-ciencias$salariom / ciencias$salario

ciencias$tsm<--1
ciencias$tsm<-ifelse(ciencias$salariom>=0 & ciencias$salariom<1,0,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=1 & ciencias$salariom<2,1,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=2 & ciencias$salariom<3,2,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=3 & ciencias$salariom<4,3,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=4 & ciencias$salariom<5,4,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=5 & ciencias$salariom<6,5,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=6 & ciencias$salariom<7,6,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=7 & ciencias$salariom<8,7,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=8 & ciencias$salariom<9,8,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=9 & ciencias$salariom<10,9,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=10 & ciencias$salariom<15,10,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=15 & ciencias$salariom<20,15,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=20 &    ciencias$salariom<25,20,ciencias$tsm)
ciencias$tsm<-ifelse(ciencias$salariom>=25,25,ciencias$tsm)

table(ciencias$tsm)

## 
##   0   1   2   3   4   5   6   7   8   9  10  15  20  25 
##  25  73 132 166 201 132 104 100  89  68 168  62  27  42

And now, for age groups:

ciencias$gedad<-0
ciencias$gedad<-ifelse(ciencias$eda<=20,20,ciencias$gedad)
ciencias$gedad<-ifelse(ciencias$eda>=21 & ciencias$eda<=25,25,ciencias$gedad)
ciencias$gedad<-ifelse(ciencias$eda>=26 & ciencias$eda<=30,30,ciencias$gedad)
ciencias$gedad<-ifelse(ciencias$eda>=31 & ciencias$eda<=35,35,ciencias$gedad)
ciencias$gedad<-ifelse(ciencias$eda>=36 & ciencias$eda<=40,40,ciencias$gedad)
ciencias$gedad<-ifelse(ciencias$eda>=41 & ciencias$eda<=45,45,ciencias$gedad)
ciencias$gedad<-ifelse(ciencias$eda>=46 & ciencias$eda<=50,50,ciencias$gedad)
ciencias$gedad<-ifelse(ciencias$eda>=51 & ciencias$eda<=55,55,ciencias$gedad)
ciencias$gedad<-ifelse(ciencias$eda>=56 & ciencias$eda<=60,60,ciencias$gedad)
ciencias$gedad<-ifelse(ciencias$eda>=61 & ciencias$eda<=65,65,ciencias$gedad)
ciencias$gedad<-ifelse(ciencias$eda>=66,66,ciencias$gedad)

table(ciencias$gedad)

## 
##  25  30  35  40  45  50  55  60 
## 199 294 301 178 159 117  86  55

Since we created new variables and now we have a new dataset we have to define the design of the survey again.

We can explore some option:

design<-svydesign(id=~upm.x,weights=~fac_tri.x,strata=~est_d_tri,
                  data=ciencias,lonely.psu="certainty")
options(survey.lonely.psu="adjust")
svytable(~sex, design=design)

## sex
##    Men  Women 
## 301020  80984

svytable(~gedad+sex, design=design)

##      sex
## gedad   Men Women
##    25 44140 15099
##    30 54392 23865
##    35 56760 21266
##    40 42205  6912
##    45 37565  5477
##    50 30863  5818
##    55 19118  2328
##    60 15977   219

svychisq(~gedad+sex, design=design)

## 
##  Pearson's X^2: Rao & Scott adjustment
## 
## data:  svychisq(~gedad + sex, design = design)
## F = 3.6408, ndf = 5.6789, ddf = 5213.2474, p-value = 0.001659

svychisq(~tsm+sex, design=design)

## 
##  Pearson's X^2: Rao & Scott adjustment
## 
## data:  svychisq(~tsm + sex, design = design)
## F = 2.065, ndf = 9.4485, ddf = 8673.7399, p-value = 0.02657

svymean(~eda,design=design)

##       mean     SE
## eda 35.935 0.4902

svymean(~salariom,design=design)

##            mean     SE
## salariom 8.3299 0.4479

model<-svyglm(eda~sex,family="gaussian",design=design)
summary(model)

## 
## Call:
## svyglm(formula = eda ~ sex, design = design, family = "gaussian")
## 
## Survey design:
## svydesign(id = ~upm.x, weights = ~fac_tri.x, strata = ~est_d_tri, 
##     data = ciencias, lonely.psu = "certainty")
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.8510     0.5889  62.579  < 2e-16 ***
## sexWomen     -4.3198     0.9154  -4.719 2.74e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 91.93228)
## 
## Number of Fisher Scoring iterations: 2

model<-svyglm(salariom~sex,family="gaussian",design=design)
summary(model)

## 
## Call:
## svyglm(formula = salariom ~ sex, design = design, family = "gaussian")
## 
## Survey design:
## svydesign(id = ~upm.x, weights = ~fac_tri.x, strata = ~est_d_tri, 
##     data = ciencias, lonely.psu = "certainty")
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9408     0.5404  16.544  < 2e-16 ***
## sexWomen     -2.8816     0.6805  -4.234 2.52e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 125.8619)
## 
## Number of Fisher Scoring iterations: 2

LAT4072_ENOE

Mauricio Rodriguez Abreu

3/4/2022

R Markdown