About ENOE survey

The Encuesta Nacional de Ocupación y Empleo (ENOE), or National Survey of Occupation and Employment, is a survey conducted by Mexico’s National Institute of Statistics and Geography (INEGI) to gather data on the country’s labor market. It collects information on employment, unemployment, underemployment, labor informality, and other characteristics of the population aged 15 years and older.

The ENOE is a continuous survey, meaning it is conducted year-round. Data is collected monthly, allowing for the generation of quarterly and annual statistics.

Description of the database

enoe_raw <- read.csv("enoe_2024_trim2_csv/ENOE_SDEMT224.csv")
dim(enoe_raw)
## [1] 427123    114

For this final project, I will analyze the results of the ENOE survey collected in the second quarter of 2024. After accessing the publicly available microdata for the socio-demographic questionnaire ( https://www.inegi.org.mx/programas/enoe/15ymas/#microdatos ), we downloaded a dataset with 114 variables and 427,123 observations. Each observation correspond to one person interviewed. Data cleaning and variable selection will be crucial steps in my analysis to ensure meaningful results.

1. Description of variables and data cleaning

For this project, the purpose of my analysis will be to find the most important variables that affect the hourly wage. Using the survey’s documentation as guide ( https://www.inegi.org.mx/rnm/index.php/catalog/907/data-dictionary/F42), I have choose to work with 13 variables and valid observations:

r_def - Categorical - Interview results. 0 if successful

t_loc_trim - Categorical - Population size of a locality

ent - Categorical - State (Mexican administrative subdivisions)

sex - Categorical - Sex

Clase2 - Categorical - Classification of the employed and unemployed population; available and not available [for work].

seg_soc - Categorical - Access to social security

emp_ppal - Categorical - Classification of formal and informal employment

eda - Numerical - Age

anios_esc - Numerical - Year of schooling

n_hij - Numerical - Number of children (only for women)

hrsocup - Numerical - Weekly work hours

ingocup - Numerical - Monthly wage

ing_x_hrs - Numerical - Hourly wage

#First, we select the chosen variables
library(dplyr)
## 
## Adjuntando el paquete: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
enoe_clean <- select(enoe_raw, r_def, t_loc_tri, ent, sex, clase2, seg_soc, emp_ppal, eda, anios_esc, n_hij, hrsocup, ingocup, ing_x_hrs)

#Next, we chose only valid values according survey's documentation.

# 1. We take only valid interviews (r_def == 0)
# 2. We take only employed population (clase2 == 1)
# 3. We delete those values without information about social security access (seg_soc == 3)
# 4. We delete those values without information about age (eda >= 98)
# 5. We delete those values without information about schooling (anios_esc < 99)
# 6. We delete those values without information about number of children (n_hij == 99 or NA if there are men)
# 7. We delete those values without an hourly wage (ing_x_hrs == 0) 


enoe_clean <- filter(enoe_clean, 
                     r_def == 0,
                     clase2 == 1, 
                     seg_soc != 3, 
                     eda >=15 & eda < 98,
                     anios_esc != 99,
                     n_hij != 99 | is.na(n_hij),
                     ing_x_hrs != 0)

Summary of database

str(enoe_clean)
## 'data.frame':    126843 obs. of  13 variables:
##  $ r_def    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ t_loc_tri: int  1 1 1 1 1 1 1 2 1 1 ...
##  $ ent      : int  9 9 9 9 9 9 9 9 9 9 ...
##  $ sex      : int  2 2 2 1 1 1 2 1 1 1 ...
##  $ clase2   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ seg_soc  : int  1 2 2 1 1 2 1 2 2 2 ...
##  $ emp_ppal : int  2 1 1 2 2 1 2 1 1 1 ...
##  $ eda      : int  33 24 24 25 24 57 29 38 34 51 ...
##  $ anios_esc: int  13 6 17 15 12 9 16 11 16 9 ...
##  $ n_hij    : int  2 0 0 NA NA NA 0 NA NA NA ...
##  $ hrsocup  : int  30 66 30 27 54 40 54 30 48 60 ...
##  $ ingocup  : int  20000 5160 10000 24000 9000 3870 12000 8600 12000 9460 ...
##  $ ing_x_hrs: num  155 18.2 77.5 206.7 38.8 ...

After cleaning the dataset, we were left with 126,843 observations and 13 variables. Initially, the categorical variables are represented as integers due to survey coding practices. To improve clarity and facilitate analysis, we need convert these integer values into meaningful categorical factors.

mex_states <- c("Aguascalientes", "Baja California", "Baja California Sur", "Campeche", "Coahuila", "Colima", "Chiapas", "Chihuahua", "Ciudad de México", "Durango", "Guanajuato",  "Guerrero", "Hidalgo", "Jalisco", "Estado de México", "Michoacán", "Morelos", "Nayarit", "Nuevo León", "Oaxaca", "Puebla", "Querétaro", "Quintana Roo", "San Luis Potosí", "Sinaloa", "Sonora", "Tabasco", "Tamaulipas", "Tlaxcala", "Veracruz", "Yucatán", "Zacatecas")

enoe_clean <- mutate(enoe_clean, 
                     t_loc_tri = factor(t_loc_tri, levels = 1:4, labels = c("100k+ hab.", "15k-100 hab.", "2.5k-15k hab.", "<2.5k hab.")), 
                     ent = factor (ent, levels = 1:32, labels = mex_states), 
                     sex = factor (sex, levels = c(1, 2), labels = c("man", "woman")),
                     seg_soc = factor(seg_soc, levels = c(1, 2), labels = c("Yes", "No")), 
                     emp_ppal = factor(emp_ppal, levels = c(1, 2), labels = c("Formal", "Informal")))

str(enoe_clean)
## 'data.frame':    126843 obs. of  13 variables:
##  $ r_def    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ t_loc_tri: Factor w/ 4 levels "100k+ hab.","15k-100 hab.",..: 1 1 1 1 1 1 1 2 1 1 ...
##  $ ent      : Factor w/ 32 levels "Aguascalientes",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ sex      : Factor w/ 2 levels "man","woman": 2 2 2 1 1 1 2 1 1 1 ...
##  $ clase2   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ seg_soc  : Factor w/ 2 levels "Yes","No": 1 2 2 1 1 2 1 2 2 2 ...
##  $ emp_ppal : Factor w/ 2 levels "Formal","Informal": 2 1 1 2 2 1 2 1 1 1 ...
##  $ eda      : int  33 24 24 25 24 57 29 38 34 51 ...
##  $ anios_esc: int  13 6 17 15 12 9 16 11 16 9 ...
##  $ n_hij    : int  2 0 0 NA NA NA 0 NA NA NA ...
##  $ hrsocup  : int  30 66 30 27 54 40 54 30 48 60 ...
##  $ ingocup  : int  20000 5160 10000 24000 9000 3870 12000 8600 12000 9460 ...
##  $ ing_x_hrs: num  155 18.2 77.5 206.7 38.8 ...

As we chose to analyze only successful interviews of the employed population, we can delete r_def and Clase2 columns with only one possible value.

enoe_clean <- select(enoe_clean, t_loc_tri, ent, sex, seg_soc, emp_ppal, eda, anios_esc, n_hij, hrsocup, ingocup, ing_x_hrs)
str(enoe_clean)
## 'data.frame':    126843 obs. of  11 variables:
##  $ t_loc_tri: Factor w/ 4 levels "100k+ hab.","15k-100 hab.",..: 1 1 1 1 1 1 1 2 1 1 ...
##  $ ent      : Factor w/ 32 levels "Aguascalientes",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ sex      : Factor w/ 2 levels "man","woman": 2 2 2 1 1 1 2 1 1 1 ...
##  $ seg_soc  : Factor w/ 2 levels "Yes","No": 1 2 2 1 1 2 1 2 2 2 ...
##  $ emp_ppal : Factor w/ 2 levels "Formal","Informal": 2 1 1 2 2 1 2 1 1 1 ...
##  $ eda      : int  33 24 24 25 24 57 29 38 34 51 ...
##  $ anios_esc: int  13 6 17 15 12 9 16 11 16 9 ...
##  $ n_hij    : int  2 0 0 NA NA NA 0 NA NA NA ...
##  $ hrsocup  : int  30 66 30 27 54 40 54 30 48 60 ...
##  $ ingocup  : int  20000 5160 10000 24000 9000 3870 12000 8600 12000 9460 ...
##  $ ing_x_hrs: num  155 18.2 77.5 206.7 38.8 ...

After all these process, we have a dataframe with 126843 observations and 11 variables (5 categorical and 6 numerical).

Descriptive Statistics for each variable

#Print summary
summary(enoe_clean)
##          t_loc_tri             ent           sex        seg_soc    
##  100k+ hab.   :73856   Chiapas   : 6275   man  :73299   Yes:54160  
##  15k-100 hab. :16842   Campeche  : 6124   woman:53544   No :72683  
##  2.5k-15k hab.:16692   Coahuila  : 6081                            
##  <2.5k hab.   :19453   Tamaulipas: 5358                            
##                        Veracruz  : 5164                            
##                        Nayarit   : 4880                            
##                        (Other)   :92961                            
##      emp_ppal          eda          anios_esc         n_hij      
##  Formal  :66126   Min.   :15.00   Min.   : 0.00   Min.   : 0.0   
##  Informal:60717   1st Qu.:28.00   1st Qu.: 9.00   1st Qu.: 0.0   
##                   Median :39.00   Median :10.00   Median : 2.0   
##                   Mean   :39.98   Mean   :10.66   Mean   : 1.9   
##                   3rd Qu.:50.00   3rd Qu.:13.00   3rd Qu.: 3.0   
##                   Max.   :97.00   Max.   :24.00   Max.   :25.0   
##                                                   NA's   :73299  
##     hrsocup          ingocup         ing_x_hrs       
##  Min.   :  1.00   Min.   :    15   Min.   :   0.066  
##  1st Qu.: 32.00   1st Qu.:  5590   1st Qu.:  33.036  
##  Median : 45.00   Median :  8600   Median :  46.512  
##  Mean   : 41.96   Mean   :  9917   Mean   :  62.994  
##  3rd Qu.: 50.00   3rd Qu.: 12040   3rd Qu.:  70.000  
##  Max.   :168.00   Max.   :219000   Max.   :9302.326  
## 
library(pastecs)
## 
## Adjuntando el paquete: 'pastecs'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
options(scipen=100)
options(digits=2)
print(stat.desc(select(enoe_clean, eda, anios_esc, n_hij, hrsocup, ingocup, ing_x_hrs)))
##                      eda   anios_esc       n_hij     hrsocup       ingocup
## nbr.val       126843.000  126843.000  53544.0000  126843.000     126843.00
## nbr.null           0.000    3025.000  14074.0000       0.000          0.00
## nbr.na             0.000       0.000  73299.0000       0.000          0.00
## min               15.000       0.000      0.0000       1.000         15.00
## max               97.000      24.000     25.0000     168.000     219000.00
## range             82.000      24.000     25.0000     167.000     218985.00
## sum          5071441.000 1351731.000 101711.0000 5322112.000 1257869452.00
## median            39.000      10.000      2.0000      45.000       8600.00
## mean              39.982      10.657      1.8996      41.958       9916.74
## SE.mean            0.040       0.012      0.0074       0.047         23.00
## CI.mean.0.95       0.078       0.023      0.0145       0.092         45.08
## var              201.269      17.781      2.9243     279.252   67109089.07
## std.dev           14.187       4.217      1.7101      16.711       8192.01
## coef.var           0.355       0.396      0.9002       0.398          0.83
##                ing_x_hrs
## nbr.val       126843.000
## nbr.null           0.000
## nbr.na             0.000
## min                0.066
## max             9302.326
## range           9302.259
## sum          7990343.724
## median            46.512
## mean              62.994
## SE.mean            0.222
## CI.mean.0.95       0.436
## var             6266.224
## std.dev           79.159
## coef.var           1.257

Overall mean of each variable

stat.desc(select(enoe_clean, eda, anios_esc, n_hij, hrsocup, ingocup, ing_x_hrs))[c("mean"), ]
##      eda anios_esc n_hij hrsocup ingocup ing_x_hrs
## mean  40        11   1.9      42    9917        63

Data visualization

With our data cleaned, we can start to visualize how hourly wage behaves. First, we plot the distribution of hourly wage in our observations.

library(ggplot2)

ggplot(enoe_clean, aes(x = ing_x_hrs)) + 
  geom_histogram(bins = 50) + 
  labs(title = "Hourly wage distribution", x = "Hourly wage (Mexican pesos)", y = "Frecuency")

As we can see, hourly wage is very skewed to the left. Let’s create a new variable where we applied logarithm to hourly wage.

enoe_clean <- mutate(enoe_clean, loghw = log(ing_x_hrs, base = 10))
summary(enoe_clean$loghw)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    -1.2     1.5     1.7     1.7     1.8     4.0
ggplot(enoe_clean, aes(x = loghw)) + 
  geom_histogram(binwidth = 0.1) + 
  labs(title = "Hourly wage distribution (log)", x = "Log(Hourly wage in Mexican pesos)", y = "Frecuency")

Categorical variables

barplot(table(enoe_clean$t_loc_tri), main = "Distribution by locality size", xlab ="Locality size", ylab = "Frecuency")

barplot(table(enoe_clean$ent), main = "Distribution by state", ylab = "Frecuency", las = 2)

pie(table(enoe_clean$sex), main = "Distribution by sex", xlab ="Sex", ylab = "Frecuency")

pie(table(enoe_clean$seg_soc), main = "Distribution by access to social security", ylab = "Frecuency")

pie(table(enoe_clean$emp_ppal), main = "Distribution by formality", ylab = "Frecuency")

Numerical variables

Distribution by age and years of schooling

ggplot(enoe_clean, aes(x = eda)) + 
  geom_histogram(binwidth = 5) + 
  labs(title = "Distribution by age", x = "Age (years)", y = "Frecuency")

ggplot(enoe_clean, aes(x = anios_esc)) + 
  geom_histogram(binwidth = 1) + 
  labs(title = "Distribution by years of schooling", x = "Years", y = "Frecuency")

Distribution by number of children (only women). We delete the outliers for our future model’s dataframe.

ggplot(enoe_clean, aes(x = n_hij)) + 
  geom_boxplot() + 
  labs(title = "Distribution by number of children (only women)", x = "Number of children")
## Warning: Removed 73299 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

enoe_for_model <- filter (enoe_clean, n_hij <= 10 | is.na(n_hij))

ggplot(enoe_for_model, aes(x = n_hij)) + 
  geom_boxplot() + 
  labs(title = "Distribution by number of children (only women)", x = "Number of children")
## Warning: Removed 73299 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Weekly work hours Distribution by weekly work hours. We delete the outliers for our future model’s dataframe.

ggplot(enoe_clean, aes(x = hrsocup)) + 
  geom_boxplot() + 
  labs(title = "Distribution by weekly work hours", x = "Hours")

enoe_for_model <-  filter(enoe_clean, hrsocup <= 98)

ggplot(enoe_for_model, aes(x = hrsocup)) + 
  geom_boxplot() + 
  labs(title = "Distribution by weekly work hours", x = "Hours")

Visualization of hourly wage in comparison to other variables

Categorical variables

ggplot(enoe_clean, aes(x = t_loc_tri, y = loghw)) + 
  geom_boxplot() + 
  labs(title = "Hourly wage by locality size", x = "Locality size", y = "Hourly wage")

ggplot(enoe_clean, aes(x = reorder(ent, loghw, median), y = loghw)) + 
  geom_boxplot() + 
  labs(title = "Hourly wage by state", x = "State", y = "Hourly wage") + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

ggplot(enoe_clean, aes(x = reorder(sex, loghw, median), y = loghw)) + 
  geom_boxplot() + 
  labs(title = "Hourly wage by sex", x = "Sex", y = "Hourly wage")

ggplot(enoe_clean, aes(x = reorder(seg_soc, loghw, median), y = loghw)) + 
  geom_boxplot() + 
  labs(title = "Hourly wage by access to social security", x= "Access to social security", y = "Hourly wage")

ggplot(enoe_clean, aes(x = reorder(emp_ppal, loghw, median), y = loghw)) + 
  geom_boxplot() + 
  labs(title = "Hourly wage by formal or type of work", x= "Sector", y = "Hourly wage")

Numerical values

We skip weekly work hours and monthly wage as they are obviously dependent to hourly wage.

ggplot(enoe_clean, aes(x = eda, y = loghw)) + 
  geom_point() + 
  labs(title = "Hourly wage by age", x = "Age (years)", y = "Hourly wage")

ggplot(enoe_clean, aes(x = anios_esc, y = loghw)) + 
  geom_point() + 
  labs(title = "Hourly wage by years of schooling", x = "Years", y = "Hourly wage")

ggplot(enoe_clean, aes(x = n_hij, y = loghw)) + 
  geom_point() + 
  labs(title = "Hourly wage by number of children (only women)", x = "Children", y = "Hourly wage")
## Warning: Removed 73299 rows containing missing values or values outside the scale range
## (`geom_point()`).

Predictive modeling

First, we create a linear regression with the variables of size of locality, sex, type of work (formal/informal), age and years of schooling. For this model, we will use the hourly wage without any aditional operation (no log).

lm_1 <- lm(ing_x_hrs ~ t_loc_tri + sex  +emp_ppal + eda + anios_esc, data=enoe_for_model)
summary(lm_1)
## 
## Call:
## lm(formula = ing_x_hrs ~ t_loc_tri + sex + emp_ppal + eda + anios_esc, 
##     data = enoe_for_model)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -106    -28    -12      9   9231 
## 
## Coefficients:
##                        Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)              0.4109     1.0605    0.39                  0.7    
## t_loc_tri15k-100 hab.   -5.5405     0.6599   -8.40 < 0.0000000000000002 ***
## t_loc_tri2.5k-15k hab.  -8.8872     0.6707  -13.25 < 0.0000000000000002 ***
## t_loc_tri<2.5k hab.     -9.1943     0.6506  -14.13 < 0.0000000000000002 ***
## sexwoman                -6.1808     0.4399  -14.05 < 0.0000000000000002 ***
## emp_ppalInformal         2.2456     0.4730    4.75            0.0000021 ***
## eda                      0.5583     0.0158   35.26 < 0.0000000000000002 ***
## anios_esc                4.2385     0.0586   72.37 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 77 on 126616 degrees of freedom
## Multiple R-squared:  0.0603, Adjusted R-squared:  0.0602 
## F-statistic: 1.16e+03 on 7 and 126616 DF,  p-value: <0.0000000000000002

Now, we try to build a model transforming hourly wage with logarithm base 10. This will make that “estimate” is in terms of percentages insteado of pesos.

lm_2 <- lm(loghw ~ t_loc_tri + sex  +emp_ppal + eda + anios_esc, data=enoe_for_model)
summary(lm_2)
## 
## Call:
## lm(formula = loghw ~ t_loc_tri + sex + emp_ppal + eda + anios_esc, 
##     data = enoe_for_model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7249 -0.1539 -0.0079  0.1511  2.2144 
## 
## Coefficients:
##                          Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)             1.3783898  0.0039090   352.6 <0.0000000000000002 ***
## t_loc_tri15k-100 hab.  -0.0377596  0.0024323   -15.5 <0.0000000000000002 ***
## t_loc_tri2.5k-15k hab. -0.0687288  0.0024723   -27.8 <0.0000000000000002 ***
## t_loc_tri<2.5k hab.    -0.0910131  0.0023980   -38.0 <0.0000000000000002 ***
## sexwoman               -0.0476430  0.0016215   -29.4 <0.0000000000000002 ***
## emp_ppalInformal        0.0706617  0.0017434    40.5 <0.0000000000000002 ***
## eda                     0.0018784  0.0000584    32.2 <0.0000000000000002 ***
## anios_esc               0.0226342  0.0002159   104.8 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.28 on 126616 degrees of freedom
## Multiple R-squared:  0.167,  Adjusted R-squared:  0.167 
## F-statistic: 3.62e+03 on 7 and 126616 DF,  p-value: <0.0000000000000002

In order to be able to understand estimate values of this second model, it is necessary to exponentiate the coefficient. The result is the percentage of the intercept that we could get if other variables are the same.

print(10^lm_2$coefficients)
##            (Intercept)  t_loc_tri15k-100 hab. t_loc_tri2.5k-15k hab. 
##                  23.90                   0.92                   0.85 
##    t_loc_tri<2.5k hab.               sexwoman       emp_ppalInformal 
##                   0.81                   0.90                   1.18 
##                    eda              anios_esc 
##                   1.00                   1.05

Lastly, we could explore the impact of number of children in the hourly wage of women.

lm_3 <- lm(loghw ~ t_loc_tri +emp_ppal + eda + anios_esc + n_hij, data=filter(enoe_for_model, sex == "woman")) 
print(summary(lm_3))
## 
## Call:
## lm(formula = loghw ~ t_loc_tri + emp_ppal + eda + anios_esc + 
##     n_hij, data = filter(enoe_for_model, sex == "woman"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0444 -0.1585 -0.0088  0.1550  2.1979 
## 
## Coefficients:
##                         Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)             1.284405   0.006498  197.66 <0.0000000000000002 ***
## t_loc_tri15k-100 hab.  -0.038743   0.003794  -10.21 <0.0000000000000002 ***
## t_loc_tri2.5k-15k hab. -0.075510   0.003931  -19.21 <0.0000000000000002 ***
## t_loc_tri<2.5k hab.    -0.089100   0.004038  -22.07 <0.0000000000000002 ***
## emp_ppalInformal        0.074612   0.002773   26.91 <0.0000000000000002 ***
## eda                     0.001486   0.000113   13.20 <0.0000000000000002 ***
## anios_esc               0.026717   0.000359   74.38 <0.0000000000000002 ***
## n_hij                   0.008403   0.000950    8.84 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.29 on 53491 degrees of freedom
## Multiple R-squared:  0.182,  Adjusted R-squared:  0.182 
## F-statistic: 1.7e+03 on 7 and 53491 DF,  p-value: <0.0000000000000002
print(10^lm_3$coefficients)
##            (Intercept)  t_loc_tri15k-100 hab. t_loc_tri2.5k-15k hab. 
##                  19.25                   0.91                   0.84 
##    t_loc_tri<2.5k hab.       emp_ppalInformal                    eda 
##                   0.81                   1.19                   1.00 
##              anios_esc                  n_hij 
##                   1.06                   1.02

Correlation plot

library(ggcorrplot)
## Warning: package 'ggcorrplot' was built under R version 4.4.2
corr_tab <- cor(select_if(enoe_for_model, is.numeric), use = "complete.obs")

ggcorrplot(corr_tab, hc.order = TRUE, lab=TRUE)

## Conclusions: