The Encuesta Nacional de Ocupación y Empleo (ENOE), or National Survey of Occupation and Employment, is a survey conducted by Mexico’s National Institute of Statistics and Geography (INEGI) to gather data on the country’s labor market. It collects information on employment, unemployment, underemployment, labor informality, and other characteristics of the population aged 15 years and older.
The ENOE is a continuous survey, meaning it is conducted year-round. Data is collected monthly, allowing for the generation of quarterly and annual statistics.
enoe_raw <- read.csv("enoe_2024_trim2_csv/ENOE_SDEMT224.csv")
dim(enoe_raw)
## [1] 427123 114
For this final project, I will analyze the results of the ENOE survey collected in the second quarter of 2024. After accessing the publicly available microdata for the socio-demographic questionnaire ( https://www.inegi.org.mx/programas/enoe/15ymas/#microdatos ), we downloaded a dataset with 114 variables and 427,123 observations. Each observation correspond to one person interviewed. Data cleaning and variable selection will be crucial steps in my analysis to ensure meaningful results.
For this project, the purpose of my analysis will be to find the most important variables that affect the hourly wage. Using the survey’s documentation as guide ( https://www.inegi.org.mx/rnm/index.php/catalog/907/data-dictionary/F42), I have choose to work with 13 variables and valid observations:
r_def - Categorical - Interview results. 0 if successful
t_loc_trim - Categorical - Population size of a locality
ent - Categorical - State (Mexican administrative subdivisions)
sex - Categorical - Sex
Clase2 - Categorical - Classification of the employed and unemployed population; available and not available [for work].
seg_soc - Categorical - Access to social security
emp_ppal - Categorical - Classification of formal and informal employment
eda - Numerical - Age
anios_esc - Numerical - Year of schooling
n_hij - Numerical - Number of children (only for women)
hrsocup - Numerical - Weekly work hours
ingocup - Numerical - Monthly wage
ing_x_hrs - Numerical - Hourly wage
#First, we select the chosen variables
library(dplyr)
##
## Adjuntando el paquete: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
enoe_clean <- select(enoe_raw, r_def, t_loc_tri, ent, sex, clase2, seg_soc, emp_ppal, eda, anios_esc, n_hij, hrsocup, ingocup, ing_x_hrs)
#Next, we chose only valid values according survey's documentation.
# 1. We take only valid interviews (r_def == 0)
# 2. We take only employed population (clase2 == 1)
# 3. We delete those values without information about social security access (seg_soc == 3)
# 4. We delete those values without information about age (eda >= 98)
# 5. We delete those values without information about schooling (anios_esc < 99)
# 6. We delete those values without information about number of children (n_hij == 99 or NA if there are men)
# 7. We delete those values without an hourly wage (ing_x_hrs == 0)
enoe_clean <- filter(enoe_clean,
r_def == 0,
clase2 == 1,
seg_soc != 3,
eda >=15 & eda < 98,
anios_esc != 99,
n_hij != 99 | is.na(n_hij),
ing_x_hrs != 0)
str(enoe_clean)
## 'data.frame': 126843 obs. of 13 variables:
## $ r_def : int 0 0 0 0 0 0 0 0 0 0 ...
## $ t_loc_tri: int 1 1 1 1 1 1 1 2 1 1 ...
## $ ent : int 9 9 9 9 9 9 9 9 9 9 ...
## $ sex : int 2 2 2 1 1 1 2 1 1 1 ...
## $ clase2 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ seg_soc : int 1 2 2 1 1 2 1 2 2 2 ...
## $ emp_ppal : int 2 1 1 2 2 1 2 1 1 1 ...
## $ eda : int 33 24 24 25 24 57 29 38 34 51 ...
## $ anios_esc: int 13 6 17 15 12 9 16 11 16 9 ...
## $ n_hij : int 2 0 0 NA NA NA 0 NA NA NA ...
## $ hrsocup : int 30 66 30 27 54 40 54 30 48 60 ...
## $ ingocup : int 20000 5160 10000 24000 9000 3870 12000 8600 12000 9460 ...
## $ ing_x_hrs: num 155 18.2 77.5 206.7 38.8 ...
After cleaning the dataset, we were left with 126,843 observations and 13 variables. Initially, the categorical variables are represented as integers due to survey coding practices. To improve clarity and facilitate analysis, we need convert these integer values into meaningful categorical factors.
mex_states <- c("Aguascalientes", "Baja California", "Baja California Sur", "Campeche", "Coahuila", "Colima", "Chiapas", "Chihuahua", "Ciudad de México", "Durango", "Guanajuato", "Guerrero", "Hidalgo", "Jalisco", "Estado de México", "Michoacán", "Morelos", "Nayarit", "Nuevo León", "Oaxaca", "Puebla", "Querétaro", "Quintana Roo", "San Luis Potosí", "Sinaloa", "Sonora", "Tabasco", "Tamaulipas", "Tlaxcala", "Veracruz", "Yucatán", "Zacatecas")
enoe_clean <- mutate(enoe_clean,
t_loc_tri = factor(t_loc_tri, levels = 1:4, labels = c("100k+ hab.", "15k-100 hab.", "2.5k-15k hab.", "<2.5k hab.")),
ent = factor (ent, levels = 1:32, labels = mex_states),
sex = factor (sex, levels = c(1, 2), labels = c("man", "woman")),
seg_soc = factor(seg_soc, levels = c(1, 2), labels = c("Yes", "No")),
emp_ppal = factor(emp_ppal, levels = c(1, 2), labels = c("Formal", "Informal")))
str(enoe_clean)
## 'data.frame': 126843 obs. of 13 variables:
## $ r_def : int 0 0 0 0 0 0 0 0 0 0 ...
## $ t_loc_tri: Factor w/ 4 levels "100k+ hab.","15k-100 hab.",..: 1 1 1 1 1 1 1 2 1 1 ...
## $ ent : Factor w/ 32 levels "Aguascalientes",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ sex : Factor w/ 2 levels "man","woman": 2 2 2 1 1 1 2 1 1 1 ...
## $ clase2 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ seg_soc : Factor w/ 2 levels "Yes","No": 1 2 2 1 1 2 1 2 2 2 ...
## $ emp_ppal : Factor w/ 2 levels "Formal","Informal": 2 1 1 2 2 1 2 1 1 1 ...
## $ eda : int 33 24 24 25 24 57 29 38 34 51 ...
## $ anios_esc: int 13 6 17 15 12 9 16 11 16 9 ...
## $ n_hij : int 2 0 0 NA NA NA 0 NA NA NA ...
## $ hrsocup : int 30 66 30 27 54 40 54 30 48 60 ...
## $ ingocup : int 20000 5160 10000 24000 9000 3870 12000 8600 12000 9460 ...
## $ ing_x_hrs: num 155 18.2 77.5 206.7 38.8 ...
As we chose to analyze only successful interviews of the employed population, we can delete r_def and Clase2 columns with only one possible value.
enoe_clean <- select(enoe_clean, t_loc_tri, ent, sex, seg_soc, emp_ppal, eda, anios_esc, n_hij, hrsocup, ingocup, ing_x_hrs)
str(enoe_clean)
## 'data.frame': 126843 obs. of 11 variables:
## $ t_loc_tri: Factor w/ 4 levels "100k+ hab.","15k-100 hab.",..: 1 1 1 1 1 1 1 2 1 1 ...
## $ ent : Factor w/ 32 levels "Aguascalientes",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ sex : Factor w/ 2 levels "man","woman": 2 2 2 1 1 1 2 1 1 1 ...
## $ seg_soc : Factor w/ 2 levels "Yes","No": 1 2 2 1 1 2 1 2 2 2 ...
## $ emp_ppal : Factor w/ 2 levels "Formal","Informal": 2 1 1 2 2 1 2 1 1 1 ...
## $ eda : int 33 24 24 25 24 57 29 38 34 51 ...
## $ anios_esc: int 13 6 17 15 12 9 16 11 16 9 ...
## $ n_hij : int 2 0 0 NA NA NA 0 NA NA NA ...
## $ hrsocup : int 30 66 30 27 54 40 54 30 48 60 ...
## $ ingocup : int 20000 5160 10000 24000 9000 3870 12000 8600 12000 9460 ...
## $ ing_x_hrs: num 155 18.2 77.5 206.7 38.8 ...
After all these process, we have a dataframe with 126843 observations and 11 variables (5 categorical and 6 numerical).
#Print summary
summary(enoe_clean)
## t_loc_tri ent sex seg_soc
## 100k+ hab. :73856 Chiapas : 6275 man :73299 Yes:54160
## 15k-100 hab. :16842 Campeche : 6124 woman:53544 No :72683
## 2.5k-15k hab.:16692 Coahuila : 6081
## <2.5k hab. :19453 Tamaulipas: 5358
## Veracruz : 5164
## Nayarit : 4880
## (Other) :92961
## emp_ppal eda anios_esc n_hij
## Formal :66126 Min. :15.00 Min. : 0.00 Min. : 0.0
## Informal:60717 1st Qu.:28.00 1st Qu.: 9.00 1st Qu.: 0.0
## Median :39.00 Median :10.00 Median : 2.0
## Mean :39.98 Mean :10.66 Mean : 1.9
## 3rd Qu.:50.00 3rd Qu.:13.00 3rd Qu.: 3.0
## Max. :97.00 Max. :24.00 Max. :25.0
## NA's :73299
## hrsocup ingocup ing_x_hrs
## Min. : 1.00 Min. : 15 Min. : 0.066
## 1st Qu.: 32.00 1st Qu.: 5590 1st Qu.: 33.036
## Median : 45.00 Median : 8600 Median : 46.512
## Mean : 41.96 Mean : 9917 Mean : 62.994
## 3rd Qu.: 50.00 3rd Qu.: 12040 3rd Qu.: 70.000
## Max. :168.00 Max. :219000 Max. :9302.326
##
library(pastecs)
##
## Adjuntando el paquete: 'pastecs'
## The following objects are masked from 'package:dplyr':
##
## first, last
options(scipen=100)
options(digits=2)
print(stat.desc(select(enoe_clean, eda, anios_esc, n_hij, hrsocup, ingocup, ing_x_hrs)))
## eda anios_esc n_hij hrsocup ingocup
## nbr.val 126843.000 126843.000 53544.0000 126843.000 126843.00
## nbr.null 0.000 3025.000 14074.0000 0.000 0.00
## nbr.na 0.000 0.000 73299.0000 0.000 0.00
## min 15.000 0.000 0.0000 1.000 15.00
## max 97.000 24.000 25.0000 168.000 219000.00
## range 82.000 24.000 25.0000 167.000 218985.00
## sum 5071441.000 1351731.000 101711.0000 5322112.000 1257869452.00
## median 39.000 10.000 2.0000 45.000 8600.00
## mean 39.982 10.657 1.8996 41.958 9916.74
## SE.mean 0.040 0.012 0.0074 0.047 23.00
## CI.mean.0.95 0.078 0.023 0.0145 0.092 45.08
## var 201.269 17.781 2.9243 279.252 67109089.07
## std.dev 14.187 4.217 1.7101 16.711 8192.01
## coef.var 0.355 0.396 0.9002 0.398 0.83
## ing_x_hrs
## nbr.val 126843.000
## nbr.null 0.000
## nbr.na 0.000
## min 0.066
## max 9302.326
## range 9302.259
## sum 7990343.724
## median 46.512
## mean 62.994
## SE.mean 0.222
## CI.mean.0.95 0.436
## var 6266.224
## std.dev 79.159
## coef.var 1.257
Overall mean of each variable
stat.desc(select(enoe_clean, eda, anios_esc, n_hij, hrsocup, ingocup, ing_x_hrs))[c("mean"), ]
## eda anios_esc n_hij hrsocup ingocup ing_x_hrs
## mean 40 11 1.9 42 9917 63
With our data cleaned, we can start to visualize how hourly wage behaves. First, we plot the distribution of hourly wage in our observations.
library(ggplot2)
ggplot(enoe_clean, aes(x = ing_x_hrs)) +
geom_histogram(bins = 50) +
labs(title = "Hourly wage distribution", x = "Hourly wage (Mexican pesos)", y = "Frecuency")
As we can see, hourly wage is very skewed to the left. Let’s create a new variable where we applied logarithm to hourly wage.
enoe_clean <- mutate(enoe_clean, loghw = log(ing_x_hrs, base = 10))
summary(enoe_clean$loghw)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.2 1.5 1.7 1.7 1.8 4.0
ggplot(enoe_clean, aes(x = loghw)) +
geom_histogram(binwidth = 0.1) +
labs(title = "Hourly wage distribution (log)", x = "Log(Hourly wage in Mexican pesos)", y = "Frecuency")
Categorical variables
barplot(table(enoe_clean$t_loc_tri), main = "Distribution by locality size", xlab ="Locality size", ylab = "Frecuency")
barplot(table(enoe_clean$ent), main = "Distribution by state", ylab = "Frecuency", las = 2)
pie(table(enoe_clean$sex), main = "Distribution by sex", xlab ="Sex", ylab = "Frecuency")
pie(table(enoe_clean$seg_soc), main = "Distribution by access to social security", ylab = "Frecuency")
pie(table(enoe_clean$emp_ppal), main = "Distribution by formality", ylab = "Frecuency")
Numerical variables
Distribution by age and years of schooling
ggplot(enoe_clean, aes(x = eda)) +
geom_histogram(binwidth = 5) +
labs(title = "Distribution by age", x = "Age (years)", y = "Frecuency")
ggplot(enoe_clean, aes(x = anios_esc)) +
geom_histogram(binwidth = 1) +
labs(title = "Distribution by years of schooling", x = "Years", y = "Frecuency")
Distribution by number of children (only women). We delete the outliers
for our future model’s dataframe.
ggplot(enoe_clean, aes(x = n_hij)) +
geom_boxplot() +
labs(title = "Distribution by number of children (only women)", x = "Number of children")
## Warning: Removed 73299 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
enoe_for_model <- filter (enoe_clean, n_hij <= 10 | is.na(n_hij))
ggplot(enoe_for_model, aes(x = n_hij)) +
geom_boxplot() +
labs(title = "Distribution by number of children (only women)", x = "Number of children")
## Warning: Removed 73299 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Weekly work hours Distribution by weekly work hours. We delete the outliers for our future model’s dataframe.
ggplot(enoe_clean, aes(x = hrsocup)) +
geom_boxplot() +
labs(title = "Distribution by weekly work hours", x = "Hours")
enoe_for_model <- filter(enoe_clean, hrsocup <= 98)
ggplot(enoe_for_model, aes(x = hrsocup)) +
geom_boxplot() +
labs(title = "Distribution by weekly work hours", x = "Hours")
Categorical variables
ggplot(enoe_clean, aes(x = t_loc_tri, y = loghw)) +
geom_boxplot() +
labs(title = "Hourly wage by locality size", x = "Locality size", y = "Hourly wage")
ggplot(enoe_clean, aes(x = reorder(ent, loghw, median), y = loghw)) +
geom_boxplot() +
labs(title = "Hourly wage by state", x = "State", y = "Hourly wage") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
ggplot(enoe_clean, aes(x = reorder(sex, loghw, median), y = loghw)) +
geom_boxplot() +
labs(title = "Hourly wage by sex", x = "Sex", y = "Hourly wage")
ggplot(enoe_clean, aes(x = reorder(seg_soc, loghw, median), y = loghw)) +
geom_boxplot() +
labs(title = "Hourly wage by access to social security", x= "Access to social security", y = "Hourly wage")
ggplot(enoe_clean, aes(x = reorder(emp_ppal, loghw, median), y = loghw)) +
geom_boxplot() +
labs(title = "Hourly wage by formal or type of work", x= "Sector", y = "Hourly wage")
Numerical values
We skip weekly work hours and monthly wage as they are obviously dependent to hourly wage.
ggplot(enoe_clean, aes(x = eda, y = loghw)) +
geom_point() +
labs(title = "Hourly wage by age", x = "Age (years)", y = "Hourly wage")
ggplot(enoe_clean, aes(x = anios_esc, y = loghw)) +
geom_point() +
labs(title = "Hourly wage by years of schooling", x = "Years", y = "Hourly wage")
ggplot(enoe_clean, aes(x = n_hij, y = loghw)) +
geom_point() +
labs(title = "Hourly wage by number of children (only women)", x = "Children", y = "Hourly wage")
## Warning: Removed 73299 rows containing missing values or values outside the scale range
## (`geom_point()`).
First, we create a linear regression with the variables of size of locality, sex, type of work (formal/informal), age and years of schooling. For this model, we will use the hourly wage without any aditional operation (no log).
lm_1 <- lm(ing_x_hrs ~ t_loc_tri + sex +emp_ppal + eda + anios_esc, data=enoe_for_model)
summary(lm_1)
##
## Call:
## lm(formula = ing_x_hrs ~ t_loc_tri + sex + emp_ppal + eda + anios_esc,
## data = enoe_for_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -106 -28 -12 9 9231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4109 1.0605 0.39 0.7
## t_loc_tri15k-100 hab. -5.5405 0.6599 -8.40 < 0.0000000000000002 ***
## t_loc_tri2.5k-15k hab. -8.8872 0.6707 -13.25 < 0.0000000000000002 ***
## t_loc_tri<2.5k hab. -9.1943 0.6506 -14.13 < 0.0000000000000002 ***
## sexwoman -6.1808 0.4399 -14.05 < 0.0000000000000002 ***
## emp_ppalInformal 2.2456 0.4730 4.75 0.0000021 ***
## eda 0.5583 0.0158 35.26 < 0.0000000000000002 ***
## anios_esc 4.2385 0.0586 72.37 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 77 on 126616 degrees of freedom
## Multiple R-squared: 0.0603, Adjusted R-squared: 0.0602
## F-statistic: 1.16e+03 on 7 and 126616 DF, p-value: <0.0000000000000002
Now, we try to build a model transforming hourly wage with logarithm base 10. This will make that “estimate” is in terms of percentages insteado of pesos.
lm_2 <- lm(loghw ~ t_loc_tri + sex +emp_ppal + eda + anios_esc, data=enoe_for_model)
summary(lm_2)
##
## Call:
## lm(formula = loghw ~ t_loc_tri + sex + emp_ppal + eda + anios_esc,
## data = enoe_for_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7249 -0.1539 -0.0079 0.1511 2.2144
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3783898 0.0039090 352.6 <0.0000000000000002 ***
## t_loc_tri15k-100 hab. -0.0377596 0.0024323 -15.5 <0.0000000000000002 ***
## t_loc_tri2.5k-15k hab. -0.0687288 0.0024723 -27.8 <0.0000000000000002 ***
## t_loc_tri<2.5k hab. -0.0910131 0.0023980 -38.0 <0.0000000000000002 ***
## sexwoman -0.0476430 0.0016215 -29.4 <0.0000000000000002 ***
## emp_ppalInformal 0.0706617 0.0017434 40.5 <0.0000000000000002 ***
## eda 0.0018784 0.0000584 32.2 <0.0000000000000002 ***
## anios_esc 0.0226342 0.0002159 104.8 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.28 on 126616 degrees of freedom
## Multiple R-squared: 0.167, Adjusted R-squared: 0.167
## F-statistic: 3.62e+03 on 7 and 126616 DF, p-value: <0.0000000000000002
In order to be able to understand estimate values of this second model, it is necessary to exponentiate the coefficient. The result is the percentage of the intercept that we could get if other variables are the same.
print(10^lm_2$coefficients)
## (Intercept) t_loc_tri15k-100 hab. t_loc_tri2.5k-15k hab.
## 23.90 0.92 0.85
## t_loc_tri<2.5k hab. sexwoman emp_ppalInformal
## 0.81 0.90 1.18
## eda anios_esc
## 1.00 1.05
Lastly, we could explore the impact of number of children in the hourly wage of women.
lm_3 <- lm(loghw ~ t_loc_tri +emp_ppal + eda + anios_esc + n_hij, data=filter(enoe_for_model, sex == "woman"))
print(summary(lm_3))
##
## Call:
## lm(formula = loghw ~ t_loc_tri + emp_ppal + eda + anios_esc +
## n_hij, data = filter(enoe_for_model, sex == "woman"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0444 -0.1585 -0.0088 0.1550 2.1979
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.284405 0.006498 197.66 <0.0000000000000002 ***
## t_loc_tri15k-100 hab. -0.038743 0.003794 -10.21 <0.0000000000000002 ***
## t_loc_tri2.5k-15k hab. -0.075510 0.003931 -19.21 <0.0000000000000002 ***
## t_loc_tri<2.5k hab. -0.089100 0.004038 -22.07 <0.0000000000000002 ***
## emp_ppalInformal 0.074612 0.002773 26.91 <0.0000000000000002 ***
## eda 0.001486 0.000113 13.20 <0.0000000000000002 ***
## anios_esc 0.026717 0.000359 74.38 <0.0000000000000002 ***
## n_hij 0.008403 0.000950 8.84 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.29 on 53491 degrees of freedom
## Multiple R-squared: 0.182, Adjusted R-squared: 0.182
## F-statistic: 1.7e+03 on 7 and 53491 DF, p-value: <0.0000000000000002
print(10^lm_3$coefficients)
## (Intercept) t_loc_tri15k-100 hab. t_loc_tri2.5k-15k hab.
## 19.25 0.91 0.84
## t_loc_tri<2.5k hab. emp_ppalInformal eda
## 0.81 1.19 1.00
## anios_esc n_hij
## 1.06 1.02
library(ggcorrplot)
## Warning: package 'ggcorrplot' was built under R version 4.4.2
corr_tab <- cor(select_if(enoe_for_model, is.numeric), use = "complete.obs")
ggcorrplot(corr_tab, hc.order = TRUE, lab=TRUE)
## Conclusions: