Mincer equations are a standard tool for studying the relationship between education and wages empirically. The following equation is the most basic specification of an income equation.
\[ \ln w = \beta_0 + \beta_1\,\text{edu} + \beta_2\,\text{exp} + \beta_3\,\text{exp}^2 + \beta_4\,\text{woman} + \mu \qquad (1) \]
where w is the hourly income from the main occupation, edu is the level of education (a factor with six categories, so it contributes one coefficient per category other than the baseline), exp is the years of experience, exp^2 is the years of experience squared, woman is a dichotomous variable equal to 1 for women and 0 otherwise, and 𝜇 is an error term.
The aim of this report is to estimate and interpret income equations using information from the Permanent Household Survey (EPH). The EPH is a national program in Argentina for the systematic and continuous production of social indicators carried out by the National Institute of Statistics and Census (INDEC) with the provincial statistical offices (DPE). It aims to survey the sociodemographic and socioeconomic characteristics of the population.
The database eph22.txt contains information extracted from the EPH for the fourth quarter of 2022 for the subsample of individuals aged between 18 and 65 years. It includes the following variables:
The EPH does not record years of work experience, so potential experience in the labor market (age minus years of education minus 6) is used as a proxy.
We will work with the EPH database from the fourth quarter of 2022, importing and filtering it in the following way:
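# Packages used in this report (assumed: the original library() calls are not shown)
library(dplyr)      # %>%, mutate(), filter(), rename(), recode_factor()
library(ggplot2)    # plots
library(questionr)  # wtd.table()
library(tseries)    # jarque.bera.test()

base <- read.csv("eph22.txt", sep = ";")
base <- base %>%
  # Set missing CH14 values, and values coded 99, to 0 approved years
  mutate(CH14 = ifelse(is.na(CH14) | CH14 == 99, 0, CH14)) %>%
  # Note: na.omit() ignores the PP3E_TOT argument and drops rows with an NA in any column
  na.omit(PP3E_TOT) %>%
  filter(AGLOMERADO == 9,   # Comodoro Rivadavia - Rada Tilly
         P21 > 0,
         PP3E_TOT > 0,
         CH12 < 9,
         CH13 < 3,
         CH14 != 98) %>%
  rename(edu = NIVEL_ED)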
To calculate potential experience, we first need to calculate the years a person studied.
Let’s create a variable “anoseduc” as follows. If the person finished (CH13) the highest level attended (CH12), the number of years required to complete that level is assigned. If they did not finish it, the years approved at the unfinished level (CH14) are added to the years accumulated before entering that level.
We create a function to calculate this:
calculate_years_educ <- function(ch12, ch13, ch14) {
  # Define years of study per educational level (in our cluster, primary is 6)
  completed_levels_years <- c(0, 6, 9, 12, 12, 15, 17, 19)
  incomplete_levels_years <- c(0, 0, 0, 6, 9, 12, 12, 17)
  # Calculate years of study based on conditions
  if (ch13 == 1) {
    years_educ <- completed_levels_years[ch12]
  } else {
    years_educ <- incomplete_levels_years[ch12] + ch14
  }
  return(years_educ)
}
# Apply the function to the base to create the column
base$anoseduc <- mapply(
  calculate_years_educ,
  base$CH12,
  base$CH13,
  base$CH14
)
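As a hedged alternative to the element-wise mapply() call, the same rule can be sketched in vectorized form. The function name years_educ_vec and the toy input values are hypothetical and serve only to illustrate the lookup logic; the two year tables are copied from the function above.

```r
# Hypothetical vectorized variant of calculate_years_educ (illustrative sketch)
years_educ_vec <- function(ch12, ch13, ch14) {
  completed_levels_years  <- c(0, 6, 9, 12, 12, 15, 17, 19)
  incomplete_levels_years <- c(0, 0, 0, 6, 9, 12, 12, 17)
  # ifelse() applies the completed / not-completed rule to whole columns at once
  ifelse(ch13 == 1,
         completed_levels_years[ch12],
         incomplete_levels_years[ch12] + ch14)
}

# Toy cases (made-up values): completed secondary, two approved years of
# unfinished university, completed university
toy <- data.frame(CH12 = c(4, 7, 7), CH13 = c(1, 2, 1), CH14 = c(0, 2, 0))
toy$anoseduc <- years_educ_vec(toy$CH12, toy$CH13, toy$CH14)
print(toy$anoseduc)
```

Like the original function, this sketch assumes CH12 takes values from 1 to 8; other codes would need explicit handling.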
Continuing with the required variables for the model:
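base$woman <- ifelse(base$CH04 == 2, 1, 0)
# One dummy per education level
base$edu1 <- ifelse(base$edu == 1, 1, 0)
base$edu2 <- ifelse(base$edu == 2, 1, 0)
base$edu3 <- ifelse(base$edu == 3, 1, 0)
base$edu4 <- ifelse(base$edu == 4, 1, 0)
base$edu5 <- ifelse(base$edu == 5, 1, 0)
base$edu6 <- ifelse(base$edu == 6, 1, 0)
base <- base %>%
  mutate(
    exp = CH06 - (anoseduc + 6),  # potential experience: age - years of education - 6
    w = P21 / (4 * PP3E_TOT)      # hourly wage: monthly income / (4 weeks * weekly hours)
  )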
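To make the two formulas concrete, here is a small worked check for a single hypothetical respondent. All values are made up for illustration and are not taken from the EPH; the variable names are likewise illustrative.

```r
# Hypothetical respondent (made-up values, not taken from the EPH)
age          <- 30      # plays the role of CH06
years_educ   <- 12      # plays the role of anoseduc (completed secondary)
monthly_inc  <- 100000  # plays the role of P21, pesos per month
weekly_hours <- 40      # plays the role of PP3E_TOT

exp_potential <- age - (years_educ + 6)            # 30 - 18
w_hourly      <- monthly_inc / (4 * weekly_hours)  # 100000 / 160
print(c(exp_potential = exp_potential, w_hourly = w_hourly))
```

The factor of 4 treats a month as four weeks, so the resulting hourly wage is an approximation.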
Hourly wages (w).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 36.46 520.83 833.33 973.21 1250.00 5208.33
Potential experience (exp).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 15.00 23.00 23.55 32.00 54.00
Distribution of hourly wages
ggplot(data = base) +
  aes(x = w, weight = PONDIIO) +  # the ggplot2 aesthetic is `weight`, not `weights`
  geom_histogram(fill = "salmon", color = "black") +
  labs(x = "Hourly wage (w)") +
  labs(y = "Frequency") +
  labs(title = "Distribution of hourly wages") +
  labs(subtitle = "In the Comodoro Rivadavia - Rada Tilly cluster in the fourth quarter of 2022") +
  labs(caption = "Source: Own elaboration based on EPH-INDEC")
Distribution of years of potential experience
ggplot(data = base) +
  aes(x = exp, weight = PONDERA) +  # the ggplot2 aesthetic is `weight`, not `weights`
  geom_histogram(fill = "lightblue", color = "black") +
  labs(x = "Years of potential experience") +
  labs(y = "Frequency") +
  labs(title = "Distribution of years of potential experience") +
  labs(subtitle = "In the Comodoro Rivadavia - Rada Tilly cluster in the fourth quarter of 2022") +
  labs(caption = "Source: Own elaboration based on EPH-INDEC")
Frequency of the highest level of education attained
base1 <- base %>%
  select(w, edu, woman, PONDERA)
base1$edu <- recode_factor(base1$edu,
                           `1` = "Incomplete Primary",
                           `2` = "Complete Primary",
                           `3` = "Incomplete Secondary",
                           `4` = "Complete Secondary",
                           `5` = "Incomplete University",
                           `6` = "Complete University")
ggplot(data = base1) +
  aes(x = edu, weight = PONDERA) +  # the ggplot2 aesthetic is `weight`, not `weights`
  # edu is categorical, so a bar chart of weighted counts replaces geom_histogram(stat = "count")
  geom_bar(fill = "lightgreen", color = "black") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Highest level of education attained") +
  labs(y = "Frequency") +
  labs(title = "Frequency of the highest level of education attained") +
  labs(subtitle = "In the Comodoro Rivadavia - Rada Tilly cluster in the fourth quarter of 2022") +
  labs(caption = "Source: Own elaboration based on EPH-INDEC")
Distribution of the highest level of education attained
edu_pie <- wtd.table(base$edu, weights = base$PONDERA)
percentage_edu <- round(edu_pie / sum(edu_pie) * 100, 2)
label_edu <- paste(percentage_edu, "%", sep = "")
pie(edu_pie, main = "Distribution of the highest level of education attained", cex.main = 1.0, labels = label_edu, radius = 0.8, border = "black", col = rainbow(length(edu_pie)))
title(xlab = "Source: Own elaboration based on EPH-INDEC", line = 3, cex.lab = 0.8)
mtext("In the Comodoro Rivadavia - Rada Tilly cluster in the fourth quarter of 2022", side = 3, line = 0.65, cex = 0.8)
legend("topleft", c("Incomplete Primary", "Complete Primary","Incomplete Secondary", "Complete Secondary", "Incomplete University", "Complete University"), cex = 0.7, fill = rainbow(length(edu_pie)))
Distribution of men and women in the sample
gender_pie <- table(base$woman)
percentages_gender = round(gender_pie/sum(gender_pie)*100,2)
label_gender <- paste(percentages_gender,"%",sep="")
pie(gender_pie, main= "Distribution of men and women in the sample", cex.main=1.2, labels=label_gender, radius=.8, border="black", col=c("red", "blue"))
mtext("In the Comodoro Rivadavia - Rada Tilly cluster in the fourth quarter of 2022", side = 3, line = 0.65, cex = 0.8)
legend("topright",c("Male", "Female"), cex = 0.8, fill=c("red", "blue"))
title(xlab = "Source: Own elaboration based on EPH-INDEC", line = 3, cex.lab = 0.8)
Boxplot of hourly wage according to the highest level of education attained
ggplot(data = base1, aes(x = edu, y = w)) +
  geom_boxplot(aes(fill = edu, group = edu), color = "black") +
  labs(x = "Highest level of education attained", y = "Hourly wage (w)",
       title = "Boxplot of hourly wage according to highest level of education attained",
       subtitle = "In the Comodoro Rivadavia - Rada Tilly cluster in the fourth quarter of 2022",
       caption = "Source: Own elaboration based on EPH-INDEC") +
  scale_y_continuous(limits = c(0, 5250)) +
  coord_flip() +
  theme_minimal()
Frequency by Education Level and Gender
count <- base %>%
  group_by(edu, woman) %>%
  summarise(count = n())
education_names <- c("Incomplete Primary", "Complete Primary",
                     "Incomplete Secondary", "Complete Secondary",
                     "Incomplete University", "Complete University")
ggplot(count, aes(x = factor(edu), y = count, fill = factor(woman))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Highest level of education attained", y = "Frequency", fill = "Gender") +
  scale_fill_manual(values = c("lightblue", "pink"), labels = c("Males", "Females")) +
  ggtitle("Frequency by Education Level and Gender") +
  scale_x_discrete(labels = education_names) +
  labs(subtitle = "In the Comodoro Rivadavia - Rada Tilly cluster in the fourth quarter of 2022") +
  labs(caption = "Source: Own elaboration based on EPH-INDEC") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Boxplot of Hourly Wage by Education Level and Gender
ggplot(base, aes(x = factor(edu), y = w, fill = factor(woman))) +
  geom_boxplot(position = position_dodge(width = 0.8)) +
  labs(x = "Highest level of education attained", y = "Hourly wage (w)", fill = "Gender") +
  scale_fill_manual(values = c("blue", "pink"), labels = c("Males", "Females")) +
  labs(subtitle = "In the Comodoro Rivadavia - Rada Tilly cluster in the fourth quarter of 2022") +
  labs(caption = "Source: Own elaboration based on EPH-INDEC") +
  scale_x_discrete(labels = education_names) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Boxplot of Hourly Wage by Education Level and Gender")
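Let’s estimate Equation (1) with a linear regression model. Education enters as five dummy variables (edu2 to edu6), so incomplete primary (edu1) is the omitted baseline category, and PONDIIO is used as the regression weight.
modelo1 <- lm(log(w) ~ edu2 + edu3 + edu4 + edu5 + edu6 + exp + I(exp^2) + woman, data = base, weights = PONDIIO)
summary(modelo1)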
##
## Call:
## lm(formula = log(w) ~ edu2 + edu3 + edu4 + edu5 + edu6 + exp +
## I(exp^2) + woman, data = base, weights = PONDIIO)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -74.243 -5.682 -0.387 5.647 34.238
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.6631833 0.2295795 24.668 < 2e-16 ***
## edu2 0.2445547 0.2133178 1.146 0.252467
## edu3 0.5125422 0.2064326 2.483 0.013542 *
## edu4 0.5675329 0.2059295 2.756 0.006185 **
## edu5 0.8154589 0.2203202 3.701 0.000252 ***
## edu6 1.1663373 0.2183201 5.342 1.74e-07 ***
## exp 0.0347174 0.0102778 3.378 0.000820 ***
## I(exp^2) -0.0005263 0.0002026 -2.598 0.009818 **
## woman -0.2033179 0.0706120 -2.879 0.004251 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.06 on 322 degrees of freedom
## Multiple R-squared: 0.2005, Adjusted R-squared: 0.1806
## F-statistic: 10.09 on 8 and 322 DF, p-value: 1.418e-12
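The hand-built dummies are one way to enter a categorical regressor; declaring the variable as a factor should give the same fit. The sketch below illustrates this on simulated data only (the data frame sim, its coefficients, and the noise level are all made up, and no EPH data is used), so it shows the equivalence rather than reproducing the estimates above.

```r
set.seed(1)
n <- 200
# Simulated data (hypothetical; not the EPH sample)
sim <- data.frame(
  edu   = sample(1:6, n, replace = TRUE),
  exp   = sample(0:40, n, replace = TRUE),
  woman = rbinom(n, 1, 0.5)
)
sim$logw <- 5.5 + 0.2 * sim$edu + 0.03 * sim$exp - 0.0005 * sim$exp^2 -
  0.2 * sim$woman + rnorm(n, sd = 0.5)

# Manual dummies, built the same way as in the report
for (k in 2:6) sim[[paste0("edu", k)]] <- ifelse(sim$edu == k, 1, 0)
m_dummies <- lm(logw ~ edu2 + edu3 + edu4 + edu5 + edu6 + exp + I(exp^2) + woman,
                data = sim)

# Same model with edu declared as a factor (level 1 is the reference category)
m_factor <- lm(logw ~ factor(edu) + exp + I(exp^2) + woman, data = sim)

# Largest absolute difference between the two coefficient vectors
max_diff <- max(abs(unname(coef(m_dummies)) - unname(coef(m_factor))))
print(max_diff)
```

The factor form avoids maintaining six separate columns and makes the omitted category explicit through the factor's first level.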
Coefficients:
B0 (Intercept): For the baseline group (men with no potential experience and incomplete primary education), the predicted hourly wage is exp(5.6632) = 288.06 Argentine pesos. Because the dependent variable is in logs, this is strictly a geometric mean rather than an arithmetic mean.
B1 (edu2): The hourly wage of individuals who completed primary education (NIVEL_ED=2) is on average 27.71% higher than that of the baseline group, all else constant. The percentage is computed as 100 x (exp(0.2446) - 1), and the same conversion is used for the other dummy variables. It is not statistically significant at the 0.05 level (p = 0.252).
B2 (edu3): The hourly wage of individuals with incomplete secondary education (NIVEL_ED=3) is on average 66.95% higher than that of the baseline group, all else constant. It is statistically significant at the 0.05 level (p = 0.014).
B3 (edu4): The hourly wage of individuals who completed secondary education (NIVEL_ED=4) is on average 76.39% higher than that of the baseline group, all else constant. It is statistically significant at the 0.05 level (p = 0.006).
B4 (edu5): The hourly wage of individuals with incomplete university-level education (NIVEL_ED=5) is on average 126.02% higher than that of the baseline group, all else constant. It is statistically significant at the 0.05 level (p < 0.001).
B5 (edu6): The hourly wage of individuals who completed university-level education (NIVEL_ED=6) is on average 221.02% higher than that of the baseline group, all else constant. It is statistically significant at the 0.05 level (p < 0.001).
B6 (exp): Because experience also enters squared, the return to one more year of potential experience is B6 + 2 x B7 x exp. For someone with no experience, an additional year raises the hourly wage by about 3.47% on average, all else constant. It is statistically significant at the 0.05 level (p < 0.001).
B7 (exp^2): The coefficient on squared experience is negative (-0.0005263, about -0.05% per unit of exp^2), so the return to experience shrinks as experience accumulates and the wage-experience profile is concave. It is statistically significant at the 0.05 level (p = 0.010).
B8 (woman): Using the exact conversion 100 x (exp(-0.2033) - 1), the hourly wage of women is on average 18.4% lower than that of men, all else constant (the direct logarithmic approximation would give 20.3%). It is statistically significant at the 0.05 level (p = 0.004).
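The percentages in this list can be checked directly from the printed estimates. The sketch below copies the coefficients from the regression output by hand (the vector and variable names are illustrative), applies the exact conversion 100 x (exp(b) - 1) for dummy variables in a log-linear model, and also computes the experience level at which the quadratic profile peaks, -b_exp / (2 x b_exp2).

```r
# Estimates copied by hand from the regression output above
b <- c(edu2 = 0.2445547, edu3 = 0.5125422, edu4 = 0.5675329,
       edu5 = 0.8154589, edu6 = 1.1663373, woman = -0.2033179)
b_exp  <- 0.0347174   # coefficient on exp
b_exp2 <- -0.0005263  # coefficient on I(exp^2)

# Exact percentage effect of each dummy on the hourly wage
pct <- round(100 * (exp(b) - 1), 2)
print(pct)

# Years of potential experience at which the log-wage profile peaks
peak_exp <- -b_exp / (2 * b_exp2)
print(round(peak_exp, 1))
```

Under these estimates the profile peaks at roughly 33 years of potential experience, after which additional experience is associated with lower predicted wages.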
It’s essential to determine whether the model's residuals follow a normal distribution, because normality of the errors is a key assumption required for inference.
To understand the type of distribution of the residuals, we’ll conduct a normality test, specifically the Jarque-Bera test:
We establish hypotheses for the normality of errors:
H0: Errors (\(\mu\)) are normally distributed (\(\mu \sim N(0, \sigma^2)\))
H1: Errors (\(\mu\)) are not normally distributed (\(\mu \not\sim N(0, \sigma^2)\))
We set a significance level of 0.05.
Decision rule: if the p-value associated with the test is less than the significance level -> Reject H0
Test
residuos <- residuals(modelo1)
test_result <- jarque.bera.test(residuos)  # jarque.bera.test() is assumed to come from the tseries package
test_result
##
## Jarque Bera Test
##
## data: residuos
## X-squared = 79.193, df = 2, p-value < 2.2e-16
Decision: Since the p-value is below the chosen significance level (0.05), we reject the null hypothesis. The evidence indicates that the residuals are not normally distributed.
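For reference, the Jarque-Bera statistic combines the sample skewness S and kurtosis K as JB = n/6 x (S^2 + (K - 3)^2 / 4) and is compared with a chi-squared distribution with 2 degrees of freedom, which matches the df = 2 reported above. The following is a minimal hand-rolled sketch of that formula; jb_manual is a hypothetical helper name, and the input is a toy vector rather than the model residuals.

```r
# Hand-rolled Jarque-Bera statistic (illustrative helper, hypothetical name)
jb_manual <- function(x) {
  n  <- length(x)
  d  <- x - mean(x)
  m2 <- mean(d^2)           # second central moment
  S  <- mean(d^3) / m2^1.5  # sample skewness
  K  <- mean(d^4) / m2^2    # sample kurtosis
  stat <- n / 6 * (S^2 + (K - 3)^2 / 4)
  list(statistic = stat,
       p.value   = pchisq(stat, df = 2, lower.tail = FALSE))
}

# Toy input (made-up numbers, not the model residuals)
res <- jb_manual(c(1, 2, 3, 4, 5))
print(res)
```

A large statistic, as in the test output above, arises when the skewness is far from 0, the kurtosis is far from 3, or both.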
Histogram
hist(x = residuos, main = "Residuals Histogram", xlab = "Residuals", ylab = "Frequency")
Visually, the residuals seem to exhibit a slight leftward skew.