Objective of the report

Mincer equations are a standard tool for studying the relationship between education and wages empirically. The following equation is the most basic specification of an income equation.

\(\ln w = \beta_0 + \beta_1\,\text{edu} + \beta_2\,\text{exp} + \beta_3\,\text{exp}^2 + \beta_4\,\text{woman} + \mu\) (1)

where w is the hourly income from the main occupation, edu is the level of education (a factor with six categories), exp is the years of potential experience, exp^2 is the square of the years of experience, woman is a dummy variable equal to 1 for women and 0 otherwise, and \(\mu\) is an error term.
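Because edu is a factor with six categories, the single edu term in Equation (1) stands for a set of dummy variables. A sketch of the expanded form, taking incomplete primary (level 1) as the omitted baseline and numbering the coefficients in the order in which the estimation output lists them (this numbering is an assumption made here for readability):

\[
\ln w = \beta_0 + \sum_{j=2}^{6} \beta_{j-1}\,\text{edu}_j + \beta_6\,\text{exp} + \beta_7\,\text{exp}^2 + \beta_8\,\text{woman} + \mu
\]

Here \(\text{edu}_j\) equals 1 when the person's education level is j and 0 otherwise, so each \(\beta_{j-1}\) measures the log-wage gap between level j and the baseline.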

The aim of this report is to estimate and interpret income equations using information from the Permanent Household Survey (EPH). The EPH is a national program in Argentina for the systematic and continuous production of social indicators carried out by the National Institute of Statistics and Census (INDEC) with the provincial statistical offices (DPE). It aims to survey the sociodemographic and socioeconomic characteristics of the population.

The database eph22.txt contains information extracted from the EPH for the fourth quarter of 2022 for the subsample of individuals aged between 18 and 65 years. It includes the following variables:

Age (CH06).
Gender (CH04).
Level of education (NIVEL_ED).
Total hours worked weekly in the main occupation (PP3E_TOT).
Amount of income from the main occupation (P21).
Highest level attended or completed (CH12).
Did they finish that level? (CH13).
What was the last year approved? (CH14).
Weighting, without correction for non-response, used in all variables except income-related ones (PONDERA).
Weighting for income from the main occupation (PONDIIO).

EPH does not include years of work experience, so potential years of experience in the labor market can be used as a proxy.

Data Import and Preprocessing

We will work with the EPH database from the fourth quarter of 2022, importing and filtering it in the following way:

NA and 99 values in CH14 are set to 0.
Omit NA values from PP3E_TOT.
Include only data from the assigned cluster: Comodoro Rivadavia - Rada Tilly (09).
Exclude observations with zero or negative income (P21) or hours (PP3E_TOT), so that the hourly wage and its logarithm are defined.
Ensure there are no observations of special education in CH12 and CH14 and no non-response in CH13.
Rename “NIVEL_ED” to “edu” for simplification.

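library(dplyr)  # provides %>%, mutate(), filter(), rename()

base <- read.csv("eph22.txt", sep = ";")

base <- base %>%
  # Set NA and 99 values of CH14 to 0
  mutate(CH14 = ifelse(is.na(CH14) | CH14 == 99, 0, CH14)) %>%
  # Note: on a data frame, na.omit() drops rows with an NA in any column;
  # the PP3E_TOT argument does not restrict the check to that column
  na.omit(PP3E_TOT) %>%
  filter(AGLOMERADO == 9,   # Comodoro Rivadavia - Rada Tilly
         P21 > 0,           # positive income, so log(w) is defined
         PP3E_TOT > 0,      # positive hours, so w is defined
         CH12 < 9,          # no special education
         CH13 < 3,          # no non-response
         CH14 != 98) %>%    # no special education
  rename(edu = NIVEL_ED)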
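A small sketch, on hypothetical data, of how two ways of dropping missing hours can differ. Whether the difference matters here depends on whether other columns of eph22.txt contain NA values, which the report does not state; the data frame `toy` and its values are invented for illustration.

```r
# Toy data frame with hypothetical values: one NA in PP3E_TOT (row 2)
# and one NA in an unrelated column, P21 (row 3).
toy <- data.frame(
  PP3E_TOT = c(40, NA, 35, 20),
  P21      = c(150000, 90000, NA, 60000)
)

# na.omit() on a data frame drops every row that has an NA in any column.
all_columns <- na.omit(toy)

# Restricting the check to PP3E_TOT keeps row 3, whose only NA is in P21.
only_hours <- toy[!is.na(toy$PP3E_TOT), ]

nrow(all_columns)  # 2
nrow(only_hours)   # 3
```

In this report the later filter on P21 > 0 also removes rows with missing income, since a comparison with NA is never TRUE, so the two approaches would differ only through NA values in the remaining columns.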

A) Calculation of Variables Required for the Model

To calculate potential experience, we first need to calculate the years a person studied.

Let’s create a variable “anoseduc” as follows: if the person finished (CH13) the highest level attended (CH12), the total years of study required to complete that level are assigned. If they did not finish it, the years approved within the unfinished level (CH14) are added to the years of study accumulated before entering that level.

We create a function to calculate this:

calculate_years_educ <- function(ch12, ch13, ch14) {
  # Years of study by level, indexed by the CH12 code (in our cluster, primary is 6)
  completed_levels_years  <- c(0, 6, 9, 12, 12, 15, 17, 19)  # total years once the level is finished
  incomplete_levels_years <- c(0, 0, 0, 6, 9, 12, 12, 17)    # years accumulated before entering the level

  # Finished level: total years; unfinished level: prior years plus years approved in it
  if (ch13 == 1) {
    years_educ <- completed_levels_years[ch12]
  } else {
    years_educ <- incomplete_levels_years[ch12] + ch14
  }

  return(years_educ)
}

# Apply the function to the base to create the column
base$anoseduc <- mapply(
  calculate_years_educ, 
  base$CH12, 
  base$CH13, 
  base$CH14
)
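As an alternative to applying the function row by row with mapply, the same rule can be written in vectorised form. This is only a sketch: `years_educ_vec` is a hypothetical name, and the three test cases are invented to show the rule at work.

```r
# Vectorised sketch of the same rule (years_educ_vec is a hypothetical name).
years_educ_vec <- function(ch12, ch13, ch14) {
  completed_levels_years  <- c(0, 6, 9, 12, 12, 15, 17, 19)
  incomplete_levels_years <- c(0, 0, 0, 6, 9, 12, 12, 17)
  ifelse(ch13 == 1,
         completed_levels_years[ch12],            # level finished
         incomplete_levels_years[ch12] + ch14)    # level not finished
}

# Hypothetical cases: CH12 = 2 finished, CH12 = 4 unfinished with 3 years
# approved, CH12 = 7 finished.
years_educ_vec(ch12 = c(2, 4, 7), ch13 = c(1, 2, 1), ch14 = c(0, 3, 0))
# 6 9 17
```

On the real data the call would be `years_educ_vec(base$CH12, base$CH13, base$CH14)`, which should give the same column as the mapply version.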

Woman, edu, exp, and w

Continuing with the required variables for the model:

We create ‘woman’ based on CH04 (which takes the value 2 if it is a woman and 1 if it is a man) to make it a dichotomous variable where it takes a value of 1 if it is a woman and 0 if it is a man.
As edu is a categorical (ordinal) variable, we break it down into one dummy variable per level (edu1 to edu6).
We calculate potential experience as follows: age(CH06) - years studied (anoseduc) - 6.
Finally, w is the hourly wage (in Argentine pesos), estimated as the monthly income from the main occupation (P21) divided by four times the weekly hours worked (PP3E_TOT), that is, P21 / (4 × PP3E_TOT).

base$woman<-ifelse(base$CH04==2,1,0)

base$edu1<-ifelse(base$edu==1,1,0)
base$edu2<-ifelse(base$edu==2,1,0)
base$edu3<-ifelse(base$edu==3,1,0)
base$edu4<-ifelse(base$edu==4,1,0)
base$edu5<-ifelse(base$edu==5,1,0)
base$edu6<-ifelse(base$edu==6,1,0)

base <- base %>%
  mutate(
    exp = CH06 - (anoseduc + 6), 
    w = (P21 / (4 * PP3E_TOT)))
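A worked example of the two formulas may help fix ideas. The respondent below is hypothetical; every value is invented for illustration.

```r
# Hypothetical respondent (all values invented for illustration)
age      <- 30      # CH06
anoseduc <- 12      # years of education
p21      <- 200000  # monthly income from the main occupation, in pesos
hours    <- 40      # PP3E_TOT, weekly hours in the main occupation

exp_potential <- age - (anoseduc + 6)  # 30 - 18 = 12 years
w_hourly      <- p21 / (4 * hours)     # 200000 / 160 = 1250 pesos per hour
```

Treating a month as 4 weeks is an approximation, since a month averages about 4.3 weeks. It rescales every wage by the same constant, so in the log specification it shifts only the intercept and leaves the other coefficients unchanged.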

Descriptive Analysis

1- Univariate

Weighted descriptive measures of the sample from the Comodoro Rivadavia - Rada Tilly cluster in the fourth quarter of 2022 of the EPH.

Hourly wages (w).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   36.46  520.83  833.33  973.21 1250.00 5208.33

Potential experience (exp).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   15.00   23.00   23.55   32.00   54.00
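The code behind these summaries is not shown in the report. A minimal sketch of how a weighted mean and a weighted median can be computed in base R, using hypothetical wages and weights; `weighted_quantile` is an illustrative helper written for this example, not a function from the report or from a package.

```r
# Hypothetical hourly wages and expansion weights
wage   <- c(500, 800, 1200, 3000)
weight <- c(100, 300, 200, 50)

# Weighted mean (weighted.mean() is part of base R's stats package)
w_mean <- weighted.mean(wage, weight)

# Weighted quantile: smallest value whose cumulative weight share reaches p
weighted_quantile <- function(x, w, p) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  cum_share <- cumsum(w) / sum(w)
  x[which(cum_share >= p)[1]]
}

w_median <- weighted_quantile(wage, weight, 0.5)  # 800
```

Following the weights used in the histograms, PONDIIO would be the natural weight for w and PONDERA for exp.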

Distribution of w, exp, and edu in the sample

Distribution of hourly wages

ggplot(data = base) +
  aes(x = w, weight = PONDIIO) +  # ggplot2's weighting aesthetic is named `weight`
  geom_histogram(fill = "salmon", color = "black") +
  labs(x = "Hourly wage (w)",
       y = "Frequency",
       title = "Distribution of hourly wages",
       subtitle = "In the Comodoro Rivadavia - Rada Tilly cluster in the fourth quarter of 2022",
       caption = "Source: Own elaboration based on EPH-INDEC")

Distribution of years of potential experience

ggplot(data = base) +
  aes(x = exp, weight = PONDERA) +  # ggplot2's weighting aesthetic is named `weight`
  geom_histogram(fill = "lightblue", color = "black") +
  labs(x = "Years of potential experience",
       y = "Frequency",
       title = "Distribution of years of potential experience",
       subtitle = "In the Comodoro Rivadavia - Rada Tilly cluster in the fourth quarter of 2022",
       caption = "Source: Own elaboration based on EPH-INDEC")

Frequency of the highest level of education attained

base1 <- base %>%
  select(w, edu, woman, PONDERA)

base1$edu <- recode_factor(base1$edu,
                           `1` = "Incomplete Primary",
                           `2` = "Complete Primary",
                           `3` = "Incomplete Secondary",
                           `4` = "Complete Secondary",
                           `5` = "Incomplete University",
                           `6` = "Complete University")

ggplot(data = base1) +
  aes(x = edu, weight = PONDERA) +  # ggplot2's weighting aesthetic is named `weight`
  geom_bar(fill = "lightgreen", color = "black") +  # bar chart of weighted counts per category
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Highest level of education attained",
       y = "Frequency",
       title = "Frequency of the highest level of education attained",
       subtitle = "In the Comodoro Rivadavia - Rada Tilly cluster in the fourth quarter of 2022",
       caption = "Source: Own elaboration based on EPH-INDEC")

Distribution of the highest level of education attained

edu_pie <- wtd.table(base$edu, weights = base$PONDERA)
percentage_edu <- round(edu_pie / sum(edu_pie) * 100, 2)
label_edu <- paste(percentage_edu, "%", sep = "")
pie(edu_pie, main = "Distribution of the highest level of education attained", cex.main = 1.0, labels = label_edu, radius = 0.8, border = "black", col = rainbow(length(edu_pie)))
title(xlab = "Source: Own elaboration based on EPH-INDEC", line = 3, cex.lab = 0.8)
mtext("In the Comodoro Rivadavia - Rada Tilly cluster in the fourth quarter of 2022", side = 3, line = 0.65, cex = 0.8)
legend("topleft", c("Incomplete Primary", "Complete Primary","Incomplete Secondary", "Complete Secondary", "Incomplete University", "Complete University"), cex = 0.7, fill = rainbow(length(edu_pie)))

Distribution of men and women in the sample

gender_pie <- table(base$woman)
percentages_gender = round(gender_pie/sum(gender_pie)*100,2)
label_gender <- paste(percentages_gender,"%",sep="") 
pie(gender_pie, main= "Distribution of men and women in the sample", cex.main=1.2, labels=label_gender, radius=.8, border="black", col=c("red", "blue"))
mtext("In the Comodoro Rivadavia - Rada Tilly cluster in the fourth quarter of 2022", side = 3, line = 0.65, cex = 0.8)
legend("topright",c("Male", "Female"), cex = 0.8, fill=c("red", "blue"))
title(xlab = "Source: Own elaboration based on EPH-INDEC", line = 3, cex.lab = 0.8)

Bivariate Analysis (relationship between pairs of variables)

Boxplot of hourly wage according to the highest level of education attained

ggplot(data = base1, aes(x = edu, y = w)) +
  geom_boxplot(aes(fill = edu, group = edu), color = "black") +
  labs(x = "Highest level of education attained", y = "Hourly wage (w)",
       title = "Boxplot of hourly wage according to highest level of education attained",
       subtitle = "In the Comodoro Rivadavia - Rada Tilly cluster in the fourth quarter of 2022",
       caption = "Source: Own elaboration based on EPH-INDEC") +
  scale_y_continuous(limits = c(0, 5250)) +
  coord_flip() +
  theme_minimal()

Frequency by Education Level and Gender

count <- base %>%
  group_by(edu, woman) %>%
  summarise(count = n()) 

education_names <- c("Incomplete Primary", "Complete Primary", 
                     "Incomplete Secondary", "Complete Secondary", 
                     "Incomplete University", "Complete University")

ggplot(count, aes(x = factor(edu), y = count, fill = factor(woman))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Highest level of education attained", y = "Frequency", fill = "Gender") +
  scale_fill_manual(values = c("lightblue", "pink"), labels = c("Males", "Females")) + 
  ggtitle("Frequency by Education Level and Gender") +
  scale_x_discrete(labels = education_names) +
  labs(subtitle = "In the Comodoro Rivadavia - Rada Tilly cluster in the fourth quarter of 2022") +
  labs(caption = "Source: Own elaboration based on EPH-INDEC") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
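The counts in this chart are unweighted sample frequencies, since n() counts rows. If population-weighted frequencies were wanted instead, one option is to sum PONDERA within each cell. A sketch on hypothetical data, where the data frame `toy` and its values are invented:

```r
# Hypothetical mini-sample with expansion weights
toy <- data.frame(
  edu     = c(1, 1, 2, 2, 2),
  woman   = c(0, 1, 0, 0, 1),
  PONDERA = c(100, 150, 200, 50, 300)
)

# Unweighted: number of sampled people in each edu x woman cell
unweighted <- table(toy$edu, toy$woman)

# Weighted: sum of PONDERA in each cell
weighted <- xtabs(PONDERA ~ edu + woman, data = toy)

unweighted["2", "0"]  # 2 sampled men with edu = 2
weighted["2", "0"]    # 250, the sum of their weights
```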

Boxplot of Hourly Wage by Education Level and Gender

ggplot(base, aes(x = factor(edu), y = w, fill = factor(woman))) +
  geom_boxplot(position = position_dodge(width = 0.8)) +
  labs(x = "Highest level of education attained", y = "Hourly wage (w)", fill = "Gender") +
  scale_fill_manual(values = c("blue", "pink"), labels = c("Males", "Females")) +
  labs(subtitle = "In the Comodoro Rivadavia - Rada Tilly cluster in the fourth quarter of 2022") +
  labs(caption = "Source: Own elaboration based on EPH-INDEC") +
  scale_x_discrete(labels = education_names) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  ggtitle("Boxplot of Hourly Wage by Education Level and Gender")

B) Model Estimation

Let’s estimate Equation (1) mentioned earlier using a linear regression model.

modelo1 <- lm(log(w) ~ edu2+edu3+edu4+edu5+edu6+exp+I(exp^2)+woman, data = base, weights = PONDIIO)
summary(modelo1)

## 
## Call:
## lm(formula = log(w) ~ edu2 + edu3 + edu4 + edu5 + edu6 + exp + 
##     I(exp^2) + woman, data = base, weights = PONDIIO)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -74.243  -5.682  -0.387   5.647  34.238 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.6631833  0.2295795  24.668  < 2e-16 ***
## edu2         0.2445547  0.2133178   1.146 0.252467    
## edu3         0.5125422  0.2064326   2.483 0.013542 *  
## edu4         0.5675329  0.2059295   2.756 0.006185 ** 
## edu5         0.8154589  0.2203202   3.701 0.000252 ***
## edu6         1.1663373  0.2183201   5.342 1.74e-07 ***
## exp          0.0347174  0.0102778   3.378 0.000820 ***
## I(exp^2)    -0.0005263  0.0002026  -2.598 0.009818 ** 
## woman       -0.2033179  0.0706120  -2.879 0.004251 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.06 on 322 degrees of freedom
## Multiple R-squared:  0.2005, Adjusted R-squared:  0.1806 
## F-statistic: 10.09 on 8 and 322 DF,  p-value: 1.418e-12
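The manual dummies edu2 to edu6 could also be generated by R itself by passing the education level as a factor. The sketch below uses simulated data only (not the EPH sample; all names and parameter values are invented) to show that both specifications give the same coefficients.

```r
set.seed(123)  # simulated data only; not the EPH sample
n <- 300
sim <- data.frame(
  edu   = sample(1:6, n, replace = TRUE),
  exper = runif(n, 0, 40),
  woman = rbinom(n, 1, 0.5)
)
sim$logw <- 5.5 + 0.2 * sim$edu + 0.03 * sim$exper -
  0.0005 * sim$exper^2 - 0.2 * sim$woman + rnorm(n, sd = 0.5)

# Manual dummies, as in the report (level 1 is the omitted baseline)
for (k in 2:6) sim[[paste0("edu", k)]] <- as.numeric(sim$edu == k)

fit_dummies <- lm(logw ~ edu2 + edu3 + edu4 + edu5 + edu6 +
                    exper + I(exper^2) + woman, data = sim)

# Same model, letting R build the dummies from a factor
fit_factor <- lm(logw ~ factor(edu) + exper + I(exper^2) + woman, data = sim)

same_coefs <- all.equal(unname(coef(fit_dummies)), unname(coef(fit_factor)))
```

With a factor, the first level is dropped as the reference category by default, which matches the report's choice of incomplete primary as the baseline.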

Interpretation of Results

Coefficients:

B0 (Intercept): The predicted hourly wage of the baseline group (men with no potential experience and incomplete primary education) is approximately exp(5.6632) = 288.06 Argentine pesos.

B1 (edu2): The hourly wage of individuals who completed primary education (NIVEL_ED=2) is on average 27.71% higher than that of the baseline group, who did not complete it (with all else constant). This and the following percentages are computed as 100 × (exp(coefficient) − 1). It is not statistically significant at the 0.05 level (p = 0.252).

B2 (edu3): The hourly wage of individuals with incomplete secondary education (NIVEL_ED=3) is on average 66.95% higher than that of the baseline group (with all else constant). It is statistically significant at the 0.05 level (p = 0.014).

B3 (edu4): The hourly wage of individuals who completed secondary education (NIVEL_ED=4) is on average 76.39% higher than that of the baseline group (with all else constant). It is statistically significant at the 0.05 level (p = 0.006).

B4 (edu5): The hourly wage of individuals with incomplete university-level education (NIVEL_ED=5) is on average 126.02% higher than that of the baseline group (with all else constant). It is statistically significant at the 0.05 level (p < 0.001).

B5 (edu6): The hourly wage of individuals who completed university-level education (NIVEL_ED=6) is on average 221.02% higher than that of the baseline group (with all else constant). It is statistically significant at the 0.05 level (p < 0.001).

B6 (exp): Because the model also includes exp^2, the effect of one more year of potential experience depends on the level of experience: it is approximately 100 × (0.0347 − 2 × 0.000526 × exp) percent. At zero experience, the marginal effect is about 3.47% per year (with all else constant). It is statistically significant at the 0.05 level (p < 0.001).

B7 (exp^2): The negative coefficient (−0.000526) indicates that returns to experience diminish as experience accumulates: each additional year of experience lowers the marginal effect of experience by about 0.105 percentage points (2 × 0.000526). The estimated wage-experience profile is concave and peaks at about 0.0347 / (2 × 0.000526) ≈ 33 years. It is statistically significant at the 0.05 level (p = 0.010).

B8 (woman): Using the exact transformation 100 × (exp(−0.2033) − 1) and keeping other factors constant, the hourly wage of women is on average 18.4% lower than that of men (the logarithmic approximation, 100 × coefficient, would give 20.3%). It is statistically significant at the 0.05 level (p = 0.004).
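The percentages in this interpretation can be reproduced from the coefficients in the regression output. The sketch below copies the estimates by hand; the variable names are chosen for this example only.

```r
# Coefficients copied from the regression output above
b <- c(edu2 = 0.2445547, edu3 = 0.5125422, edu4 = 0.5675329,
       edu5 = 0.8154589, edu6 = 1.1663373, woman = -0.2033179)
b_exp  <- 0.0347174
b_exp2 <- -0.0005263

# Exact percentage effect of a dummy variable in a log-linear model
pct_effect <- 100 * (exp(b) - 1)
round(pct_effect, 2)

# Marginal effect of experience (in %) at selected years of experience
years <- c(0, 10, 20, 30)
marginal_pct <- 100 * (b_exp + 2 * b_exp2 * years)

# Years of experience at which the estimated profile peaks
turning_point <- -b_exp / (2 * b_exp2)
```

With the fitted model at hand, `coef(modelo1)` would supply the same numbers without copying them.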

C) What type of distribution do the residuals of the proposed model follow? Why is it important to consider this?

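It’s essential to determine whether the residuals follow a normal distribution because normality of the errors is the assumption that justifies exact t and F inference in finite samples. If it does not hold, inference relies on large-sample (asymptotic) approximations.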

To understand the type of distribution of the residuals, we’ll conduct a normality test, specifically the Jarque-Bera test:

We establish hypotheses for the normality of errors:

H0: Errors (\(\mu\)) are normally distributed (\(\mu \sim N(0, \sigma^2)\))

H1: Errors (\(\mu\)) are not normally distributed (\(\mu \not\sim N(0, \sigma^2)\))

We set a significance level of 0.05

Decision rule: if the p-value associated with the test is less than the significance level -> Reject H0

Test

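library(tseries)  # assumed source of jarque.bera.test(); the output format matches this package

residuos <- residuals(modelo1)

test_result <- jarque.bera.test(residuos)
test_result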

## 
##  Jarque Bera Test
## 
## data:  residuos
## X-squared = 79.193, df = 2, p-value < 2.2e-16

Decision: As the p-value is less than the chosen significance level (0.05), we reject the null hypothesis and conclude that the residuals are not normally distributed.
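The Jarque-Bera statistic combines the sample skewness S and kurtosis K as JB = n/6 × (S² + (K − 3)²/4), which under normality is asymptotically chi-squared with 2 degrees of freedom. A minimal sketch of the computation by hand; `jb_by_hand` is a hypothetical helper written for this example, and the input vector is invented.

```r
jb_by_hand <- function(x) {
  n    <- length(x)
  d    <- x - mean(x)
  m2   <- mean(d^2)             # second central moment
  skew <- mean(d^3) / m2^1.5    # sample skewness
  kurt <- mean(d^4) / m2^2      # sample kurtosis (3 under normality)
  stat <- n / 6 * (skew^2 + (kurt - 3)^2 / 4)
  p    <- pchisq(stat, df = 2, lower.tail = FALSE)
  list(statistic = stat, skewness = skew, kurtosis = kurt, p.value = p)
}

# Hypothetical, perfectly symmetric sample: skewness is 0,
# so only the kurtosis term contributes to the statistic.
res <- jb_by_hand(c(-2, -1, 0, 1, 2))
res$statistic
```

Applied to the model residuals, `jb_by_hand(residuos)` should be comparable with the statistic reported by jarque.bera.test, and the skewness component gives a numeric counterpart to the visual impression from the histogram.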

Histogram

hist(x = residuos, main = "Residuals Histogram", xlab = "Residuals", ylab = "Frequency")

Visually, the residuals seem to exhibit a slight leftward skew.

Report on the ‘Comodoro Rivadavia - Rada Tilly’ cluster in the fourth quarter of 2022

Luis Francisco Fernández

2023