Q2.1 - Explore the presence of skill downgrading by estimating a wage
regression that includes education as a categorical variable and
education interacted with age.
library(haven)
# plyr is loaded before the tidyverse so that it does not mask dplyr verbs such as mutate()
library(plyr)
library(tidyverse)
library(gridExtra)
library(dplyr)
library(stargazer)
library(ggplot2)
library(modelr)
library(readr)
library(sandwich)
library(ipumsr)
library(packrat)
library(rsconnect)
First, we load the data and keep only the relevant variables.
print(getwd())
## [1] "C:/Users/33662/Desktop"
read_dta(paste0(getwd(),"/census_acs_2000.dta")) %>%
select(year, age, labforce, wkswork2, incwage, classwkr, educd, race,
marst, empstat, bpld, yrimmig, sex, serial, occ) -> acs2000
Next, we create a dummy variable called foreign. We also create the
variable immclass, which separates immigrants by years since arrival;
the variable schooling, which assigns years of schooling based on
educational attainment; and the variable pe, the potential experience.
Finally, weeks converts the interval code wkswork2 into weeks worked
(interval midpoints) and lnw is the log weekly wage.
acs2000 %>%
  mutate(foreign = (bpld >= 15000),
         immclass = case_when(
           foreign == 1 & year - yrimmig <= 2 ~ 1,
           foreign == 1 & year - yrimmig > 2 & year - yrimmig <= 5 ~ 2,
           foreign == 1 & year - yrimmig > 5 & year - yrimmig <= 10 ~ 3,
           foreign == 1 & year - yrimmig > 10 & !is.na(year - yrimmig) ~ 4)) %>%
  mutate(
    schooling = ifelse(educd > 113, 25,
                ifelse(educd > 100 & educd < 114, 17.2,
                ifelse(educd > 61 & educd < 101, 15.5,
                ifelse(educd < 62, 8, 2))))) %>%
  mutate(pe = age - 6 - schooling,
         weeks = 7 * (wkswork2 == 1) + 20 * (wkswork2 == 2) +
           33 * (wkswork2 == 3) + 43.5 * (wkswork2 == 4) +
           48.5 * (wkswork2 == 5) + 51 * (wkswork2 == 6),
         lnw = log(incwage / weeks)) -> acs2000
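To make the construction of immclass, weeks and lnw concrete, here is a minimal sketch on a made-up data frame. The values in toy and the helper column years_in are assumptions for illustration only; they are not part of the census extract.

```r
library(dplyr)

# Toy data: all values are made up for illustration only.
toy <- data.frame(
  year     = 2000,
  bpld     = c(100, 20000, 20000, 20000, 20000),  # first row is a native
  yrimmig  = c(0, 1999, 1996, 1992, 1980),
  wkswork2 = c(6, 1, 3, 5, 6),
  incwage  = c(40000, 3000, 15000, 30000, 50000)
)

toy <- toy %>%
  mutate(
    foreign  = bpld >= 15000,
    years_in = year - yrimmig,          # hypothetical helper column
    immclass = case_when(               # the first matching condition wins
      foreign & years_in <= 2    ~ 1,
      foreign & years_in <= 5    ~ 2,
      foreign & years_in <= 10   ~ 3,
      foreign & !is.na(years_in) ~ 4    # natives match nothing and get NA
    ),
    # interval code -> midpoint of the weeks-worked interval
    weeks = 7 * (wkswork2 == 1) + 20 * (wkswork2 == 2) +
      33 * (wkswork2 == 3) + 43.5 * (wkswork2 == 4) +
      48.5 * (wkswork2 == 5) + 51 * (wkswork2 == 6),
    lnw = log(incwage / weeks)          # log weekly wage
  )

print(toy[, c("foreign", "immclass", "weeks", "lnw")])
```

Because case_when() stops at the first condition that holds, the upper bounds alone are enough to define the four arrival cohorts in this sketch.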
We keep only individuals who are in the labor force (labforce == 2),
who work for wages and are not self-employed (classwkr == 2), and who
are between 18 and 65 years old. We also require potential experience
between 1 and 40 years and a finite, positive log weekly wage.
acs2000 %>%
filter(classwkr == 2, pe >= 1 & pe <= 40, !is.na(lnw), lnw > 0 & lnw < Inf,
age >= 18 & age <= 65, labforce == 2) -> acs2000
For the regression, we include educd as a categorical variable using
the factor() function, and we interact age with educd. Note that the
interaction age*educd uses the numeric educd code. To observe skill
downgrading, we estimate the effect of the variable foreign on the log
wage separately for each of the 4 immigrant groups (immclass), each
compared with natives.
modf <- lnw ~ foreign + age * educd + factor(educd)
acs2000$in_reg1 <- (acs2000$immclass ==1 | acs2000$foreign == 0)
sum(acs2000$in_reg1)
## [1] 121919
mod1 <- lm(modf, data=acs2000, subset=in_reg1)
mean(acs2000$lnw[acs2000$immclass == 1], na.rm=TRUE) -
mean(acs2000$lnw[acs2000$foreign == 0], na.rm=TRUE)
## [1] -0.3244095
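The formula modf enters educd both as a numeric code (through age*educd) and as a factor. The numeric main effect is then a linear combination of the intercept and the factor dummies, so lm() reports one aliased (NA) coefficient. The sketch below shows this on the built-in mtcars data, with cyl standing in for educd and wt for age. The second model, with a fully categorical interaction, is only one possible alternative and an assumption about the intent, not the specification used in this document.

```r
# Same structure as modf: numeric interaction plus the factor version
fit <- lm(mpg ~ wt * cyl + factor(cyl), data = mtcars)

# Coefficients that lm() could not estimate because of collinearity
aliased <- names(coef(fit))[is.na(coef(fit))]
print(aliased)

# Hypothetical alternative: interact the slope with the factor itself
fit_cat <- lm(mpg ~ wt * factor(cyl), data = mtcars)
print(coef(fit_cat))
```

The coefficient on the other regressors is still estimated in the first form; the aliased dummy is simply dropped.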
acs2000$in_reg2 <- (acs2000$immclass == 2 | acs2000$foreign == 0)
sum(acs2000$in_reg2)
## [1] 122189
mod2 <- lm(modf, data=acs2000, subset=in_reg2)
acs2000$in_reg3 <- (acs2000$immclass == 3 | acs2000$foreign == 0)
sum(acs2000$in_reg3)
## [1] 123280
mod3 <- lm(modf, data=acs2000, subset=in_reg3)
acs2000$in_reg4 <- (acs2000$immclass == 4 | acs2000$foreign == 0)
sum(acs2000$in_reg4)
## [1] 131090
mod4 <- lm(modf, data=acs2000, subset=in_reg4)
models = list(mod1, mod2, mod3, mod4)
stargazer(models,
title = "Log wage regressions: natives vs. immigrant groups",
type="text",
keep="foreign")
##
## Log wage regressions: natives vs. immigrant groups
## ==================================================================================================================================================
##                     Dependent variable:
##                     ------------------------------------------------------------------------------------------------------------------------------
##                     lnw
##                     (1)                             (2)                             (3)                             (4)
## --------------------------------------------------------------------------------------------------------------------------------------------------
## foreign             -0.156***                       -0.160***                       -0.103***                       -0.002
##                     (0.021)                         (0.019)                         (0.015)                         (0.008)
##
## --------------------------------------------------------------------------------------------------------------------------------------------------
## Observations        121,919                         122,189                         123,280                         131,090
## R2                  0.187                           0.187                           0.187                           0.188
## Adjusted R2         0.187                           0.187                           0.187                           0.188
## Residual Std. Error 0.734 (df = 121900)             0.734 (df = 122170)             0.734 (df = 123261)             0.733 (df = 131071)
## F Statistic         1,556.525*** (df = 18; 121900)  1,562.084*** (df = 18; 122170)  1,577.890*** (df = 18; 123261)  1,683.805*** (df = 18; 131071)
## ==================================================================================================================================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
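The table shows that downgrading is most severe for recent immigrants
and fades with time since arrival. Conditional on age and education,
immigrants who arrived within the last 5 years earn about 0.16 log
points less than natives (columns 1 and 2), those who arrived 6 to 10
years ago earn about 0.10 log points less (column 3), and the gap for
those who arrived more than 10 years ago is close to zero and not
statistically significant (column 4). One interpretation is that, after
a few years on the labor market, immigrants upgrade their skills and
acquire new ones.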
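The coefficients on foreign are differences in log wages. As a rough guide, they can be converted into percentage wage gaps with 100 * (exp(b) - 1). The sketch below copies the four estimates from the table; the cohort labels are shorthand for the immclass groups.

```r
# Coefficients on foreign, copied from columns (1) to (4) of the table
b <- c("<=2 years"  = -0.156,
       "3-5 years"  = -0.160,
       "6-10 years" = -0.103,
       ">10 years"  = -0.002)

# Exact percentage gap implied by a log-point difference
pct <- 100 * (exp(b) - 1)
print(round(pct, 1))
# <=2 years: -14.4, 3-5 years: -14.8, 6-10 years: -9.8, >10 years: -0.2
```

For small coefficients the log-point value and the percentage are nearly the same; the difference only becomes visible for the larger gaps.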
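Now we predict log wages using separate models by gender. The aim is
to estimate them on natives; note that the subsets below are defined
with the | (or) operator, so each one contains all natives plus the
immigrants of the given sex.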
acs2000$in_reg5 <- (acs2000$foreign == 0 | acs2000$sex == 1)
sum(acs2000$in_reg5)
## [1] 129629
mod5 <- lm(modf, data = acs2000, subset = in_reg5)
acs2000$in_reg6 <- (acs2000$foreign == 0 | acs2000$sex == 2)
sum(acs2000$in_reg6)
## [1] 127555
mod6 <- lm(modf, data = acs2000, subset = in_reg6)
# predict.lm() has no data or subset arguments (they are silently ignored),
# so these are the fitted values on each model's estimation sample.
pred_male_native <- predict(mod5)
pred_female_native <- predict(mod6)
print("The predicted average wage for native males is : ")
## [1] "The predicted average wage for native males is : "
mean(pred_male_native)
## [1] 6.35432
print("The predicted average wage for native females is : ")
## [1] "The predicted average wage for native females is : "
mean(pred_female_native)
## [1] 6.337697
Here the average predicted log wage is slightly higher in the male
sample (6.354) than in the female sample (6.338), which suggests that
women earn less than men. The gap is small because the two estimation
samples overlap: both contain all natives. Sex is nevertheless an
important variable to take into account when studying wages.
After this first step, we estimate the models by gender on the whole
sample (natives and immigrants) and compute the predictions for
everyone.
acs2000$in_reg7 <- (acs2000$sex == 1)
sum(acs2000$in_reg7)
## [1] 70800
mod7 <- lm(modf, data = acs2000, subset = in_reg7)
acs2000$in_reg8 <- (acs2000$sex == 2)
sum(acs2000$in_reg8)
## [1] 65737
mod8 <- lm(modf, data = acs2000, subset = in_reg8)
# predict() without newdata returns the fitted values on the estimation sample.
pred_male <- predict(mod7)
pred_female <- predict(mod8)
print("The average predicted wage for males is :")
## [1] "The average predicted wage for males is :"
mean(pred_male)
## [1] 6.542091
print("The average predicted wage for females is :")
## [1] "The average predicted wage for females is :"
mean(pred_female)
## [1] 6.122443
mod9 <- lm(modf, data = acs2000)
lnw_pred <- predict(mod9) # fitted values on the whole sample
print("The average predicted wage on the whole sample is : ")
## [1] "The average predicted wage on the whole sample is : "
mean(lnw_pred)
## [1] 6.340048
Here, the average predicted log wage for women (natives and
immigrants), 6.122, is lower than the one predicted for men (natives
and immigrants), 6.542, a gap of about 0.42 log points. Moreover, the
average log wage predicted for the whole sample (6.340) is lower than
the one from the native male regression (6.354), which may be because
both male and female immigrants are taken into account in this
sample.
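Two properties of predict() for lm objects matter when reading these averages. New observations are supplied through newdata; an argument named data is silently ignored, and the fitted values are returned. In addition, for a model with an intercept the mean of the fitted values equals the mean of the outcome in the estimation sample. A minimal sketch on the built-in mtcars data (the model and the subset heavy are assumptions for illustration):

```r
fit <- lm(mpg ~ wt, data = mtcars)
heavy <- subset(mtcars, wt > 3.5)   # hypothetical subsample to predict on

# `data` is not an argument of predict.lm(): it is ignored,
# and the fitted values for all rows of mtcars are returned.
p_ignored <- predict(fit, data = heavy)

# `newdata` is the argument that predicts on other observations.
p_new <- predict(fit, newdata = heavy)

print(c(ignored = length(p_ignored), new = length(p_new)))

# With an intercept, the fitted values average to the sample mean of the outcome.
print(all.equal(mean(fitted(fit)), mean(mtcars$mpg)))
```

Applying a model estimated on one group to another group therefore requires newdata with the other group's rows.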
Then we compute the standardized residuals of the four gender-specific
models. Because of the presence of heteroscedasticity in the model, the
standard errors are computed further below from a robust covariance
matrix: we extract the estimated variances from its diagonal (using
diag) and take their square root.
standard_res1 <- rstandard(mod5)
standard_res2 <- rstandard(mod6)
standard_res3 <- rstandard(mod7)
standard_res4 <- rstandard(mod8)
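Sigma2 is the variance of the residuals, and the standardized residuals
can be used to check the normality of the residuals. The HC1 option
gives the White (heteroskedasticity-robust) standard errors with a
degrees-of-freedom correction. We do the imputation separately for
gender, education and age groups, so the clustered standard errors are
computed for each grouping in turn.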
# Conventional standard errors; computed once, since they do not depend on the clustering
nonrobust.se <- sqrt(diag(vcov(mod9)))
# HC1 standard errors clustered by gender, education and age
gender_noise <- sqrt(diag(vcovCL(mod9, cluster = ~ sex, type = "HC1")))
education_noise <- sqrt(diag(vcovCL(mod9, cluster = ~ educd, type = "HC1")))
age_noise <- sqrt(diag(vcovCL(mod9, cluster = ~ age, type = "HC1")))
robust.se <- age_noise
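To show what the HC1 option computes, the sketch below builds the White covariance matrix by hand, V = n/(n-k) (X'X)^-1 X' diag(e^2) X (X'X)^-1, and compares it with vcovHC() from sandwich. It uses the built-in mtcars data, and the model is an assumption for illustration. The last lines call vcovCL() with a formula cluster, as in the code above; keep in mind that clustered standard errors are generally unreliable when there are very few clusters.

```r
library(sandwich)

fit <- lm(mpg ~ wt + hp, data = mtcars)

X <- model.matrix(fit)   # n x k design matrix
e <- residuals(fit)      # OLS residuals
n <- nrow(X)
k <- ncol(X)

bread <- solve(crossprod(X))   # (X'X)^-1
meat  <- crossprod(X * e)      # X' diag(e^2) X: row i of X is scaled by e[i]
V_hc1 <- n / (n - k) * bread %*% meat %*% bread

se_manual <- sqrt(diag(V_hc1))
se_pkg    <- sqrt(diag(vcovHC(fit, type = "HC1")))
print(all.equal(unname(se_manual), unname(se_pkg)))

# Cluster-robust version with the same small-sample type, clustered on cyl
se_cluster <- sqrt(diag(vcovCL(fit, cluster = ~ cyl, type = "HC1")))
print(se_cluster)
```

The factor n/(n-k) is the only difference between HC1 and the uncorrected White estimator (HC0).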