Question 1 - The authors discuss several approaches to estimating the labour market impact of immigrants on natives. Briefly compare these. Then explain briefly the observed skill downgrading by recent immigrants, and how this should affect the estimates.

The authors discuss three main approaches to estimating the labor market impact of immigrants on natives: the national skill-cell approach, the pure spatial approach, and the mixture approach.
Each builds on earlier work: Borjas for the national skill-cell approach, Altonji and Card for the pure spatial approach, and Card for the mixture approach.
First, they do not measure immigrant inflows at the same level. The national skill-cell approach exploits variation across education-experience cells at the national level, the pure spatial approach exploits variation in the total immigrant inflow across regions, and the mixture approach exploits variation across both regions and education groups.
As a result, the national skill-cell and mixture approaches identify relative wage effects, whereas the pure spatial approach identifies the total wage effect of immigrant inflows.
Moreover, the three models lead to different results and different interpretations of the effect of immigration on native wages.
The national skill-cell approach tends to produce more negative wage effects for natives than the mixture approach, while the pure spatial approach yields estimates that vary widely depending on which skill group is studied.
Although these results differ or even appear contradictory, they cannot be compared directly, because they answer different questions: they rely on different samples and different sources of variation.
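To make the comparison concrete, the three approaches can be written in stylized form. The notation below is an assumption made for illustration and is not the authors' exact specification: $e$ indexes education groups, $a$ experience groups, $r$ regions, $\Delta p$ is the immigrant inflow relative to the relevant labor force, and the $\delta$ terms are fixed effects.

```latex
\begin{align*}
\text{National skill-cell:} \quad & \Delta \log w_{ea} = \theta^{\text{skill}}\,\Delta p_{ea} + \delta_e + \delta_a + \varepsilon_{ea} \\
\text{Mixture:} \quad & \Delta \log w_{er} = \theta^{\text{mix}}\,\Delta p_{er} + \delta_e + \delta_r + \varepsilon_{er} \\
\text{Pure spatial:} \quad & \Delta \log w_{er} = \theta^{\text{spatial}}_{e}\,\Delta p_{r} + \delta_e + \varepsilon_{er}
\end{align*}
```

In the first two equations the fixed effects absorb any effect common to all cells, which is why only relative effects are identified. In the third, the regressor is the total inflow into region $r$, so each group-specific coefficient captures a total effect.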
A major issue in estimating the impact of immigration on native wages is downgrading. According to Dustmann et al., downgrading occurs when the position of immigrants in the labor market is lower than that of natives with the same observed education and experience. Immigrants thus obtain lower returns to the same measured skills than their native counterparts, because they acquired those skills in their country of origin.
Downgrading can bias estimates of the impact of immigration on native wages. In particular, estimates based on the national skill-cell and mixture approaches are biased, because both assign immigrants to skill cells according to their observed characteristics.
In addition, the downgrading bias across education cells is small compared to that across experience cells, which explains why the mixture approach produces less negative wage effects than the national skill-cell approach. The different cells used by the two approaches (education-experience cells versus education cells) thus lead to more or less biased estimates.
The pure spatial approach seems less subject to this bias, as there is no need to assign immigrants to skill cells (this model studies a total effect and not a relative one). It therefore produces less biased estimates than the approaches that identify relative effects.
Downgrading can lead to understating the wage losses of native workers even if the model is well specified. In fact, downgrading leads one to assume that immigrants and natives are perfect substitutes within cells, even though they are not.
That said, the crucial parameter for the predicted total wage effect of immigration is the estimated elasticity of substitution between immigrants and natives within skill cells.
For example, Ottaviano and Peri suggest that immigrants and natives are imperfect substitutes within skill cells, especially among the low-skilled. When there is an inflow of low-skilled immigrants, incumbent low-skilled immigrants bear most of the impact, while the wages of both high-skilled and low-skilled natives may increase.

Question 2 (Replication Tasks)

Q2.1 - Explore the presence of skill downgrading by estimating wage regression including education as a categorical variable and education interacted with age.

library (haven)
library(tidyverse)
library(gridExtra)
library(dplyr)
library(stargazer)
library(ggplot2)
library(plyr)
library(modelr)
library(readr)
library(sandwich)
library(ipumsr)
library(packrat)
library(rsconnect)
First, we select the data in order to keep only the relevant variables.
print(getwd())
## [1] "C:/Users/33662/Desktop"
read_dta(paste0(getwd(),"/census_acs_2000.dta")) %>%
  select(year, age, labforce, wkswork2, incwage, classwkr, educd, race,
         marst, empstat, bpld, yrimmig, sex, serial, occ) -> acs2000
Now we create a dummy variable called foreign. We also create the variable immclass, which separates immigrants by the number of years since their arrival; the variable schooling, which assigns years of schooling based on educational attainment; and the variable pe, corresponding to potential experience.
acs2000 %>%
  mutate(foreign = (bpld>=15000),
         immclass = case_when(foreign==1 & year-yrimmig<=2 ~ 1,
                              foreign==1 & year-yrimmig> 2 & year-yrimmig<=5 ~ 2,
                              foreign==1 & year-yrimmig> 5 & year-yrimmig<=10 ~ 3,
                              foreign==1 & year-yrimmig> 10 & !is.na(year-yrimmig) ~ 4)) %>%
  mutate(
    schooling = ifelse(educd > 113, 25, 
                       ifelse(educd > 100 & educd < 114, 17.2, 
                              ifelse(educd > 61 & educd < 101, 15.5, 
                                     ifelse(educd < 62, 8, 2)))))%>%
  mutate(pe = age - 6 - schooling,
         weeks = 7 * (wkswork2 == 1) + 20 * (wkswork2 == 2) +
           33 * (wkswork2 == 3) + 43.5 * (wkswork2 == 4) +
           48.5 * (wkswork2 == 5) + 51 * (wkswork2 == 6) ,
         lnw = log(incwage / weeks) ) -> acs2000
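As a quick check of the arrival-cohort cutoffs used above, the following sketch applies the same intervals to a handful of made-up rows. It uses base R `cut()` instead of `case_when()`, and the data frame `toy` is purely illustrative.

```r
# Toy data: census year and year of immigration (values are made up)
toy <- data.frame(year    = 2000,
                  yrimmig = c(1999, 1996, 1992, 1980, NA),
                  foreign = c(TRUE, TRUE, TRUE, TRUE, FALSE))

years_since <- toy$year - toy$yrimmig

# Right-closed intervals (-Inf,2], (2,5], (5,10], (10,Inf] mirror the cutoffs above
toy$immclass <- ifelse(toy$foreign,
                       as.integer(cut(years_since, breaks = c(-Inf, 2, 5, 10, Inf))),
                       NA_integer_)
print(toy)
```

Natives keep a missing immclass, as in the original coding.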
We keep only those who are in the labor force (labforce == 2), who work for wages and are not self-employed (classwkr == 2), who are between 18 and 65 years old, and who have between 1 and 40 years of potential experience.
acs2000 %>%
  filter(classwkr == 2, pe >= 1 & pe <= 40, !is.na(lnw), lnw > 0 & lnw < Inf,
         age >= 18 & age <= 65, labforce == 2) -> acs2000
To run the regression, education (educd) enters as a categorical variable through the factor() function. To observe skill downgrading, we estimate the effect of the variable foreign on the wage separately for each group of immigrants (the 4 immclass categories), each compared with natives.
modf <- lnw ~ foreign +age*educd + factor(educd)

acs2000$in_reg1 <- (acs2000$immclass ==1  | acs2000$foreign == 0)
sum(acs2000$in_reg1) 
## [1] 121919
mod1 <- lm(modf, data=acs2000, subset=in_reg1)

mean(acs2000$lnw[acs2000$immclass == 1], na.rm=TRUE) -
  mean(acs2000$lnw[acs2000$foreign == 0], na.rm=TRUE)
## [1] -0.3244095
acs2000$in_reg2 <- (acs2000$immclass == 2 | acs2000$foreign == 0)
sum(acs2000$in_reg2) 
## [1] 122189
mod2 <- lm(modf, data=acs2000, subset=in_reg2)


acs2000$in_reg3 <- (acs2000$immclass == 3 | acs2000$foreign == 0)
sum(acs2000$in_reg3) 
## [1] 123280
mod3 <- lm(modf, data=acs2000, subset=in_reg3)


acs2000$in_reg4 <- (acs2000$immclass == 4 | acs2000$foreign == 0)
sum(acs2000$in_reg4) 
## [1] 131090
mod4 <- lm(modf, data=acs2000, subset=in_reg4)

models = list(mod1, mod2, mod3, mod4)
stargazer(models,
          title = "log wage regressions : natives vs immigrant groups",
          type="text",
          keep="foreign")
## 
## log wage regressions : natives vs immigrant groups
## ===============================================================================================================================================
##                                                                         Dependent variable:                                                    
##                     ---------------------------------------------------------------------------------------------------------------------------
##                                                                                 lnw                                                            
##                                  (1)                            (2)                            (3)                            (4)              
## -----------------------------------------------------------------------------------------------------------------------------------------------
## foreign                       -0.156***                      -0.160***                      -0.103***                        -0.002            
##                                (0.021)                        (0.019)                        (0.015)                        (0.008)            
##                                                                                                                                                
## -----------------------------------------------------------------------------------------------------------------------------------------------
## Observations                   121,919                        122,189                        123,280                        131,090            
## R2                              0.187                          0.187                          0.187                          0.188             
## Adjusted R2                     0.187                          0.187                          0.187                          0.188             
## Residual Std. Error      0.734 (df = 121900)            0.734 (df = 122170)            0.734 (df = 123261)            0.733 (df = 131071)      
## F Statistic         1,556.525*** (df = 18; 121900) 1,562.084*** (df = 18; 122170) 1,577.890*** (df = 18; 123261) 1,683.805*** (df = 18; 131071)
## ===============================================================================================================================================
## Note:                                                                                                               *p<0.1; **p<0.05; ***p<0.01
This table shows that downgrading is most severe for recent immigrants (coefficients of -0.156 and -0.160 for those who arrived within the last 2 years and 3 to 5 years ago) and diminishes for immigrants who have been in the labor market longer (-0.103 after 6 to 10 years, and an insignificant -0.002 after more than 10 years). After a few years in the labor market, immigrants tend to upgrade their skills and acquire new ones.
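The task asks for education as a categorical variable interacted with age. A minimal sketch of that specification on simulated data follows. The variable names, the three education levels, and the built-in penalty of -0.15 are all assumptions made for illustration, and this is not the model estimated above.

```r
set.seed(1)
n <- 500
sim <- data.frame(
  age     = sample(18:65, n, replace = TRUE),
  educ    = factor(sample(c("low", "mid", "high"), n, replace = TRUE),
                   levels = c("low", "mid", "high")),
  foreign = rbinom(n, 1, 0.2)
)
# Simulated log wage with a built-in immigrant penalty of -0.15
sim$lnw <- 5 + 0.01 * sim$age + 0.3 * (sim$educ == "mid") +
  0.6 * (sim$educ == "high") - 0.15 * sim$foreign + rnorm(n, sd = 0.5)

# age * educ expands to age + educ + age:educ, with educ entering as dummies
fit <- lm(lnw ~ foreign + age * educ, data = sim)
coef(fit)["foreign"]
```

With a factor, each education level gets its own intercept shift and its own age slope, so the coefficient on foreign compares immigrants and natives within the same education level and age.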
Now we predict log wages using separate models for each gender group, on the sample of natives only.
acs2000$in_reg5 <- (acs2000$foreign == 0 | acs2000$sex == 1)
sum(acs2000$in_reg5)
## [1] 129629
mod5 <- lm(modf, data = acs2000, subset = in_reg5)

acs2000$in_reg6 <- (acs2000$foreign == 0 | acs2000$sex == 2)
sum(acs2000$in_reg6)
## [1] 127555
mod6 <- lm(modf, data = acs2000, subset = in_reg6)

pred_male_native <- predict(mod5, data = acs2000, subset = in_reg5) 
pred_female_native <- predict(mod6, data = acs2000, subset = in_reg6) 
print("The predicted average wage for native males is : ")
## [1] "The predicted average wage for native males is : "
mean(pred_male_native)
## [1] 6.35432
print("The predicted average wage for native females is : ")
## [1] "The predicted average wage for native females is : "
mean(pred_female_native)
## [1] 6.337697
It is interesting to see that the average predicted log wage for native men (6.354) is higher than that for native women (6.338), suggesting that women earn less than men and that sex is an important variable to take into account when studying wage effects.
After estimating on natives only, we carry out the prediction for everyone (natives and immigrants).
acs2000$in_reg7 <- (acs2000$sex == 1)
sum(acs2000$in_reg7)
## [1] 70800
mod7 <- lm(modf, data = acs2000, subset = in_reg7)

acs2000$in_reg8 <- (acs2000$sex == 2)
sum(acs2000$in_reg8)
## [1] 65737
mod8 <- lm(modf, data = acs2000, subset = in_reg8)

pred_male <- predict(mod7, data = acs2000)
pred_female <- predict(mod8, data = acs2000)

print("The average predicted wage for males is :") 
## [1] "The average predicted wage for males is :"
mean(pred_male)
## [1] 6.542091
print("The average predicted wage for females is :")
## [1] "The average predicted wage for females is :"
mean(pred_female)
## [1] 6.122443
mod9 <- lm(modf, data = acs2000)
lnw_pred <- predict(mod9, data = acs2000)
print("The average predicted wage on the whole sample is : ")
## [1] "The average predicted wage on the whole sample is : "
mean(lnw_pred)
## [1] 6.340048
Here, the predicted log wage for women (natives and immigrants) is lower than the one predicted for men (natives and immigrants). Moreover, the average log wage predicted for the whole sample is lower than the one predicted for native men, possibly because both male and female immigrants are included in this sample.
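The two-step idea described above (estimate on natives of one gender, then predict for everyone of that gender) can be sketched as follows on simulated data. All names and numbers are illustrative assumptions. Note that predict.lm() takes its prediction sample through the newdata argument.

```r
set.seed(2)
n <- 400
sim <- data.frame(
  age     = sample(18:65, n, replace = TRUE),
  sex     = sample(1:2, n, replace = TRUE),   # 1 = male, 2 = female (IPUMS coding)
  foreign = rbinom(n, 1, 0.25)
)
sim$lnw <- 5.5 + 0.015 * sim$age - 0.2 * (sim$sex == 2) -
  0.2 * sim$foreign + rnorm(n, sd = 0.4)

sim$lnw_pred <- NA_real_
for (s in 1:2) {
  natives_s  <- sim$foreign == 0 & sim$sex == s   # "&": natives of one gender only
  fit_s      <- lm(lnw ~ age, data = sim[natives_s, ])
  everyone_s <- sim$sex == s
  # newdata supplies the rows to predict for, including immigrants
  sim$lnw_pred[everyone_s] <- predict(fit_s, newdata = sim[everyone_s, ])
}
tapply(sim$lnw_pred, sim$sex, mean)
```

Every worker then receives the wage a native of the same gender and age would be predicted to earn, which is the counterfactual needed to measure downgrading.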
Then we compute the standardized residuals of the gender-specific models. Because heteroscedasticity is likely present, the variance of the error term is not constant; further below we extract the estimated variances from the diagonal of the covariance matrix (using diag) and take the square root to obtain the standard errors.
standard_res1 <- rstandard(mod5)
standard_res2 <- rstandard(mod6)
standard_res3 <- rstandard(mod7)
standard_res4 <- rstandard(mod8)
Sigma2 is the variance of the residuals, a value used when checking the normality of the residuals. HC1 gives White (heteroscedasticity-robust) standard errors. We compute them separately with clustering by gender, education and age.
nonrobust.se <- sqrt(diag(vcov(mod9)))

# HC1 standard errors of mod9, clustered in turn by gender, education and age
gender_noise    <- sqrt(diag(vcovCL(mod9, cluster = ~ sex,   type = "HC1")))
education_noise <- sqrt(diag(vcovCL(mod9, cluster = ~ educd, type = "HC1")))
age_noise       <- sqrt(diag(vcovCL(mod9, cluster = ~ age,   type = "HC1")))
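To show what the HC1 correction does, the sketch below computes unclustered White standard errors by hand in base R on simulated heteroscedastic data. The data and names are assumptions for illustration, and the clustering step used above is deliberately left out.

```r
set.seed(3)
n <- 300
x <- runif(n, 18, 65)
y <- 5 + 0.02 * x + rnorm(n, sd = 0.01 * x)   # error variance grows with x
fit <- lm(y ~ x)

X <- model.matrix(fit)
e <- residuals(fit)
k <- ncol(X)
bread <- solve(crossprod(X))   # (X'X)^-1
meat  <- crossprod(X * e)      # X' diag(e^2) X
vcov_hc1 <- bread %*% meat %*% bread * n / (n - k)   # HC1 degrees-of-freedom scaling

cbind(classical = sqrt(diag(vcov(fit))),
      hc1       = sqrt(diag(vcov_hc1)))
```

The result should agree with sandwich::vcovHC(fit, type = "HC1"). The gap between the two columns indicates how much heteroscedasticity distorts the classical standard errors.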

Q2.2 - Compute the percentile ranks in actual and predicted wage distribution.

acs2000%>%
  mutate(observed_percentile = rank(lnw)/length(lnw)) -> acs2000
acs2000%>%
  mutate(estimated_percentile = rank(lnw_pred)/length(lnw_pred)) -> acs2000
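The ranks above are computed within the pooled sample. If the aim is to place immigrants in the native wage distribution specifically, one possible alternative is to evaluate the natives' empirical distribution function at each immigrant's wage. The sketch below does this on simulated log wages; all numbers are made up.

```r
set.seed(4)
natives    <- rnorm(1000, mean = 6.4, sd = 0.7)   # simulated native log wages
immigrants <- rnorm(200,  mean = 6.1, sd = 0.7)   # simulated immigrant log wages

native_cdf <- ecdf(natives)            # empirical CDF of native wages
imm_rank   <- native_cdf(immigrants)   # share of natives earning no more than each immigrant

summary(imm_rank)
```

Applied to natives themselves, these ranks are uniform on [0, 1] by construction, which gives a natural benchmark for the density plots.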

Q2.3 - Replicate figure 1.A and 1.D

First graph

plot(c(0,1),c(-0.5,1.5), type="n",
     xlab="percentile", ylab="density",
     main = "Position of foreign workers in native wage distribution")
lines(density(acs2000$estimated_percentile[(acs2000$foreign == 1)]),
      lty=1,lwd=2, col="black")
lines(density(acs2000$observed_percentile[(acs2000$foreign== 1)]),
      lty=2,lwd=2, col="blue")
abline(h = 1, lty = 3)  # reference line: percentile ranks of the full sample have density 1
legend("bottomleft", legend=c("Estimated value", "Observed value"), lty=1:2, col=c("black", "blue") )

Interpretation :

This graph shows where immigrants are actually situated in the native wage distribution (dashed line, actual density) and where we would assign them if they were not subject to skill downgrading (solid line, predicted density).
The horizontal line represents the native wage distribution and allows us to make comparisons with the distribution of immigrant wages. Immigrants are more concentrated than natives at low wages (where the densities lie above the horizontal line) and less concentrated than natives at high wages (where the densities lie below it).
Also, the dashed line (showing where immigrants are actually located) lies above the solid line (showing where immigrants should be located based on their education and experience) at low percentiles of the wage distribution, but it tends to lie below the solid line further up the wage distribution.
This matches the pattern the authors report for all three countries in their sample (Germany, UK and US): relative to natives, immigrants are over-represented at the bottom of the wage distribution and under-represented in the middle and at the upper end.
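The notion of over-representation and under-representation can be made numeric. The sketch below uses simulated wages (an assumption for illustration only) and computes the share of immigrants in each native decile, scaled so that a value of 1 means the same concentration as natives.

```r
set.seed(5)
natives    <- rnorm(2000, mean = 6.4, sd = 0.7)
immigrants <- rnorm(500,  mean = 6.0, sd = 0.7)
imm_rank   <- ecdf(natives)(immigrants)   # position in the native distribution

# Share of immigrants per native decile, times 10 so that 1 = parity with natives
deciles  <- cut(imm_rank, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
rel_dens <- as.numeric(table(deciles)) / length(imm_rank) * 10
round(rel_dens, 2)
```

Values above 1 in the bottom deciles and below 1 in the top deciles correspond to a density curve that starts above the horizontal reference line and ends below it.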

Second graph

acs2000$in_reg <- (acs2000$immclass == 2 | acs2000$immclass == 3 | acs2000$foreign == 0)

plot(c(0,1),c(0.0,3.0), type="n",
     xlab="percentile", ylab="density",
     main = "Actual vs predicted position of foreign workers")
lines(density(acs2000$estimated_percentile[(acs2000$foreign == 1 & acs2000$immclass== 1)]),
      lty=2,lwd=2, col="dark green")
lines(density(acs2000$estimated_percentile[(acs2000$foreign== 1& acs2000$in_reg == TRUE)]),
      lty=2,lwd=2, col="orange")
lines(density(acs2000$estimated_percentile[(acs2000$foreign== 1& acs2000$immclass == 4)]),
      lty=2,lwd=2, col="purple")
legend("topleft", legend=c("Arrival <= 2 years", "Arrival 3-10 years","Arrival > 10 years"), lty=2, col=c("dark green", "orange","purple"))

Interpretation :

The graph shows three curves, each representing a group of immigrants defined by the number of years since their arrival in the labor market. The three curves can be compared with the horizontal line that corresponds to the distribution of natives.
Where a curve lies above this line, immigrants are more concentrated than the working population; where it lies below, natives are more concentrated.

Question 3 (Data Wrangling) - Extract from ipums USA the relevant variables, compare your data extract to census_acs_2000.dta, as well as the key regression results.

First, we extract the data from IPUMS, selecting the 5% sample (which provides 14,081,466 observations).
print(getwd())
## [1] "C:/Users/33662/Desktop"
ddi2000 <- read_ipums_ddi(file.path(getwd(), "usa_00006.xml"))
dat2000 <- read_ipums_micro(ddi2000, data_file = file.path(getwd(), "usa_00006.dat"))
## Use of data from IPUMS USA is subject to conditions including that users should
## cite the data appropriately. Use command `ipums_conditions()` for more details.
dat2000 %>% 
  select(YEAR, AGE, LABFORCE, WKSWORK2, INCWAGE, CLASSWKR, EDUCD, RACE,
         MARST, EMPSTAT, BPLD, YRIMMIG, SEX, SERIAL, OCC) -> dat2000
dat2000 %>%
  mutate(FOREIGN = (BPLD>=15000),
         IMMCLASS = case_when(FOREIGN==1 & YEAR-YRIMMIG<=2 ~ 1,
                              FOREIGN==1 & YEAR-YRIMMIG> 2 & YEAR-YRIMMIG<=5 ~ 2,
                              FOREIGN==1 & YEAR-YRIMMIG> 5 & YEAR-YRIMMIG<=10 ~ 3,
                              FOREIGN==1 & YEAR-YRIMMIG> 10 & !is.na(YEAR-YRIMMIG) ~ 4)) %>%
  mutate(
    SCHOOLING = ifelse(EDUCD > 113, 25, 
                       ifelse(EDUCD > 100 & EDUCD < 114, 17.2, 
                              ifelse(EDUCD > 61 & EDUCD < 101, 15.5, 
                                     ifelse(EDUCD < 62, 8, 2)))))%>%
  mutate(PE = AGE - 6 - SCHOOLING,
         WEEKS = 7 * (WKSWORK2 == 1) + 20 * (WKSWORK2 == 2) +
           33 * (WKSWORK2 == 3) + 43.5 * (WKSWORK2 == 4) +
           48.5 * (WKSWORK2 == 5) + 51 * (WKSWORK2 == 6) ,
         LNW = log(INCWAGE / WEEKS) ) -> dat2000
dat2000 %>%
  filter(CLASSWKR == 2, PE >= 1 & PE <= 40, !is.na(LNW), LNW > 0 & LNW < Inf,
         AGE >= 18 & AGE <= 65, LABFORCE == 2) -> dat2000
modreg <- LNW ~ FOREIGN +AGE*EDUCD + factor(EDUCD)

dat2000$in_mod1 <- (dat2000$IMMCLASS ==1  | dat2000$FOREIGN == 0)
sum(dat2000$in_mod1) 
## [1] 4322013
modreg1 <- lm(modreg, data=dat2000, subset=in_mod1)

mean(dat2000$LNW[dat2000$IMMCLASS == 1], na.rm=TRUE) -
  mean(dat2000$LNW[dat2000$FOREIGN == 0], na.rm=TRUE)
## [1] -0.3736455
dat2000$in_mod2 <- (dat2000$IMMCLASS == 2 | dat2000$FOREIGN == 0)
sum(dat2000$in_mod2) 
## [1] 4335981
modreg2 <- lm(modreg, data=dat2000, subset=in_mod2)


dat2000$in_mod3 <- (dat2000$IMMCLASS == 3 | dat2000$FOREIGN == 0)
sum(dat2000$in_mod3) 
## [1] 4378968
modreg3 <- lm(modreg, data=dat2000, subset=in_mod3)


dat2000$in_mod4 <- (dat2000$IMMCLASS == 4 | dat2000$FOREIGN == 0)
sum(dat2000$in_mod4) 
## [1] 4670483
modreg4 <- lm(modreg, data=dat2000, subset=in_mod4)

modelsreg = list(modreg1, modreg2, modreg3, modreg4)
stargazer(modelsreg,
          title = "log wage regressions : natives vs immigrant groups (ipums data)",
          type="text", 
          keep = "FOREIGN")
## 
## log wage regressions : natives vs immigrant groups (ipums data)
## =======================================================================================================================================================
##                                                                             Dependent variable:                                                        
##                     -----------------------------------------------------------------------------------------------------------------------------------
##                                                                                     LNW                                                                
##                                   (1)                              (2)                              (3)                              (4)               
## -------------------------------------------------------------------------------------------------------------------------------------------------------
## FOREIGN                        -0.111***                        -0.104***                        -0.077***                         0.022***            
##                                 (0.003)                          (0.003)                          (0.002)                          (0.001)             
##                                                                                                                                                        
## -------------------------------------------------------------------------------------------------------------------------------------------------------
## Observations                   4,322,013                        4,335,981                        4,378,968                        4,670,483            
## R2                               0.187                            0.187                            0.187                            0.187              
## Adjusted R2                      0.187                            0.187                            0.187                            0.187              
## Residual Std. Error       0.719 (df = 4321994)             0.718 (df = 4335962)             0.718 (df = 4378949)             0.718 (df = 4670464)      
## F Statistic         55,179.410*** (df = 18; 4321994) 55,346.950*** (df = 18; 4335962) 55,877.380*** (df = 18; 4378949) 59,629.050*** (df = 18; 4670464)
## =======================================================================================================================================================
## Note:                                                                                                                       *p<0.1; **p<0.05; ***p<0.01

Interpretation :

We see the same kind of result as in the first sample, that is, downgrading for immigrants who recently arrived in the labor market (coefficients of -0.111 and -0.104 for groups 1 and 2), which diminishes for less recent arrivals (-0.077 for group 3). For group 4, immigrants who arrived more than 10 years ago, an upgrading can be observed as the coefficient on FOREIGN becomes positive (0.022), although it remains very small.

It is also possible to compare the regression results and predictions of the models for each gender.

dat2000$in_mod5 <- (dat2000$FOREIGN == 0 | dat2000$SEX == 1)
sum(dat2000$in_mod5)
## [1] 4628957
modreg5 <- lm(modreg, data = dat2000, subset = in_mod5)

dat2000$in_mod6 <- (dat2000$FOREIGN == 0 | dat2000$SEX == 2)
sum(dat2000$in_mod6)
## [1] 4529244
modreg6 <- lm(modreg, data = dat2000, subset = in_mod6)

pred_male_native_ipums <- predict(modreg5, data = dat2000, subset = in_mod5) 
pred_female_native_ipums <- predict(modreg6, data = dat2000, subset = in_mod6) 
print("The predicted average wage for native males is : ")
## [1] "The predicted average wage for native males is : "
mean(pred_male_native_ipums)
## [1] 6.32204
print("The predicted average wage for native females is : ")
## [1] "The predicted average wage for native females is : "
mean(pred_female_native_ipums)
## [1] 6.307255
The gap between native men's and native women's predicted wages is smaller than in the first sample, although the predicted average wage of native women is still lower than that of native men. This may be due to the larger number of observations, which leads to more precise results.
dat2000$in_mod7 <- (dat2000$SEX == 1)
sum(dat2000$in_mod7)
## [1] 2568125
modreg7 <- lm(modreg, data = dat2000, subset = in_mod7)

dat2000$in_mod8 <- (dat2000$SEX == 2)
sum(dat2000$in_mod8)
## [1] 2315454
modreg8 <- lm(modreg, data = dat2000, subset = in_mod8)

pred_male_ipums <- predict(modreg7, data = dat2000)
pred_female_ipums <- predict(modreg8, data = dat2000)

print("The average predicted wage for males is :") 
## [1] "The average predicted wage for males is :"
mean(pred_male_ipums)
## [1] 6.496294
print("The average predicted wage for females is :")
## [1] "The average predicted wage for females is :"
mean(pred_female_ipums)
## [1] 6.098582
modreg9 <- lm(modreg, data = dat2000)
LNW_PRED <- predict(modreg9, data = dat2000)
print("The average predicted wage on the whole sample is : ")
## [1] "The average predicted wage on the whole sample is : "
mean(LNW_PRED)
## [1] 6.307727
Here too, the predicted average wage for men is higher than for women. Also, with the IPUMS USA sample, the average predicted wage for the whole sample is lower than the one predicted with the first sample (acs2000).
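The comparison between the two samples can be organized in a single table. The sketch below uses a hypothetical helper, summarise_sample(), and simulated data standing in for the two extracts. Neither the helper nor the simplified regression is part of the analysis above.

```r
# Hypothetical helper: key statistics of one sample, for side-by-side comparison
summarise_sample <- function(d) {
  fit <- lm(lnw ~ foreign + age, data = d)
  c(n             = nrow(d),
    mean_lnw      = mean(d$lnw),
    share_foreign = mean(d$foreign),
    coef_foreign  = unname(coef(fit)["foreign"]))
}

# Simulated stand-ins for a small and a large extract (values are made up)
make_sim <- function(n) {
  d <- data.frame(age = sample(18:65, n, replace = TRUE),
                  foreign = rbinom(n, 1, 0.1))
  d$lnw <- 6 + 0.01 * d$age - 0.1 * d$foreign + rnorm(n, sd = 0.7)
  d
}
set.seed(6)
small <- make_sim(1000)
large <- make_sim(20000)

comparison <- rbind(small = summarise_sample(small),
                    large = summarise_sample(large))
round(comparison, 3)
```

Placing sample size, mean log wage, immigrant share and the coefficient of interest in adjacent rows makes it easier to see whether differences between extracts come from composition or from the estimates themselves.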