In this lab exercise, you will learn:
The gender pay gap, or gender wage gap, is the average difference in remuneration between working men and women. Women are generally considered to be paid less than men. In this exercise, we revisit this issue using the wage1 data set, which is part of the R package wooldridge.
rm(list=ls())
Let’s load all the packages needed for this exercise (this assumes you’ve already installed them).
#install.packages("wooldridge") # install R package "wooldridge"
library(wooldridge) # load package; to get data
library(stargazer) # load package; to put regression results into a single stargazer table
attach(wage1) # Allowing objects in the database to be accessed by simply giving their names
str(wage1)
## 'data.frame': 526 obs. of 24 variables:
## $ wage : num 3.1 3.24 3 6 5.3 ...
## $ educ : int 11 12 11 8 12 16 18 12 12 17 ...
## $ exper : int 2 22 2 44 7 9 15 5 26 22 ...
## $ tenure : int 0 2 0 28 2 8 7 3 4 21 ...
## $ nonwhite: int 0 0 0 0 0 0 0 0 0 0 ...
## $ female : int 1 1 0 0 0 0 0 1 1 0 ...
## $ married : int 0 1 0 1 1 1 0 0 0 1 ...
## $ numdep : int 2 3 2 0 1 0 0 0 2 0 ...
## $ smsa : int 1 1 0 1 0 1 1 1 1 1 ...
## $ northcen: int 0 0 0 0 0 0 0 0 0 0 ...
## $ south : int 0 0 0 0 0 0 0 0 0 0 ...
## $ west : int 1 1 1 1 1 1 1 1 1 1 ...
## $ construc: int 0 0 0 0 0 0 0 0 0 0 ...
## $ ndurman : int 0 0 0 0 0 0 0 0 0 0 ...
## $ trcommpu: int 0 0 0 0 0 0 0 0 0 0 ...
## $ trade : int 0 0 1 0 0 0 1 0 1 0 ...
## $ services: int 0 1 0 0 0 0 0 0 0 0 ...
## $ profserv: int 0 0 0 0 0 1 0 0 0 0 ...
## $ profocc : int 0 0 0 0 0 1 1 1 1 1 ...
## $ clerocc : int 0 0 0 1 0 0 0 0 0 0 ...
## $ servocc : int 0 1 0 0 0 0 0 0 0 0 ...
## $ lwage : num 1.13 1.18 1.1 1.79 1.67 ...
## $ expersq : int 4 484 4 1936 49 81 225 25 676 484 ...
## $ tenursq : int 0 4 0 784 4 64 49 9 16 441 ...
## - attr(*, "time.stamp")= chr "25 Jun 2011 23:03"
Description of main variables:
Consider the following multiple regression model: \[ wage_i = \beta_0 + \beta_1 \cdot female_i + \beta_2 \cdot married_i + \beta_3 \cdot exper_i + \beta_4 \cdot educ_i + u_i.\]
In this multiple regression, we have \(k+1=5\) coefficients (\(k=4\) slopes plus the intercept). Thus, the residual degrees of freedom are \(n-(k+1) = 526-5 = 521\).
The OLS estimation of the model:
fit.m <- lm(wage ~ female + married + exper + educ)
summary(fit.m)
##
## Call:
## lm(formula = wage ~ female + married + exper + educ)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4057 -1.9042 -0.5982 1.1454 14.6545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.79066 0.75121 -2.384 0.0175 *
## female -2.06710 0.27221 -7.594 1.45e-13 ***
## married 0.66024 0.29685 2.224 0.0266 *
## exper 0.05567 0.01106 5.035 6.59e-07 ***
## educ 0.58332 0.05166 11.292 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.066 on 521 degrees of freedom
## Multiple R-squared: 0.3158, Adjusted R-squared: 0.3105
## F-statistic: 60.12 on 4 and 521 DF, p-value: < 2.2e-16
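The degrees-of-freedom count can be verified directly from the fitted object. The following is a minimal sketch; it passes `data = wage1` to `lm()` instead of relying on `attach()`, and the names `fit_check`, `n`, `k1`, and `df_manual` are introduced here only for illustration.

```r
library(wooldridge)  # provides the wage1 data set

# Refit the model, passing the data explicitly rather than via attach()
fit_check <- lm(wage ~ female + married + exper + educ, data = wage1)

n  <- nobs(fit_check)          # number of observations used in the fit
k1 <- length(coef(fit_check))  # k + 1: four slopes plus the intercept

# Residual degrees of freedom: n - (k + 1)
df_manual <- n - k1
df_manual == df.residual(fit_check)
```

The comparison in the last line should agree with the "521 degrees of freedom" reported by `summary()`.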
Next, let’s discuss the issue of omitted variable bias. Consider the following regression models:
Model 1: \[ wage_i = \beta_0 + \beta_1 \cdot female_i + u_i.\]
Model 2: \[ wage_i = \beta_0 + \beta_1 \cdot female_i + \beta_2 \cdot married_i + \beta_3 \cdot exper_i + \beta_4 \cdot educ_i + u_i.\]
We run the OLS regressions and summarize the results in a table:
## Run the OLS regressions ##
## Model 1 ##
fit1 <- lm(wage ~ female)
## Model 2 ##
fit2 <- lm(wage ~ female + married + exper + educ)
## Summary of the regression results ##
stargazer(fit1, fit2, title="Regression Results: The Wage Equation", type="text", df=FALSE, digits=2, column.labels = c("fit1", "fit2"))
##
## Regression Results: The Wage Equation
## ================================================
## Dependent variable:
## ----------------------------
## wage
## fit1 fit2
## (1) (2)
## ------------------------------------------------
## female -2.51*** -2.07***
## (0.30) (0.27)
##
## married 0.66**
## (0.30)
##
## exper 0.06***
## (0.01)
##
## educ 0.58***
## (0.05)
##
## Constant 7.10*** -1.79**
## (0.21) (0.75)
##
## ------------------------------------------------
## Observations 526 526
## R2 0.12 0.32
## Adjusted R2 0.11 0.31
## Residual Std. Error 3.48 3.07
## F Statistic 68.54*** 60.12***
## ================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
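Because female is a 0/1 dummy, Model 1 has a simple reading: the intercept is the average wage of men, and the slope is the difference between the average wages of women and men. A short sketch of that check follows; the object names `m1`, `mean_men`, and `mean_women` are illustrative and not part of the original exercise.

```r
library(wooldridge)  # provides the wage1 data set

m1 <- lm(wage ~ female, data = wage1)

# Group averages of hourly wage
mean_men   <- mean(wage1$wage[wage1$female == 0])
mean_women <- mean(wage1$wage[wage1$female == 1])

# Intercept equals the men's average; slope equals the women-minus-men gap
all.equal(unname(coef(m1)["(Intercept)"]), mean_men)
all.equal(unname(coef(m1)["female"]), mean_women - mean_men)
```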
- Does the coefficient on female in Model 1 suffer from omitted variable bias? Explain.
Yes, it seems so. The gender dummy variable is correlated with omitted factors that affect wage, such as years of education. In addition, the coefficient on female shrinks in magnitude by roughly \(17.5\%\) (i.e., \((2.51-2.07)/2.51\)) when the additional regressors are added to Model 1. This change is substantively large and also large relative to the standard error in Model 1 (0.44 versus 0.30).
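The percentage change quoted above can be recomputed from the fitted models rather than from the rounded table entries. This is a sketch; `b_short`, `b_long`, `rel_change`, and `se_short` are names chosen here for illustration, and the result may differ slightly from 17.5% because the table rounds the coefficients to two digits.

```r
library(wooldridge)  # provides the wage1 data set

short_fit <- lm(wage ~ female, data = wage1)
long_fit  <- lm(wage ~ female + married + exper + educ, data = wage1)

b_short <- unname(coef(short_fit)["female"])
b_long  <- unname(coef(long_fit)["female"])

# Relative change in the magnitude of the coefficient on female
rel_change <- (abs(b_short) - abs(b_long)) / abs(b_short)
round(100 * rel_change, 1)

# Size of the change measured in Model 1 standard errors
se_short <- summary(short_fit)$coefficients["female", "Std. Error"]
abs(b_short - b_long) / se_short
```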
- Suppose the omitted variable bias occurs because the variable \(educ\) is excluded in Model 1. Does the omitted variable bias lead the estimated slope to be too large or too small? Explain.
In Chapter 6, we discussed that the OLS estimator has the following limit if the second and the third least squares assumptions hold but the first does not (i.e., \(\rho_{Xu}\neq 0\)): \[\hat\beta_1 \rightarrow_p \beta_1 + \rho_{Xu} \cdot \frac{\sigma_u}{\sigma_X}.\] The term \(\rho_{Xu} \cdot \frac{\sigma_u}{\sigma_X}\) is the bias in \(\hat\beta_1\).
In Model 1, female is \(X_1\) and educ is an omitted variable \(X_2\). Since female and educ are negatively correlated (i.e., \(X_1'X_2<0\), with both variables measured as deviations from their means), and educ generally has a positive effect on wage (i.e., \(\beta_2>0\)), the omitted variable bias is negative (i.e., \((X_1'X_1)^{-1} X_1'X_2\beta_2<0\)). Therefore, the omitted variable bias leads the estimated slope to be too small: the coefficient on female in Model 1 is more negative than it would be with educ held fixed, so Model 1 overstates the size of the gender wage gap.
## Correlation between female and educ ##
cor(wage1[,c("female", "educ")])
## female educ
## female 1.00000000 -0.08502941
## educ -0.08502941 1.00000000
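To connect the matrix expression to the correlation just computed, the two-variable case can be written out explicitly. This is a sketch under the assumption that the error term of the model including educ is uncorrelated with both female and educ; the symbol \(\tilde\beta_1\), denoting the Model 1 slope, is introduced here and is not part of the original notation.

\[ \tilde\beta_1 \rightarrow_p \beta_1 + \beta_2 \cdot \frac{Cov(female, educ)}{Var(female)} = \beta_1 + \beta_2 \cdot \rho_{female,educ} \cdot \frac{\sigma_{educ}}{\sigma_{female}}. \]

With \(\rho_{female,educ} \approx -0.085 < 0\) and \(\beta_2 > 0\), the second term is negative, which is the sign of the bias.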
To check this, we add educ to Model 1 (call the result Model 3), rerun OLS, and summarize the results in a table:
## Run the OLS regressions: adding "educ" ##
## Model 3 ##
fit3 <- lm(wage ~ female + educ)
## Summary of the regression results ##
stargazer(fit1, fit3, title="Regression Results: The Wage Equation", type="text", df=FALSE, digits=2, column.labels = c("fit1", "fit3"))
##
## Regression Results: The Wage Equation
## ================================================
## Dependent variable:
## ----------------------------
## wage
## fit1 fit3
## (1) (2)
## ------------------------------------------------
## female -2.51*** -2.27***
## (0.30) (0.28)
##
## educ 0.51***
## (0.05)
##
## Constant 7.10*** 0.62
## (0.21) (0.67)
##
## ------------------------------------------------
## Observations 526 526
## R2 0.12 0.26
## Adjusted R2 0.11 0.26
## Residual Std. Error 3.48 3.19
## F Statistic 68.54*** 91.32***
## ================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
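The direction of the bias can also be checked numerically. In-sample, the Model 1 slope equals the Model 3 slope plus the Model 3 coefficient on educ times the slope from an auxiliary regression of educ on female. The sketch below verifies this identity; the object names (`short`, `long`, `aux`, `delta`, `b2`, `bias`) are illustrative choices, not part of the original lab.

```r
library(wooldridge)  # provides the wage1 data set

short <- lm(wage ~ female, data = wage1)         # Model 1
long  <- lm(wage ~ female + educ, data = wage1)  # Model 3
aux   <- lm(educ ~ female, data = wage1)         # auxiliary regression

delta <- unname(coef(aux)["female"])  # average educ gap, women minus men
b2    <- unname(coef(long)["educ"])   # estimated effect of educ on wage

# Estimated omitted variable bias in the Model 1 slope
bias <- b2 * delta
bias < 0

# Identity: short slope = long slope + bias
all.equal(unname(coef(short)["female"]),
          unname(coef(long)["female"]) + bias)
```

The product is negative, which matches the move from -2.51 in Model 1 to -2.27 in Model 3: omitting educ makes the estimated coefficient on female too small (too negative).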