L6 (Lab) - The Multiple Regression

Exercise: The Gender Wage Gap

The gender pay gap or gender wage gap is the average difference between the remuneration for men and women who are working. Women are generally considered to be paid less than men. In this exercise, we are going to revisit this issue using WAGE1. This data set is part of the R package wooldridge.

Clear the Workspace

rm(list=ls())

Install and Load Needed Packages

Let’s load all the packages needed for this exercise (this assumes you’ve already installed them).

#install.packages(c("wooldridge", "stargazer"))  # run once if the packages are not yet installed
library(wooldridge)   # load package; provides the wage1 data set
library(stargazer)    # load package; puts regression results into a single table

Import Data: WAGE1

attach(wage1)   # Allowing objects in the database to be accessed by simply giving their names
str(wage1)

## 'data.frame':    526 obs. of  24 variables:
##  $ wage    : num  3.1 3.24 3 6 5.3 ...
##  $ educ    : int  11 12 11 8 12 16 18 12 12 17 ...
##  $ exper   : int  2 22 2 44 7 9 15 5 26 22 ...
##  $ tenure  : int  0 2 0 28 2 8 7 3 4 21 ...
##  $ nonwhite: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ female  : int  1 1 0 0 0 0 0 1 1 0 ...
##  $ married : int  0 1 0 1 1 1 0 0 0 1 ...
##  $ numdep  : int  2 3 2 0 1 0 0 0 2 0 ...
##  $ smsa    : int  1 1 0 1 0 1 1 1 1 1 ...
##  $ northcen: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ south   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ west    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ construc: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ndurman : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ trcommpu: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ trade   : int  0 0 1 0 0 0 1 0 1 0 ...
##  $ services: int  0 1 0 0 0 0 0 0 0 0 ...
##  $ profserv: int  0 0 0 0 0 1 0 0 0 0 ...
##  $ profocc : int  0 0 0 0 0 1 1 1 1 1 ...
##  $ clerocc : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ servocc : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ lwage   : num  1.13 1.18 1.1 1.79 1.67 ...
##  $ expersq : int  4 484 4 1936 49 81 225 25 676 484 ...
##  $ tenursq : int  0 4 0 784 4 64 49 9 16 441 ...
##  - attr(*, "time.stamp")= chr "25 Jun 2011 23:03"

Description of main variables:

wage: average hourly earnings.
exper: years of potential experience.
educ: years of education.
female: $=1$ if female.
married: $=1$ if married.

Consider the following multiple regression model: \[ wage_i = \beta_0 + \beta_1 \cdot female_i + \beta_2 \cdot married_i + \beta_3 \cdot exper_i + \beta_4 \cdot educ_i + u_i.\]

In this multiple regression, we have $k+1=5$ coefficients ($k=4$ slopes plus the intercept). Thus, the residual degrees of freedom are $n-(k+1) = 526-5 = 521$.

The OLS estimation of the model:

fit.m <- lm(wage ~ female + married + exper + educ)
summary(fit.m)

## 
## Call:
## lm(formula = wage ~ female + married + exper + educ)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4057 -1.9042 -0.5982  1.1454 14.6545 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.79066    0.75121  -2.384   0.0175 *  
## female      -2.06710    0.27221  -7.594 1.45e-13 ***
## married      0.66024    0.29685   2.224   0.0266 *  
## exper        0.05567    0.01106   5.035 6.59e-07 ***
## educ         0.58332    0.05166  11.292  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.066 on 521 degrees of freedom
## Multiple R-squared:  0.3158, Adjusted R-squared:  0.3105 
## F-statistic: 60.12 on 4 and 521 DF,  p-value: < 2.2e-16

Estimation Results:

For each coefficient: OLS estimate, standard error, t-statistic, p-value
$SER$: $3.066$
$R^2$: $0.3158$
$\bar R^2$: $0.3105$
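
The quantities listed above can also be pulled out of the fitted model directly instead of being read off the printout. The following is a minimal sketch, assuming the wooldridge package is installed; the object names fit and s are illustrative and not part of the lab.

```r
library(wooldridge)  # provides wage1

# Same specification as fit.m, but passing the data explicitly instead of relying on attach()
fit <- lm(wage ~ female + married + exper + educ, data = wage1)
s   <- summary(fit)

coef(s)            # matrix of estimates, standard errors, t-statistics, p-values
s$sigma            # standard error of the regression (SER)
s$r.squared        # R-squared
s$adj.r.squared    # adjusted R-squared
df.residual(fit)   # residual degrees of freedom, n - (k + 1)
```

Passing data = wage1 keeps the example independent of attach(), so it runs in a fresh session.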

Interpretation of OLS Estimation:

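average return to work experience: $\hat\beta_3 = 0.06$ (one more year of experience is associated with a $0.06$ higher hourly wage, holding the other regressors fixed)
average return to education: $\hat\beta_4 = 0.58$ (one more year of education is associated with a $0.58$ higher hourly wage, holding the other regressors fixed)
marriage premium: $\hat\beta_2 = 0.66$ (married workers earn on average $0.66$ more per hour than unmarried workers with the same gender, experience and education)
gender wage gap: $\hat\beta_1 = -2.07$ (women earn on average $2.07$ less per hour than men with the same marital status, experience and education)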

Next, let’s discuss the issue of omitted variable bias. Consider the following regression models:

Model 1: \[ wage_i = \beta_0 + \beta_1 \cdot female_i + u_i.\]
Model 2: \[ wage_i = \beta_0 + \beta_1 \cdot female_i + \beta_2 \cdot married_i + \beta_3 \cdot exper_i + \beta_4 \cdot educ_i + u_i.\]

We run the OLS regressions and summarize the results in a table:

## Run the OLS regressions ##
  ## Model 1 ##
fit1 <- lm(wage ~ female)   
  
  ## Model 2 ##
fit2 <- lm(wage ~ female + married + exper + educ)


## Summary of the regression results ##
stargazer(fit1, fit2, title="Regression Results: The Wage Equation", type="text", df=FALSE, digits=2, column.labels = c("fit1", "fit2"))

## 
## Regression Results: The Wage Equation
## ================================================
##                         Dependent variable:     
##                     ----------------------------
##                                 wage            
##                          fit1          fit2     
##                          (1)            (2)     
## ------------------------------------------------
## female                 -2.51***      -2.07***   
##                         (0.30)        (0.27)    
##                                                 
## married                               0.66**    
##                                       (0.30)    
##                                                 
## exper                                 0.06***   
##                                       (0.01)    
##                                                 
## educ                                  0.58***   
##                                       (0.05)    
##                                                 
## Constant               7.10***        -1.79**   
##                         (0.21)        (0.75)    
##                                                 
## ------------------------------------------------
## Observations             526            526     
## R2                       0.12          0.32     
## Adjusted R2              0.11          0.31     
## Residual Std. Error      3.48          3.07     
## F Statistic            68.54***      60.12***   
## ================================================
## Note:                *p<0.1; **p<0.05; ***p<0.01

Does the coefficient on female in Model 1 suffer from omitted variable bias? Explain.

Yes, it seems so. The gender dummy variable is correlated with omitted factors that affect wage, such as years of education. In addition, the coefficient on female falls by roughly $17.5\%$ in magnitude (i.e., $(2.51-2.07)/2.51$) when the additional regressors are added to Model 1. This change is substantively large, and it is also large relative to the standard error in Model 1 ($0.30$).
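
The percentage change quoted above can be reproduced from the two fitted models. This is a sketch, assuming the wooldridge package is installed; the names b1, b2 and se1 are illustrative.

```r
library(wooldridge)  # provides wage1

fit1 <- lm(wage ~ female, data = wage1)                           # Model 1
fit2 <- lm(wage ~ female + married + exper + educ, data = wage1)  # Model 2

b1 <- coef(fit1)["female"]
b2 <- coef(fit2)["female"]

# Relative change in the magnitude of the coefficient on female
(abs(b1) - abs(b2)) / abs(b1)

# Change in the coefficient, measured in Model 1 standard errors
se1 <- coef(summary(fit1))["female", "Std. Error"]
(b2 - b1) / se1
```

Because the code uses unrounded coefficients, the result may differ slightly from the figure computed with the two-digit values in the table.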

Suppose the omitted variable bias occurs because the variable $educ$ is excluded in Model 1. Does the omitted variable bias lead the estimated slope to be too large or too small? Explain.

In Chapter 6, we discussed that the OLS estimator has the following limit if the second and the third least squares assumptions hold but the first does not (i.e., $\rho_{Xu}\neq 0$): \[\hat\beta_1 \rightarrow_p \beta_1 + \rho_{Xu} \cdot \frac{\sigma_u}{\sigma_X}.\] The term $\rho_{Xu} \cdot \frac{\sigma_u}{\sigma_X}$ is the bias in $\hat\beta_1$.

Whether this bias is large or small in practice depends on $|\rho_{Xu}|$, i.e., the correlation between the regressor and the error term. The larger $|\rho_{Xu}|$ is, the larger the bias.

The direction of the bias in $\hat\beta_1$ depends on
- whether $X$ and the omitted variable are positively or negatively correlated; and
- whether the omitted variable has a positive or a negative effect on the dependent variable.

Example: Suppose the true model is \[Y = X_1\beta_1 + X_2\beta_2 + v.\] But (for some reason) we omit $X_2$ and estimate the model as: \[Y = X_1\beta_1 + u.\] For the OLS estimator $\hat\beta_1$, we find its bias is \[(X_1'X_1)^{-1} X_1'X_2\beta_2,\] where $X_1'X_2$ captures the correlation between $X_1$ and the omitted variable $X_2$, and $\beta_2$ indicates the effect of the omitted variable on $Y$. See the following equations for a derivation of the bias in $\hat\beta_1$. Given $X_1$ and $X_2$, we have: \[\begin{aligned} E(\hat\beta_1 | X_1, X_2) &= E\left[ (X_1'X_1)^{-1} X_1'Y | X_1, X_2\right]\\ &= E\left[ (X_1'X_1)^{-1} X_1'(X_1\beta_1 + u) | X_1, X_2\right]\\ &= E\left[ (X_1'X_1)^{-1} X_1'(X_1\beta_1 + X_2\beta_2 + v) | X_1, X_2 \right]\\ &= \beta_1 + (X_1'X_1)^{-1} X_1'X_2 \beta_2 + (X_1'X_1)^{-1} X_1'E(v|X_1, X_2) \\ &= \beta_1 + (X_1'X_1)^{-1} X_1'X_2\beta_2 \quad (\text{since $E(v|X_1,X_2)=0$}) \end{aligned}\]
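
The bias formula can be checked numerically. The sketch below uses simulated data, so every number in it (sample size, coefficients, seed) is an assumption made for illustration. Unlike the stylized example, each regression includes an intercept, so the slope from the auxiliary regression of $X_2$ on $X_1$ plays the role of $(X_1'X_1)^{-1} X_1'X_2$.

```r
set.seed(1)                      # arbitrary seed, for reproducibility
n  <- 1000
x1 <- rbinom(n, 1, 0.5)          # binary regressor, playing the role of female
x2 <- 12 - x1 + rnorm(n)         # omitted variable, negatively related to x1
y  <- 1 - 2 * x1 + 0.5 * x2 + rnorm(n)

long  <- lm(y ~ x1 + x2)         # regression including x2
short <- lm(y ~ x1)              # regression omitting x2
aux   <- lm(x2 ~ x1)             # auxiliary regression of the omitted variable on x1

# In-sample counterpart of (X1'X1)^{-1} X1'X2 * beta_2
bias <- coef(aux)["x1"] * coef(long)["x2"]
bias

# Check: short-regression slope = long-regression slope + bias term
all.equal(unname(coef(short)["x1"]), unname(coef(long)["x1"] + bias))
```

With a negative relation between x1 and x2 and a positive coefficient on x2, the bias term is negative, so the slope in the short regression lies below the slope in the long regression.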

In Model 1, female is $X_1$ and educ is the omitted variable $X_2$. Since female and educ are negatively correlated (i.e., $X_1'X_2<0$ when both variables are measured as deviations from their means), and educ generally has a positive effect on wage (i.e., $\beta_2>0$), the omitted variable bias is negative (i.e., $(X_1'X_1)^{-1} X_1'X_2\beta_2<0$). Therefore, the omitted variable bias leads the estimated slope to be too small, that is, too negative: Model 1 overstates the size of the gender wage gap.

## Correlation between female and educ ##
cor(wage1[,c("female", "educ")])

##             female        educ
## female  1.00000000 -0.08502941
## educ   -0.08502941  1.00000000
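
The same decomposition can be applied to the wage data when educ is the only omitted variable. This is a sketch, assuming the wooldridge package is installed; note that the regression called long here adds only educ to Model 1, so it is not Model 2, and the names short, long, aux and bias are illustrative.

```r
library(wooldridge)  # provides wage1

short <- lm(wage ~ female, data = wage1)          # Model 1
long  <- lm(wage ~ female + educ, data = wage1)   # Model 1 plus educ only
aux   <- lm(educ ~ female, data = wage1)          # how educ differs by gender

# Estimated bias term: (slope of educ on female) x (coefficient on educ)
bias <- coef(aux)["female"] * coef(long)["educ"]
bias

# Short-regression slope = long-regression slope + bias term
coef(short)["female"]
coef(long)["female"] + bias
```

The sign of bias follows from the two ingredients discussed above: the sign of the relation between female and educ, and the sign of the coefficient on educ.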