Introduction

Two-Stage Least Squares (2SLS or TSLS) is a computational method commonly used to estimate instrumental variables (IV) in econometrics, particularly when dealing with endogeneity issues in linear models. The process involves two distinct stages, each with specific computational steps.

Two-Stage Least Squares (2SLS) Methodology

Stage 1: Instrumental Variable Regression

In the first stage of 2SLS, each explanatory variable that is deemed an endogenous covariate is regressed on all the exogenous variables in the model. This includes both the exogenous covariates that are directly part of the equation of interest and any instruments that are excluded from the main regression model.

The regression equation for this stage is modeled as:

\[ X = Z\delta + \text{errors} \]

Where: - \(X\) represents the matrix of endogenous variables. - \(Z\) represents the matrix of exogenous variables and instruments. - \(\delta\) is the matrix of coefficients.

The predicted values from these regressions (\(\hat{X}\)) are calculated using:

\[ \hat{\delta} = (Z^{\text{T}}Z)^{-1}Z^{\text{T}}X \]

Which leads to:

\[ \hat{X} = Z\hat{\delta} = Z(Z^{\text{T}}Z)^{-1}Z^{\text{T}}X = P_Z X \]

Here, \(P_Z\) denotes the projection matrix that projects \(X\) onto the column space of \(Z\).

Stage 2: Main Regression Model

In the second stage, the regression of interest is estimated using the predicted values (\(\hat{X}\)) from the first stage:

\[ Y = \hat{X}\beta + \text{noise} \]

The 2SLS estimator for the coefficients (\(\beta\)) is then given by:

\[ \beta_{\text{2SLS}} = (X^{\text{T}}P_ZX)^{-1}X^{\text{T}}P_ZY \]

This equation emphasizes that \(Y\) is regressed on the predicted values of the endogenous variables corrected for any endogeneity through the instruments.

Limitations and Considerations

This method provides a robust framework for addressing endogeneity in econometric analyses, ensuring that the estimates obtained are consistent under the assumption of valid and strong instruments. Load necessary libraries and set global options.

Data Simulation

First, we simulate a dataset with L endogenous variables, K valid instruments, and a sample size N.

library(MASS)  # Ensure MASS is loaded for mvrnorm
set.seed(123)  # For reproducibility

# Parameters
N <- 1000  # Sample size
L <- 3     # Number of endogenous variables
K <- 5     # Number of instruments


# Simulate instruments
Z <- matrix(rnorm(N * K), ncol = K)
print(dim(Z))
## [1] 1000    5
# Define the mean and covariance matrix for the error terms
mean_errors <- rep(0, L + 1)  # Mean vector for L endogenous errors + 1 equation error
cov_errors <- diag(c(rep(1, L), 1))  # Initial identity matrix for simplicity

# Introduce correlation between the last L errors and the single V error
# Setting the covariance between U's and V
cov_errors[L + 1, 1:L] <- 0.5  # Correlation between V and each U
cov_errors[1:L, L + 1] <- 0.5  # Symmetric

print(cov_errors)
##      [,1] [,2] [,3] [,4]
## [1,]  1.0  0.0  0.0  0.5
## [2,]  0.0  1.0  0.0  0.5
## [3,]  0.0  0.0  1.0  0.5
## [4,]  0.5  0.5  0.5  1.0
# Generate the correlated error terms
errors <- mvrnorm(n = N, mu = mean_errors, Sigma = cov_errors)
U <- errors[, 1:L]  # Error terms for endogenous variables
V <- errors[, L + 1]  # Error term for the main equation
print(dim(U))
## [1] 1000    3
# Generate coefficients and simulate endogenous variables
beta_endog <- runif(L, 1, 5)
gamma <- matrix(runif(L * K, 1, 3), ncol = L)  # Relevance of instruments
print("Beta")
## [1] "Beta"
print(beta_endog)
## [1] 4.893176 1.929759 2.075163
print("Gamma")
## [1] "Gamma"
print(gamma)
##          [,1]     [,2]     [,3]
## [1,] 1.223911 2.993690 1.106875
## [2,] 1.573729 2.175107 1.222462
## [3,] 1.207468 1.470500 1.138694
## [4,] 1.322192 2.074059 2.839018
## [5,] 2.432702 2.817634 2.787707
X <- Z %*% gamma + U  # Endogenous variable construction
print("Dimension for X is:")
## [1] "Dimension for X is:"
print(dim(X))
## [1] 1000    3
# Generate the outcome variable
Y <- X %*% beta_endog + V

# Creating the dataset
data <- data.frame(y = Y, x = X, z = Z)
print(head(data))
##            y        x.1       x.2        x.3         z.1         z.2        z.3
## 1 -30.052708 -3.3615722 -3.123761 -3.3569627 -0.56047565 -0.99579872 -0.5116037
## 2   5.066374  1.2285438 -1.226604  0.3858558 -0.23017749 -1.03995504  0.2369379
## 3  -2.597754 -0.5166005  0.761507 -0.4203349  1.55870831 -0.01798024 -0.5415892
## 4 -19.574134 -1.9110597 -2.862495 -2.9179463  0.07050839 -0.13217513  1.2192276
## 5 -30.635058 -4.1729122 -6.222149  0.4030339  0.12928774 -2.54934277  0.1741359
## 6  82.822186  7.9428974 13.111449  8.7859670  1.71506499  1.04057346 -0.6152683
##           z.4        z.5
## 1 -0.15030748  0.1965498
## 2 -0.32775713  0.6501132
## 3 -1.44816529  0.6710042
## 4 -0.69728458 -1.2841578
## 5  2.59849023 -2.0261096
## 6 -0.03741501  2.2053261

Two-Stage Least Squares (2SLS) Estimation

We will now perform 2SLS to estimate the effect of the endogenous variables on the outcome.

# Define the formula
formula_iv <- y ~ x.1 + x.2 + x.3 | z.1 + z.2 + z.3 + z.4 + z.5
 
# Perform IV regression using 2SLS
iv_model <- ivreg(formula_iv, data = data)

# Summary of the IV model
summary(iv_model)
## 
## Call:
## ivreg(formula = formula_iv, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.81465 -0.68956  0.01088  0.69263  3.40721 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.02732    0.03171   0.862    0.389    
## x.1          4.82809    0.03946 122.352   <2e-16 ***
## x.2          1.93229    0.02165  89.238   <2e-16 ***
## x.3          2.12260    0.02095 101.301   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.001 on 996 degrees of freedom
## Multiple R-Squared: 0.9993,  Adjusted R-squared: 0.9993 
## Wald test: 4.806e+05 on 3 and 996 DF,  p-value: < 2.2e-16
P_z = Z %*% solve(t(Z) %*% Z ) %*% t(Z)

solve(t(X) %*% P_z %*% X, t(X) %*% P_z %*% Y)
##          [,1]
## [1,] 4.827855
## [2,] 1.932762
## [3,] 2.122161

Pack the data generating process into a function

Now we write the data generating process into a function

generate_data <- function(N,L,K,beta_endog,gamma,cov_errrors) {

  cov_errors <- diag(c(rep(1, L), 1))  # Initial identity matrix for simplicity
  
  # Introduce correlation between the last L errors and the single V error
  # Setting the covariance between U's and V
  cov_errors[L + 1, 1:L] <- 0.5  # Correlation between V and each U
  cov_errors[1:L, L + 1] <- 0.5  # Symmetric
  mean_errors <- rep(0, L + 1)  # Mean vector for L endogenous errors + 1 equation error
  
  
  errors <- mvrnorm(n = N, mu = mean_errors, Sigma = cov_errors)
  U <- errors[, 1:L]  # Error terms for endogenous variables
  V <- errors[, L + 1]  # Error term for the main equation
  # Simulate instruments
  Z <- matrix(rnorm(N * K), ncol = K)
  X <- Z %*% gamma + U  
  
  Y <- X %*% beta_endog + V
  
  # Creating the dataset
  data <- list(y = Y, x = X, z = Z)
}

# Parameters
N <- 1000  # Sample size
L <- 3     # Number of endogenous variables
K <- 5     # Number of instruments

beta_endog <- c(1,2,3)
gamma <- matrix(runif(L * K, 1, 3), ncol = L)  # Relevance of instruments

data <-generate_data(N,L,K,beta_endog,gamma,cov_errrors) 

Define the estimation funciton

Return the estimated 2SLS \(\hat{beta}\) and the 95 % confidence interval for 2SLS.

estim_2sls <- function(data){
  Y <- data$y
  X <- data$x
  Z <- data$z
  N <- length(Y)
  K <- ncol(X)
  P_z = Z %*% solve(t(Z) %*% Z ) %*% t(Z)
  beta_2sls <- solve(t(X) %*% P_z %*% X, t(X) %*% P_z %*% Y)  
  # construct the residual
  eps_hat <- Y - X %*% beta_2sls
  sigma_2_2sls <- t(eps_hat) %*% eps_hat / (N-1-K) 
  sd_beta_2sls <- sqrt(diag(solve(t(X) %*% P_z %*% X) * sigma_2_2sls[1]))
  CI_95_ub <- beta_2sls + 1.96 * sd_beta_2sls
  CI_95_lb <- beta_2sls - 1.96 * sd_beta_2sls
  return(list(beta=beta_2sls,lb=CI_95_lb,ub=CI_95_ub))
}

Monte Carlo Simulation

library(miscTools)
nMC <- 100

beta_estim <- matrix(0, nrow=nMC, ncol = L)
beta_bias <- matrix(0, nrow=nMC, ncol = L)
beta_cov <- matrix(0, nrow=nMC, ncol = L)
for (m in 1:nMC){
  data_m <- generate_data(N,L,K,beta_endog,gamma,cov_errrors)
  result_m <- estim_2sls(data_m)
  coverage_m <- (beta_endog > result_m$lb) * (beta_endog < result_m$ub)
  beta_bias[m,] <- result_m$beta -  beta_endog
  beta_cov[m,] <- coverage_m
}

print("Mean Absolute Bias")
## [1] "Mean Absolute Bias"
print(colMeans(abs(beta_bias)))
## [1] 0.03715533 0.01666679 0.02951449
print("Median Bias")
## [1] "Median Bias"
print(colMedians(abs(beta_bias)))
## [1] 0.02906062 0.01302938 0.02515336
print("%5 Rejection Probability")
## [1] "%5 Rejection Probability"
print(1-colMeans(beta_cov))
## [1] 0.04 0.05 0.03