Two-Stage Least Squares (2SLS or TSLS) is a computational method commonly used to estimate instrumental variables (IV) in econometrics, particularly when dealing with endogeneity issues in linear models. The process involves two distinct stages, each with specific computational steps.
In the first stage of 2SLS, each explanatory variable that is deemed an endogenous covariate is regressed on all the exogenous variables in the model. This includes both the exogenous covariates that are directly part of the equation of interest and any instruments that are excluded from the main regression model.
The regression equation for this stage is modeled as:
\[ X = Z\delta + \text{errors} \]
Where: - \(X\) represents the matrix of endogenous variables. - \(Z\) represents the matrix of exogenous variables and instruments. - \(\delta\) is the matrix of coefficients.
The predicted values from these regressions (\(\hat{X}\)) are calculated using:
\[ \hat{\delta} = (Z^{\text{T}}Z)^{-1}Z^{\text{T}}X \]
Which leads to:
\[ \hat{X} = Z\hat{\delta} = Z(Z^{\text{T}}Z)^{-1}Z^{\text{T}}X = P_Z X \]
Here, \(P_Z\) denotes the projection matrix that projects \(X\) onto the column space of \(Z\).
In the second stage, the regression of interest is estimated using the predicted values (\(\hat{X}\)) from the first stage:
\[ Y = \hat{X}\beta + \text{noise} \]
The 2SLS estimator for the coefficients (\(\beta\)) is then given by:
\[ \beta_{\text{2SLS}} = (X^{\text{T}}P_ZX)^{-1}X^{\text{T}}P_ZY \]
This equation emphasizes that \(Y\) is regressed on the predicted values of the endogenous variables corrected for any endogeneity through the instruments.
Model Validity: The 2SLS method is valid primarily for linear models. Special consideration is needed when dealing with categorical endogenous variables. In such cases, using alternative estimation techniques for the first stage (like a probit model) followed by OLS in the second stage might lead to inconsistent estimates unless specific conditions are met. This approach is often referred to in the econometric literature as the “forbidden regression” because its validity is conditional.
Instrument Strength: The validity of 2SLS estimates critically depends on the strength and validity of the instruments. Weak instruments can lead to biased and inconsistent parameter estimates.
This method provides a robust framework for addressing endogeneity in econometric analyses, ensuring that the estimates obtained are consistent under the assumption of valid and strong instruments. Load necessary libraries and set global options.
First, we simulate a dataset with L endogenous variables, K valid instruments, and a sample size N.
library(MASS) # Ensure MASS is loaded for mvrnorm
set.seed(123) # For reproducibility
# Parameters
N <- 1000 # Sample size
L <- 3 # Number of endogenous variables
K <- 5 # Number of instruments
# Simulate instruments
Z <- matrix(rnorm(N * K), ncol = K)
print(dim(Z))
## [1] 1000 5
# Define the mean and covariance matrix for the error terms
mean_errors <- rep(0, L + 1) # Mean vector for L endogenous errors + 1 equation error
cov_errors <- diag(c(rep(1, L), 1)) # Initial identity matrix for simplicity
# Introduce correlation between the last L errors and the single V error
# Setting the covariance between U's and V
cov_errors[L + 1, 1:L] <- 0.5 # Correlation between V and each U
cov_errors[1:L, L + 1] <- 0.5 # Symmetric
print(cov_errors)
## [,1] [,2] [,3] [,4]
## [1,] 1.0 0.0 0.0 0.5
## [2,] 0.0 1.0 0.0 0.5
## [3,] 0.0 0.0 1.0 0.5
## [4,] 0.5 0.5 0.5 1.0
# Generate the correlated error terms
errors <- mvrnorm(n = N, mu = mean_errors, Sigma = cov_errors)
U <- errors[, 1:L] # Error terms for endogenous variables
V <- errors[, L + 1] # Error term for the main equation
print(dim(U))
## [1] 1000 3
# Generate coefficients and simulate endogenous variables
beta_endog <- runif(L, 1, 5)
gamma <- matrix(runif(L * K, 1, 3), ncol = L) # Relevance of instruments
print("Beta")
## [1] "Beta"
print(beta_endog)
## [1] 4.893176 1.929759 2.075163
print("Gamma")
## [1] "Gamma"
print(gamma)
## [,1] [,2] [,3]
## [1,] 1.223911 2.993690 1.106875
## [2,] 1.573729 2.175107 1.222462
## [3,] 1.207468 1.470500 1.138694
## [4,] 1.322192 2.074059 2.839018
## [5,] 2.432702 2.817634 2.787707
X <- Z %*% gamma + U # Endogenous variable construction
print("Dimension for X is:")
## [1] "Dimension for X is:"
print(dim(X))
## [1] 1000 3
# Generate the outcome variable
Y <- X %*% beta_endog + V
# Creating the dataset
data <- data.frame(y = Y, x = X, z = Z)
print(head(data))
## y x.1 x.2 x.3 z.1 z.2 z.3
## 1 -30.052708 -3.3615722 -3.123761 -3.3569627 -0.56047565 -0.99579872 -0.5116037
## 2 5.066374 1.2285438 -1.226604 0.3858558 -0.23017749 -1.03995504 0.2369379
## 3 -2.597754 -0.5166005 0.761507 -0.4203349 1.55870831 -0.01798024 -0.5415892
## 4 -19.574134 -1.9110597 -2.862495 -2.9179463 0.07050839 -0.13217513 1.2192276
## 5 -30.635058 -4.1729122 -6.222149 0.4030339 0.12928774 -2.54934277 0.1741359
## 6 82.822186 7.9428974 13.111449 8.7859670 1.71506499 1.04057346 -0.6152683
## z.4 z.5
## 1 -0.15030748 0.1965498
## 2 -0.32775713 0.6501132
## 3 -1.44816529 0.6710042
## 4 -0.69728458 -1.2841578
## 5 2.59849023 -2.0261096
## 6 -0.03741501 2.2053261
We will now perform 2SLS to estimate the effect of the endogenous variables on the outcome.
# Define the formula
formula_iv <- y ~ x.1 + x.2 + x.3 | z.1 + z.2 + z.3 + z.4 + z.5
# Perform IV regression using 2SLS
iv_model <- ivreg(formula_iv, data = data)
# Summary of the IV model
summary(iv_model)
##
## Call:
## ivreg(formula = formula_iv, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.81465 -0.68956 0.01088 0.69263 3.40721
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.02732 0.03171 0.862 0.389
## x.1 4.82809 0.03946 122.352 <2e-16 ***
## x.2 1.93229 0.02165 89.238 <2e-16 ***
## x.3 2.12260 0.02095 101.301 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.001 on 996 degrees of freedom
## Multiple R-Squared: 0.9993, Adjusted R-squared: 0.9993
## Wald test: 4.806e+05 on 3 and 996 DF, p-value: < 2.2e-16
P_z = Z %*% solve(t(Z) %*% Z ) %*% t(Z)
solve(t(X) %*% P_z %*% X, t(X) %*% P_z %*% Y)
## [,1]
## [1,] 4.827855
## [2,] 1.932762
## [3,] 2.122161
Now we write the data generating process into a function
generate_data <- function(N,L,K,beta_endog,gamma,cov_errrors) {
cov_errors <- diag(c(rep(1, L), 1)) # Initial identity matrix for simplicity
# Introduce correlation between the last L errors and the single V error
# Setting the covariance between U's and V
cov_errors[L + 1, 1:L] <- 0.5 # Correlation between V and each U
cov_errors[1:L, L + 1] <- 0.5 # Symmetric
mean_errors <- rep(0, L + 1) # Mean vector for L endogenous errors + 1 equation error
errors <- mvrnorm(n = N, mu = mean_errors, Sigma = cov_errors)
U <- errors[, 1:L] # Error terms for endogenous variables
V <- errors[, L + 1] # Error term for the main equation
# Simulate instruments
Z <- matrix(rnorm(N * K), ncol = K)
X <- Z %*% gamma + U
Y <- X %*% beta_endog + V
# Creating the dataset
data <- list(y = Y, x = X, z = Z)
}
# Parameters
N <- 1000 # Sample size
L <- 3 # Number of endogenous variables
K <- 5 # Number of instruments
beta_endog <- c(1,2,3)
gamma <- matrix(runif(L * K, 1, 3), ncol = L) # Relevance of instruments
data <-generate_data(N,L,K,beta_endog,gamma,cov_errrors)
Return the estimated 2SLS \(\hat{beta}\) and the 95 % confidence interval for 2SLS.
estim_2sls <- function(data){
Y <- data$y
X <- data$x
Z <- data$z
N <- length(Y)
K <- ncol(X)
P_z = Z %*% solve(t(Z) %*% Z ) %*% t(Z)
beta_2sls <- solve(t(X) %*% P_z %*% X, t(X) %*% P_z %*% Y)
# construct the residual
eps_hat <- Y - X %*% beta_2sls
sigma_2_2sls <- t(eps_hat) %*% eps_hat / (N-1-K)
sd_beta_2sls <- sqrt(diag(solve(t(X) %*% P_z %*% X) * sigma_2_2sls[1]))
CI_95_ub <- beta_2sls + 1.96 * sd_beta_2sls
CI_95_lb <- beta_2sls - 1.96 * sd_beta_2sls
return(list(beta=beta_2sls,lb=CI_95_lb,ub=CI_95_ub))
}
library(miscTools)
nMC <- 100
beta_estim <- matrix(0, nrow=nMC, ncol = L)
beta_bias <- matrix(0, nrow=nMC, ncol = L)
beta_cov <- matrix(0, nrow=nMC, ncol = L)
for (m in 1:nMC){
data_m <- generate_data(N,L,K,beta_endog,gamma,cov_errrors)
result_m <- estim_2sls(data_m)
coverage_m <- (beta_endog > result_m$lb) * (beta_endog < result_m$ub)
beta_bias[m,] <- result_m$beta - beta_endog
beta_cov[m,] <- coverage_m
}
print("Mean Absolute Bias")
## [1] "Mean Absolute Bias"
print(colMeans(abs(beta_bias)))
## [1] 0.03715533 0.01666679 0.02951449
print("Median Bias")
## [1] "Median Bias"
print(colMedians(abs(beta_bias)))
## [1] 0.02906062 0.01302938 0.02515336
print("%5 Rejection Probability")
## [1] "%5 Rejection Probability"
print(1-colMeans(beta_cov))
## [1] 0.04 0.05 0.03