Accounting researchers often use two-step regression approaches, which come in a variety of forms. Clearly, 2SLS is one case. Another is the use of residuals from a first-step regression in the second-step (main) regression: excess compensation, excess accruals, and so on.
I use the following simulation to examine two questions. First, what's the impact on estimated standard errors? Second, what's the impact on estimated coefficients?
The code that produced this document can be found here.
# Set up parallel processing
library(doMC)
library(foreach)
registerDoMC(cores = parallel::detectCores())
# Number of simulations run in analyses
M <- 1000
# Parameter used in second analysis
gamma <- 0.5
# Simulation function: generates one data set for a given gamma_32
simulation <- function(gamma_32) {
    # Hard-coded parameter values
    sigma_e <- 0.4
    sigma_z <- 1
    N <- 1000
    b_1 <- 0
    b_2 <- 3
    b_3 <- 2
    gamma_21 <- 0.5

    # Generate random variables
    z_1 <- rnorm(N, sd = sigma_z)
    z_2 <- rnorm(N, sd = sigma_z)
    z_3 <- rnorm(N, sd = sigma_z)
    e <- rnorm(N, 0, sd = sigma_e)

    # Make data set
    x_1 <- z_1
    x_2 <- z_2 + gamma_21 * z_1
    x_3 <- z_3 + gamma_32 * z_2
    y <- x_1 * b_1 + x_2 * b_2 + x_3 * b_3 + e
    return(data.frame(y, x_1, x_2, x_3))
}
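In equation form, the data-generating process coded above is

\[
\begin{aligned}
x_1 &= z_1, \\
x_2 &= z_2 + \gamma_{21} z_1, \\
x_3 &= z_3 + \gamma_{32} z_2, \\
y &= b_1 x_1 + b_2 x_2 + b_3 x_3 + e,
\end{aligned}
\]

where \( z_1 \), \( z_2 \), \( z_3 \), and \( e \) are independent mean-zero normal variables, with \( b_1 = 0 \), \( b_2 = 3 \), \( b_3 = 2 \), and \( \gamma_{21} = 0.5 \).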
The first question is: What is the impact on standard errors of using a two-step approach? The answer appears to be: it increases them. The following is a single run of the simulation, but it appears consistent with what I saw in other runs. Essentially, the variance of the regression error term is blown up dramatically by the omission of \( x_2 \) from the second-step regression.
Note that by setting \( \gamma_{32} \), the coefficient on \( z_2 \) used in generating \( x_3 \), to zero, \( x_2 \) and \( x_3 \) are independent random variables, which means we get no bias in inference from using the two-step approach.
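As a quick sanity check, the sample correlation between \( x_2 \) and \( x_3 \) in a simulated data set should be close to zero:

# Check that x_2 and x_3 are (essentially) uncorrelated when gamma_32 = 0
with(simulation(gamma_32 = 0), cor(x_2, x_3))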
# Two-step approach
reg.data <- simulation(gamma_32 = 0)
first.step <- lm(x_1 ~ x_2, data = reg.data)
x_1_resid <- first.step$residuals
summary(lm(y ~ x_1_resid + x_3, data = reg.data))
##
## Call:
## lm(formula = y ~ x_1_resid + x_3, data = reg.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.266 -2.299 0.034 2.489 9.993
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0711 0.1094 -0.65 0.52
## x_1_resid -0.0108 0.1216 -0.09 0.93
## x_3 1.9159 0.1088 17.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.46 on 997 degrees of freedom
## Multiple R-squared: 0.238, Adjusted R-squared: 0.236
## F-statistic: 155 on 2 and 997 DF, p-value: <2e-16
# One-step approach
summary(lm(y ~ x_1 + x_2 + x_3, data = reg.data))
##
## Call:
## lm(formula = y ~ x_1 + x_2 + x_3, data = reg.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.278 -0.272 -0.013 0.264 1.390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0176 0.0128 1.38 0.17
## x_1 -0.0146 0.0142 -1.03 0.30
## x_2 2.9919 0.0126 237.14 <2e-16 ***
## x_3 2.0082 0.0127 158.57 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.403 on 996 degrees of freedom
## Multiple R-squared: 0.99, Adjusted R-squared: 0.99
## F-statistic: 3.18e+04 on 3 and 996 DF, p-value: <2e-16
Now let's do that 1000 times and plot the resulting standard errors from each of the two approaches.
sim_1 <- function(n) {
    require(lmtest)
    reg.data <- simulation(gamma_32 = 0)
    first.step <- lm(x_1 ~ x_2, data = reg.data)
    x_1_resid <- first.step$residuals
    fitted.2s <- lm(y ~ x_1_resid + x_3, data = reg.data)
    fitted.full <- lm(y ~ x_1 + x_2 + x_3, data = reg.data)
    results <- data.frame(n = n,
                          beta.s2 = coeftest(fitted.2s)[2, 1],
                          se.s2 = coeftest(fitted.2s)[2, 2],
                          beta.full = coeftest(fitted.full)[2, 1],
                          se.full = coeftest(fitted.full)[2, 2])
    return(results)
}
sim_1_results <-
    foreach(i = 1:M, .combine = "rbind") %dopar% {
        sim_1(i)
    }
library(ggplot2)
p <- qplot(se.full, se.s2, data=sim_1_results)
p + stat_smooth(method="lm", se=FALSE)
The average standard error from the two-step approach is 8.446 times greater than that for the full regression approach.
While the estimated standard errors increase, an important question is whether they understate or overstate the true variability of the coefficient estimates. It seems that they overstate it: the standard deviation of the estimated coefficients on \( x_1 \) across simulations is 0.015, while the average estimated standard error is 0.120 (i.e., about 8.0 times greater). In contrast, for the full model, the standard deviation of the estimated coefficients on \( x_1 \) is 0.014, consistent with the average estimated standard error of 0.014.
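These figures are simple summaries of sim_1_results; something along the following lines would reproduce them:

# Ratio of average estimated standard errors (two-step vs. full model)
mean(sim_1_results$se.s2) / mean(sim_1_results$se.full)

# "True" variability of the coefficient estimates vs. average estimated SEs
sd(sim_1_results$beta.s2)     # sd of two-step estimates of the coefficient on x_1
mean(sim_1_results$se.s2)     # average estimated SE, two-step
sd(sim_1_results$beta.full)   # sd of full-model estimates of the coefficient on x_1
mean(sim_1_results$se.full)   # average estimated SE, full model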
To examine the second question, we set \( \gamma_{32} \), the coefficient on \( z_2 \) used in generating \( x_3 \), to \( 0.5 \). Now \( x_2 \) and \( x_3 \) are correlated and we get significant bias in inference from using the two-step approach. Specifically, the coefficient on \( x_1 \) loads even though it's not in the true model. Using a single regression produces correct inferences.
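A back-of-the-envelope population calculation suggests why, writing \( x_1^{\perp} \) for the first-step residual and replacing the estimated first-step slope with its probability limit \( \gamma = \gamma_{21}/(1 + \gamma_{21}^2) = 0.4 \):

\[
\begin{aligned}
x_1^{\perp} &\approx z_1 - \gamma (z_2 + \gamma_{21} z_1), \\
\operatorname{Cov}(x_1^{\perp}, x_2) &= 0, \qquad
\operatorname{Cov}(x_1^{\perp}, x_3) = -\gamma \gamma_{32} \sigma_z^2 = -0.2, \qquad
\operatorname{Cov}(x_2, x_3) = \gamma_{32} \sigma_z^2 = 0.5.
\end{aligned}
\]

So while the first-step residual is orthogonal to \( x_2 \) by construction, it is correlated with \( x_3 \), which is itself correlated with the omitted \( x_2 \), and the resulting omitted-variable bias spills over onto the coefficient on the residual.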
# Two-step approach
reg.data <- simulation(gamma_32 = gamma)
first.step <- lm(x_1 ~ x_2, data = reg.data)
x_1_resid <- first.step$residuals
summary(lm(y ~ x_1_resid + x_3, data = reg.data))
##
## Call:
## lm(formula = y ~ x_1_resid + x_3, data = reg.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.301 -2.211 -0.078 2.193 10.092
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.00934 0.10306 -0.09 0.9278
## x_1_resid 0.34591 0.12029 2.88 0.0041 **
## x_3 3.27027 0.09539 34.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.26 on 997 degrees of freedom
## Multiple R-squared: 0.546, Adjusted R-squared: 0.545
## F-statistic: 599 on 2 and 997 DF, p-value: <2e-16
# One-step approach
summary(lm(y ~ x_1 + x_2 + x_3, data = reg.data))
##
## Call:
## lm(formula = y ~ x_1 + x_2 + x_3, data = reg.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1001 -0.2724 0.0017 0.2666 1.2407
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.00768 0.01215 0.63 0.53
## x_1 -0.00364 0.01424 -0.26 0.80
## x_2 2.99629 0.01289 232.54 <2e-16 ***
## x_3 1.99998 0.01222 163.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.384 on 996 degrees of freedom
## Multiple R-squared: 0.994, Adjusted R-squared: 0.994
## F-statistic: 5.23e+04 on 3 and 996 DF, p-value: <2e-16
Now let's do that 1000 times and plot the t-statistics from the two-step approach.
sim_2 <- function(n) {
    require(lmtest)
    reg.data <- simulation(gamma_32 = gamma)
    first.step <- lm(x_1 ~ x_2, data = reg.data)
    x_1_resid <- first.step$residuals
    fitted.2s <- lm(y ~ x_1_resid + x_3, data = reg.data)
    fitted.full <- lm(y ~ x_1 + x_2 + x_3, data = reg.data)
    return(data.frame(n = n,
                      t.stat.2s = coeftest(fitted.2s)[2, 3],
                      p.val.2s = coeftest(fitted.2s)[2, 4],
                      t.stat.full = coeftest(fitted.full)[2, 3],
                      p.val.full = coeftest(fitted.full)[2, 4]))
}
sim_2_results <-
    foreach(i = 1:M, .combine = "rbind") %dopar% {
        sim_2(i)
    }
sig.level <- 0.05
sim_2_results$significant.2s <- sim_2_results$p.val.2s < sig.level
sim_2_results$significant.full <- sim_2_results$p.val.full < sig.level
library(ggplot2)
qplot(t.stat.2s, data=sim_2_results, fill=significant.2s, binwidth=0.03)
The two-step approach yields a statistically significant coefficient on \( x_1 \) in 96.6% of cases (at the 5% level with a two-sided test), versus 5.9% of cases with the full regression approach, even though the true coefficient is zero.
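These rejection rates are just the means of the indicator variables constructed above:

# Rejection rates at the 5% level for each approach
mean(sim_2_results$significant.2s)    # two-step approach
mean(sim_2_results$significant.full)  # full (one-step) regression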