Two-step regressions in accounting research

Accounting researchers often use two-step regression approaches, and these come in several varieties. Two-stage least squares (2SLS) is one obvious case. Another is the use of residuals from a first-stage regression as a variable in the second-step (main) regression: excess compensation, excess accruals, and so on.
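
To fix ideas, here is a minimal sketch of the residual-based pattern. All names (firms, comp, size, outcome) are made up for illustration and the data are simulated noise; the point is only the mechanics of the two steps.

# Hypothetical illustration of the residual-based two-step pattern
set.seed(2021)
firms <- data.frame(size = rnorm(100))
firms$comp <- 2 * firms$size + rnorm(100)  # compensation driven by size
firms$outcome <- rnorm(100)                # an outcome of interest

# Step 1: "excess" compensation is the residual from a compensation model
firms$excess_comp <- residuals(lm(comp ~ size, data = firms))

# Step 2: use the residual as a regressor in the main regression
summary(lm(outcome ~ excess_comp, data = firms))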

I use the following simulation to examine two questions. First, what's the impact on estimated standard errors? Second, what's the impact on estimated coefficients?

The code that produced this document can be found here.

Simulation set-up

# Set up parallel processing
library(doMC)
library(foreach)
registerDoMC(cores = parallel::detectCores())

# Number of simulations run in analyses
M <- 1000

# Parameter used in second analysis
gamma <- 0.5

# Data-generating function; gamma_32 is the coefficient on z_2 in x_3
simulation <- function(gamma_32) {

    # Hard-coded parameter values
    sigma_e <- 0.4
    sigma_z <- 1
    N <- 1000

    b_1 <- 0
    b_2 <- 3
    b_3 <- 2

    gamma_21 <- 0.5

    # Generate random variables
    z_1 <- rnorm(N, sd = sigma_z)
    z_2 <- rnorm(N, sd = sigma_z)
    z_3 <- rnorm(N, sd = sigma_z)
    e <- rnorm(N, sd = sigma_e)

    # Make data set
    x_1 <- z_1
    x_2 <- z_2 + gamma_21 * z_1
    x_3 <- z_3 + gamma_32 * z_2
    y <- x_1 * b_1 + x_2 * b_2 + x_3 * b_3 + e

    return(data.frame(y, x_1, x_2, x_3))
}
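
For example, a single draw from this data-generating process looks like this (no seed is set, so values differ across runs):

head(simulation(gamma_32 = 0))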

Impact on standard errors

The first question is: What is the impact on standard errors of using a two-step approach? The answer appears to be: It increases them. The following is a single run of the simulation, but it is consistent with what I saw in other runs. Essentially, the variance of the regression error term is blown up dramatically by the omission of \( x_2 \) from the second step.
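
To see the magnitude, here is a back-of-the-envelope calculation (my sketch, using the hard-coded parameter values above and the run below, in which \( \gamma_{32} = 0 \) so that \( x_2 \) is uncorrelated with both second-step regressors). The omitted term \( b_2 x_2 \) ends up in the error, so the second-step error variance is approximately

\[ b_2^2 \operatorname{Var}(x_2) + \sigma_e^2 = 3^2 \times \sigma_z^2 (1 + \gamma_{21}^2) + 0.4^2 = 9 \times 1.25 + 0.16 = 11.41, \]

implying a residual standard error of about \( \sqrt{11.41} \approx 3.38 \), in line with the 3.46 in the two-step output below (versus roughly 0.40 for the full model).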

Note that by setting \( \gamma_{32} \), the coefficient on \( z_2 \) used in generating \( x_3 \), to zero, we make \( x_2 \) and \( x_3 \) independent random variables, which means we get no bias in inference from using the two-step approach.
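
A quick sanity check on a fresh draw (my addition; check.data is just another run of the simulation) confirms that the sample correlation between \( x_2 \) and \( x_3 \) is close to zero:

check.data <- simulation(gamma_32 = 0)
cor(check.data$x_2, check.data$x_3)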

# Two-step approach
reg.data <- simulation(gamma_32 = 0)
first.step <- lm(x_1 ~ x_2, data = reg.data)
x_1_resid <- first.step$residuals
summary(lm(y ~ x_1_resid + x_3, data = reg.data))
## 
## Call:
## lm(formula = y ~ x_1_resid + x_3, data = reg.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.266  -2.299   0.034   2.489   9.993 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.0711     0.1094   -0.65     0.52    
## x_1_resid    -0.0108     0.1216   -0.09     0.93    
## x_3           1.9159     0.1088   17.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.46 on 997 degrees of freedom
## Multiple R-squared:  0.238,  Adjusted R-squared:  0.236 
## F-statistic:  155 on 2 and 997 DF,  p-value: <2e-16
# One-step approach
summary(lm(y ~ x_1 + x_2 + x_3, data = reg.data))
## 
## Call:
## lm(formula = y ~ x_1 + x_2 + x_3, data = reg.data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.278 -0.272 -0.013  0.264  1.390 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.0176     0.0128    1.38     0.17    
## x_1          -0.0146     0.0142   -1.03     0.30    
## x_2           2.9919     0.0126  237.14   <2e-16 ***
## x_3           2.0082     0.0127  158.57   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.403 on 996 degrees of freedom
## Multiple R-squared:  0.99,   Adjusted R-squared:  0.99 
## F-statistic: 3.18e+04 on 3 and 996 DF,  p-value: <2e-16

Now let's do that 1000 times and plot the resulting standard errors from each of the two approaches.

sim_1 <- function(n) {
    require(lmtest)
    reg.data <- simulation(gamma_32 = 0)
    # Two-step approach: residualize x_1 with respect to x_2
    first.step <- lm(x_1 ~ x_2, data = reg.data)
    x_1_resid <- first.step$residuals
    fitted.2s <- lm(y ~ x_1_resid + x_3, data = reg.data)
    # One-step (full) approach
    fitted.full <- lm(y ~ x_1 + x_2 + x_3, data = reg.data)
    # Collect the coefficient on x_1 (row 2) and its standard error
    # from each model
    results <- data.frame(n = n,
                          beta.s2 = coeftest(fitted.2s)[2, 1],
                          se.s2 = coeftest(fitted.2s)[2, 2],
                          beta.full = coeftest(fitted.full)[2, 1],
                          se.full = coeftest(fitted.full)[2, 2])
    return(results)
}

sim_1_results <-
    foreach (i = 1:M, .combine = "rbind") %dopar% {
        sim_1(i)
    }

library(ggplot2)
p <- qplot(se.full, se.s2, data=sim_1_results)
p + stat_smooth(method="lm", se=FALSE)

[Figure (chunk se_impact_reps): scatter plot of two-step standard errors (se.s2) against full-regression standard errors (se.full), with a fitted line.]

The average standard error from the two-step approach is 8.446 times greater than that for the full regression approach.

While the estimated standard errors increase, an important question is whether they understate or overstate the true standard error. It seems that they overstate it: the standard deviation of the estimated coefficients on \( x_1 \) is 0.015, while the average estimated standard error is 0.120 (i.e., 8.0 times greater). In contrast, for the full model, the standard deviation of the estimated coefficients on \( x_1 \) is 0.014, consistent with the average estimated standard error of 0.014.
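
For reference, these summary statistics are simple functions of the columns of sim_1_results defined above:

# Ratio of average standard errors (two-step vs. full)
mean(sim_1_results$se.s2) / mean(sim_1_results$se.full)

# Two-step: sd of coefficient estimates vs. average estimated SE
sd(sim_1_results$beta.s2)
mean(sim_1_results$se.s2)

# Full model: sd of coefficient estimates vs. average estimated SE
sd(sim_1_results$beta.full)
mean(sim_1_results$se.full)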

Impact on inferences

To examine the second question, we set \( \gamma_{32} \), the coefficient on \( z_2 \) used in generating \( x_3 \), to \( 0.5 \). Now \( x_2 \) and \( x_3 \) are correlated and we get significant bias in inference from using the two-step approach. Specifically, the coefficient on \( x_1 \) loads even though \( x_1 \) is not in the true model (\( b_1 = 0 \)). Using a single regression produces correct inferences.
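
A rough population calculation (my sketch, using the parameter values hard-coded above) shows where the loading comes from. The population first-step residual is \( r = x_1 - 0.4 x_2 = 0.8 z_1 - 0.4 z_2 \), since \( \operatorname{Cov}(x_1, x_2)/\operatorname{Var}(x_2) = 0.5/1.25 = 0.4 \). With \( \gamma_{32} = 0.5 \), the omitted term \( b_2 x_2 \) is correlated with \( x_3 \) (\( \operatorname{Cov}(x_2, x_3) = 0.5 \)), and \( r \) is also correlated with \( x_3 \) (\( \operatorname{Cov}(r, x_3) = -0.2 \)), so both second-step coefficients are contaminated. Using \( \operatorname{Var}(r) = 0.8 \), \( \operatorname{Var}(x_3) = 1.25 \), \( \operatorname{Cov}(r, y) = -0.4 \), and \( \operatorname{Cov}(x_3, y) = 4 \), the population normal equations

\[
\begin{pmatrix} 0.8 & -0.2 \\ -0.2 & 1.25 \end{pmatrix}
\begin{pmatrix} a \\ c \end{pmatrix}
=
\begin{pmatrix} -0.4 \\ 4 \end{pmatrix}
\]

give \( a = 0.3125 \) on the residual and \( c = 3.25 \) on \( x_3 \), close to the estimates of 0.346 and 3.270 in the output below.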

# Two-step approach
reg.data <- simulation(gamma_32 = gamma)
first.step <- lm(x_1 ~ x_2, data = reg.data)
x_1_resid <- first.step$residuals
summary(lm(y ~ x_1_resid + x_3, data = reg.data))
## 
## Call:
## lm(formula = y ~ x_1_resid + x_3, data = reg.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.301  -2.211  -0.078   2.193  10.092 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.00934    0.10306   -0.09   0.9278    
## x_1_resid    0.34591    0.12029    2.88   0.0041 ** 
## x_3          3.27027    0.09539   34.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.26 on 997 degrees of freedom
## Multiple R-squared:  0.546,  Adjusted R-squared:  0.545 
## F-statistic:  599 on 2 and 997 DF,  p-value: <2e-16
# One-step approach
summary(lm(y ~ x_1 + x_2 + x_3, data = reg.data))
## 
## Call:
## lm(formula = y ~ x_1 + x_2 + x_3, data = reg.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1001 -0.2724  0.0017  0.2666  1.2407 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.00768    0.01215    0.63     0.53    
## x_1         -0.00364    0.01424   -0.26     0.80    
## x_2          2.99629    0.01289  232.54   <2e-16 ***
## x_3          1.99998    0.01222  163.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.384 on 996 degrees of freedom
## Multiple R-squared:  0.994,  Adjusted R-squared:  0.994 
## F-statistic: 5.23e+04 on 3 and 996 DF,  p-value: <2e-16

Now let's do that 1000 times and plot the t-statistics from the two-step approach.

sim_2 <- function(n) {
    require(lmtest)
    reg.data <- simulation(gamma_32 = gamma)
    # Two-step approach: residualize x_1 with respect to x_2
    first.step <- lm(x_1 ~ x_2, data = reg.data)
    x_1_resid <- first.step$residuals
    fitted.2s <- lm(y ~ x_1_resid + x_3, data = reg.data)
    # One-step (full) approach
    fitted.full <- lm(y ~ x_1 + x_2 + x_3, data = reg.data)
    # Collect the t-statistic and p-value on x_1 (row 2) from each model
    return(data.frame(n = n,
                      t.stat.2s = coeftest(fitted.2s)[2, 3],
                      p.val.2s = coeftest(fitted.2s)[2, 4],
                      t.stat.full = coeftest(fitted.full)[2, 3],
                      p.val.full = coeftest(fitted.full)[2, 4]))
}

sim_2_results <-
    foreach (i = 1:M, .combine = "rbind") %dopar% {
        sim_2(i)
    }

sig.level <- 0.05
sim_2_results$significant.2s <- sim_2_results$p.val.2s < sig.level
sim_2_results$significant.full <- sim_2_results$p.val.full < sig.level

library(ggplot2)
qplot(t.stat.2s, data=sim_2_results, fill=significant.2s, binwidth=0.03)

[Figure (chunk tstat_impact_reps): histogram of t-statistics on the first-step residual from the two-step approach, shaded by significance at the 5% level.]

The two-step approach yields statistically significant results in 96.6% of cases (at the 5% level with a two-sided test) versus 5.9% of cases with the full regression approach, the latter being close to the nominal 5% rate.
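
For reference, these rejection rates are just the means of the significance indicators computed above:

mean(sim_2_results$significant.2s)
mean(sim_2_results$significant.full)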