The choice of functional form matters, especially for small samples, even when two models are algebraically equivalent.
Take model 1, for example:
\(y = \beta_0 + \beta_1(x_1 - x_2) + \epsilon\)
Expanding the parentheses, model 1 appears to be equivalent to model 2:
\(y = \beta_0 + \beta_1x_1 - \beta_1x_2 + \epsilon\)
When we estimate model 2 with OLS, however, each regressor gets its own free coefficient:
\(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \epsilon\)
Surprisingly, the predictions from models 1 and 2 are not the same, even when both are estimated by OLS on the same data. The reason is that model 1 imposes the restriction \(\beta_2 = -\beta_1\), while model 2 leaves \(\beta_1\) and \(\beta_2\) unconstrained, and in a finite sample the unrestricted estimates will not satisfy the restriction exactly.
First, let us simulate some data for \(x_1\), \(x_2\), and \((x_1 - x_2)\).
set.seed(1983)
x1 <- rnorm(40, 10, 1)
x2 <- rnorm(40, 5, 1)
diff_x1_x2 <- x1 - x2
Second, let us assume that the “true model” is model 1 with \(\beta_0 = 10\), \(\beta_1 = 2\), and \(\epsilon \sim N(0, 5^2)\) (note that rnorm takes the standard deviation, here 5).
set.seed(1984)
y <- 10 + 2 * diff_x1_x2 + rnorm(40, 0, 5)
df <- data.frame(y=y, x1=x1, x2=x2, diff_x1_x2=diff_x1_x2)
Third, fit model 1 and model 2 using OLS.
lm1 <- lm(y~diff_x1_x2, data=df)
summary(lm1)
##
## Call:
## lm(formula = y ~ diff_x1_x2, data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -10.0826  -2.6528   0.4569   3.5331  11.7927
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  11.1803     2.7774   4.025 0.000262 ***
## diff_x1_x2    1.9291     0.5303   3.638 0.000814 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.911 on 38 degrees of freedom
## Multiple R-squared: 0.2583, Adjusted R-squared: 0.2388
## F-statistic: 13.23 on 1 and 38 DF, p-value: 0.000814
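As an aside, the difference column does not have to be precomputed: wrapping the arithmetic in I() inside the formula asks lm to compute \(x_1 - x_2\) literally, which gives the same fit as lm1.
# Same model as lm1, with the difference computed inside the formula
lm1_alt <- lm(y ~ I(x1 - x2), data = df)
coef(lm1_alt)  # matches the estimates in the summary above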
lm2 <- lm(y~x1 + x2, data=df)
summary(lm2)
##
## Call:
## lm(formula = y ~ x1 + x2, data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -10.0637  -2.5646   0.5373   2.9885  10.6699
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   2.0659     9.3736   0.220  0.82677
## x1            2.5187     0.7851   3.208  0.00276 **
## x2           -1.3045     0.8108  -1.609  0.11612
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.909 on 37 degrees of freedom
## Multiple R-squared: 0.2785, Adjusted R-squared: 0.2395
## F-statistic: 7.141 on 2 and 37 DF, p-value: 0.002385
Although the two models have similar \(R^2\), the predictions are not the same. Also notice that the coefficient on \(x_2\) is not significant at the 10% level, even though we know by construction that \(x_2\) is part of the “true model” in a linear fashion. As a side note, dropping a variable because its coefficient is “not statistically significant” is very risky when the data set is small (n < 100).
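Because model 1 is model 2 with the restriction \(\beta_2 = -\beta_1\) imposed, the two fits are nested, and the restriction can be tested directly instead of eyeballing individual t-statistics. A minimal sketch (output omitted):
# F-test of the restriction implied by model 1 (the models are nested)
anova(lm1, lm2)
Since model 1 is the true model here, we would expect this test not to reject the restriction.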
df.plot <- data.frame(y=y, lm1.pred=lm1$fitted.values, lm2.pred=lm2$fitted.values)
cor(df.plot$lm1.pred, df.plot$lm2.pred)^2
## [1] 0.9274339
plot(df.plot)
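To see the disagreement at the level of a single prediction, here is a minimal sketch; the new data point is arbitrary and purely hypothetical:
# Both models predict from the same hypothetical observation
newpoint <- data.frame(x1 = 12, x2 = 4)
newpoint$diff_x1_x2 <- newpoint$x1 - newpoint$x2
predict(lm1, newdata = newpoint)  # one restricted coefficient
predict(lm2, newdata = newpoint)  # two unrestricted coefficients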
Fortunately, if we increase the sample size, the differences between the two models become extremely small.
set.seed(1983)
x1 <- rnorm(500, 10, 1)
x2 <- rnorm(500, 5, 1)
diff_x1_x2 <- x1 - x2
set.seed(1984)
y <- 10 + 2 * diff_x1_x2 + rnorm(500, 0, 5)
df <- data.frame(y=y, x1=x1, x2=x2, diff_x1_x2=diff_x1_x2)
lm1 <- lm(y~diff_x1_x2, data=df)
summary(lm1)
##
## Call:
## lm(formula = y ~ diff_x1_x2, data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -13.2469  -3.1620  -0.2106   3.3197  13.7638
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   9.3356     0.7897   11.82   <2e-16 ***
## diff_x1_x2    2.1115     0.1512   13.96   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.816 on 498 degrees of freedom
## Multiple R-squared: 0.2813, Adjusted R-squared: 0.2798
## F-statistic: 194.9 on 1 and 498 DF, p-value: < 2.2e-16
lm2 <- lm(y~x1 + x2, data=df)
summary(lm2)
##
## Call:
## lm(formula = y ~ x1 + x2, data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -13.2511  -3.1556  -0.1989   3.3122  13.7672
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   9.5185     2.5328   3.758 0.000192 ***
## x1            2.0990     0.2235   9.392  < 2e-16 ***
## x2           -2.1230     0.2135  -9.945  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.821 on 497 degrees of freedom
## Multiple R-squared: 0.2813, Adjusted R-squared: 0.2784
## F-statistic: 97.26 on 2 and 497 DF, p-value: < 2.2e-16
df.plot <- data.frame(y=y, lm1.pred=lm1$fitted.values, lm2.pred=lm2$fitted.values)
cor(df.plot$lm1.pred, df.plot$lm2.pred)^2
## [1] 0.9999703
plot(df.plot)
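As a final sketch (output omitted), the agreement can be quantified directly on the fitted values:
# Worst-case gap between the two models' fitted values at n = 500
max(abs(lm1$fitted.values - lm2$fitted.values))
Consistent with the squared correlation of 0.99997 reported above, this gap should be tiny relative to the residual standard error of roughly 4.8.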