The choice of functional form matters (for small data sizes), even if the models are algebraically equivalent.

Take model 1 for example:

\(y = \beta_0 + \beta_1(x_1 - x_2) + \epsilon\)

Model 1 appears equivalent to model 2 if we do some algebra:

\(y = \beta_0 + \beta_1x_1 - \beta_1x_2 + \epsilon\)

When we estimate model 2 with OLS, the functional form changes to:

\(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \epsilon\)

Surprisingly the predictions from model 1 and 2 are not the same even if we use OLS to estimate both models on the same data.

First, let us simulate some data for \(x_1\), \(x_2\), and \((x_1 - x_2)\).

set.seed(1983)

x1 <- rnorm(40, 10, 1)
x2 <- rnorm(40, 5, 1)

diff_x1_x2 <- x1 - x2

Second, let us assume that the “true model” is model 1 with a \(\epsilon \sim N(0, 5)\).

set.seed(1984)

y = 10 + 2 * (diff_x1_x2) + rnorm(40,0,5)

df <- data.frame(y=y, x1=x1, x2=x2, diff_x1_x2=diff_x1_x2)

Third, fit model 1 and model 2 using OLS.

lm1 <- lm(y~diff_x1_x2, data=df)

summary(lm1)
## 
## Call:
## lm(formula = y ~ diff_x1_x2, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.0826  -2.6528   0.4569   3.5331  11.7927 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.1803     2.7774   4.025 0.000262 ***
## diff_x1_x2    1.9291     0.5303   3.638 0.000814 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.911 on 38 degrees of freedom
## Multiple R-squared:  0.2583, Adjusted R-squared:  0.2388 
## F-statistic: 13.23 on 1 and 38 DF,  p-value: 0.000814
lm2 <- lm(y~x1 + x2, data=df)

summary(lm2)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.0637  -2.5646   0.5373   2.9885  10.6699 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   2.0659     9.3736   0.220  0.82677   
## x1            2.5187     0.7851   3.208  0.00276 **
## x2           -1.3045     0.8108  -1.609  0.11612   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.909 on 37 degrees of freedom
## Multiple R-squared:  0.2785, Adjusted R-squared:  0.2395 
## F-statistic: 7.141 on 2 and 37 DF,  p-value: 0.002385

Although the two models have similar \(R^2\), the predictions are not the same. Also notice that the coefficient for \(x_2\) is not significant at the 10% level (even though we know for an absolute fact that \(x_2\) is part of the “true model” in a linear fashion). As a side note, rejecting a variable because the coefficient is “not statistically significant” is very risky when data sets are small (n < 100).

df.plot <- data.frame(y=y, lm1.pred=lm1$fitted.values, lm2.pred=lm2$fitted.values)

cor(df.plot$lm1.pred, df.plot$lm2.pred)^2
## [1] 0.9274339
plot(df.plot)

Fortunately, if we increase the sample size, the differences between the two models are extremely small.

set.seed(1983)

x1 <- rnorm(500, 10, 1)
x2 <- rnorm(500, 5, 1)

diff_x1_x2 <- x1 - x2
set.seed(1984)

y = 10 + 2 * (diff_x1_x2) + rnorm(500,0,5)

df <- data.frame(y=y, x1=x1, x2=x2, diff_x1_x2=diff_x1_x2)
lm1 <- lm(y~diff_x1_x2, data=df)

summary(lm1)
## 
## Call:
## lm(formula = y ~ diff_x1_x2, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.2469  -3.1620  -0.2106   3.3197  13.7638 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.3356     0.7897   11.82   <2e-16 ***
## diff_x1_x2    2.1115     0.1512   13.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.816 on 498 degrees of freedom
## Multiple R-squared:  0.2813, Adjusted R-squared:  0.2798 
## F-statistic: 194.9 on 1 and 498 DF,  p-value: < 2.2e-16
lm2 <- lm(y~x1 + x2, data=df)

summary(lm2)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.2511  -3.1556  -0.1989   3.3122  13.7672 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.5185     2.5328   3.758 0.000192 ***
## x1            2.0990     0.2235   9.392  < 2e-16 ***
## x2           -2.1230     0.2135  -9.945  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.821 on 497 degrees of freedom
## Multiple R-squared:  0.2813, Adjusted R-squared:  0.2784 
## F-statistic: 97.26 on 2 and 497 DF,  p-value: < 2.2e-16
df.plot <- data.frame(y=y, lm1.pred=lm1$fitted.values, lm2.pred=lm2$fitted.values)

cor(df.plot$lm1.pred, df.plot$lm2.pred)^2
## [1] 0.9999703
plot(df.plot)