The linear regression models examined so far have always included a constant term representing the point where the regression line crosses the y-axis, known as the intercept. However, there are cases where an intercept may not conceptually apply to the data being modeled. For example, a factory cannot produce widgets if the equipment is not running, a salesperson cannot sell without any products, and so on. Although a priori knowledge that \(y = 0\) when \(x = 0\) is not enough to completely justify regression through the origin (Hocking, as cited in Eisenhauer), the resulting linear model may be a better fit. Methods of determining whether an intercept should be included range from testing if the intercept is significant to fitting both models and comparing their standard errors.
In a previous example with the cars dataset, the speed and stopping distance of cars were modeled with linear regression. This case lends itself to regression through the origin: one would assume that if a car’s speed were 0, it would have no stopping distance since it isn’t moving. This post explores regression through the origin in comparison to the model fitted in the earlier example to determine whether the reasoning above yields a better-fitted regression model.
Load the cars dataset and the packages that will be used.
data("cars")
library(ggplot2)
library(gridExtra)
Building a linear regression model that passes through the origin is straightforward with the lm() function. Adding a zero term to the formula tells the function to restrict the line to the origin.
Both models are fit and the respective summaries printed using the summary() function.
cars.lm <- lm(dist ~ speed, data = cars)
cars.lm2 <- lm(dist ~ 0 + speed, data = cars) # The 0 term tells lm() to fit the line through the origin
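For reference, R’s - 1 formula syntax removes the intercept in the same way and produces an identical fit (cars.lm2b is just an illustrative name, not used later in the post):
cars.lm2b <- lm(dist ~ speed - 1, data = cars) # Identical model to cars.lm2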
summary(cars.lm)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
summary(cars.lm2)
##
## Call:
## lm(formula = dist ~ 0 + speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -26.183 -12.637  -5.455   4.590  50.181
##
## Coefficients:
##       Estimate Std. Error t value Pr(>|t|)
## speed   2.9091     0.1414   20.58   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.26 on 49 degrees of freedom
## Multiple R-squared: 0.8963, Adjusted R-squared: 0.8942
## F-statistic: 423.5 on 1 and 49 DF, p-value: < 2.2e-16
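Rather than reading the statistics off the printed summaries, they can also be extracted directly from the fitted model objects. A minimal sketch using the base R accessors coef(), sigma(), and the r.squared component of summary() (the data frame layout here is just for illustration):
# Collect the key fit statistics of both models side by side
data.frame(model = c('intercept', 'origin'),
           slope = c(coef(cars.lm)['speed'], coef(cars.lm2)['speed']),
           resid.std.err = c(sigma(cars.lm), sigma(cars.lm2)),
           r.squared = c(summary(cars.lm)$r.squared, summary(cars.lm2)$r.squared))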
The most noticeable difference between the two models is that the regression through the origin (RTO) model has a much higher \(r^2\) of almost 0.90. A larger \(r^2\) does not imply a better fit here, however: when the intercept is removed, \(r^2\) is computed relative to the uncentered total sum of squares \(\sum y_i^2\) rather than \(\sum (y_i - \bar{y})^2\), so the two values are not directly comparable. Of the criteria mentioned earlier for gauging which model is more appropriate, the original regression model has a lower residual standard error and its intercept is significant. The residual standard errors of the two models are not drastically different, and the standard error of the predictor coefficient is actually lower in the RTO model; still, since the original model meets both of the specified criteria, it should be the more appropriate model. To confirm the initial model is more suitable, a quick ANOVA table is constructed to compare the two models.
anova(cars.lm, cars.lm2)
## Analysis of Variance Table
##
## Model 1: dist ~ speed
## Model 2: dist ~ 0 + speed
##   Res.Df   RSS Df Sum of Sq      F  Pr(>F)
## 1     48 11354
## 2     49 12954 -1   -1600.3 6.7655 0.01232 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table shows that the regression through the origin model has a higher residual sum of squares (RSS) and that the difference between the two models is significant; therefore, the decision to select the original model was indeed correct.
What’s interesting in this case is that although the logic behind fitting the model through the origin made intuitive sense, it did not produce the best-fitting model. Thus, as referenced at the beginning of the post, the knowledge that \(y\) may equal zero when \(x\) is zero is not, on its own, enough justification to use RTO.
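As a supplementary check that is not part of the criteria above, the two models can also be compared with an information criterion. AIC() from base R accepts multiple fitted models, and the lower value indicates the preferable model:
# Compare the two models by AIC; lower is better
AIC(cars.lm, cars.lm2)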
The two regression model fits can also be visualized.
# Scatter plot with the fitted line of the original (intercept) model
lm.scatter <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point(color = '#2980B9', size = 4) + xlim(c(0, 25)) +
  geom_smooth(method = lm, se = FALSE, fullrange = TRUE, color = '#2C3E50') +
  labs(title = 'Original Regression Line')

# Scatter plot with the RTO fit drawn manually using geom_abline(),
# since the slope is the model's only coefficient and the intercept is 0
lm.scatter2 <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point(color = '#2980B9', size = 4) + xlim(c(0, 25)) +
  geom_abline(intercept = 0, slope = cars.lm2$coefficients[1], color = '#2C3E50', size = 1.1) +
  labs(title = 'Regression through the Origin')

# Arrange the two plots in a single column for comparison
grid.arrange(lm.scatter, lm.scatter2)
In the regression through the origin setting, the linear model becomes:
\[ y_i = \beta_1 x_i + \epsilon_i \]
The error term remains normally distributed with mean 0 and variance \(\sigma^2\).
The least squares estimator of \(\beta_1\) becomes:
\[ \hat{\beta}_1 = \frac{\sum x_i y_i}{\sum x_i^2} \]
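As a quick sanity check, this closed-form estimate can be computed directly from the data and compared to the coefficient reported by lm():
# The closed-form slope estimate should match coef(cars.lm2)
with(cars, sum(speed * dist) / sum(speed^2))
coef(cars.lm2)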
As there is only one estimated parameter in RTO, the error degrees of freedom increase by one, from \(n - 2\) to \(n - 1\). The \(SSE\) and \(MSE\) are therefore:
\[ SSE = \sum^n_{i=1} (y_i - \hat{\beta}_1 x_i)^2 = \sum y_i^2 - \hat{\beta}_1^2 \sum x_i^2 \]
Or, \(SSE = \sum y^2_i - SSR\), where \(SSR\) is the regression sum of squares, defined as:
\[ SSR = \hat{\beta}_1^2 \sum x_i^2 \]
Thus the \(MSE\) is:
\[ MSE = \frac{SSE}{n - 1} \]
The results from the linear model can be verified by computing the above equations in a short function.
rto.estimates <- function(x, y) {
  b1 <- sum(x * y) / sum(x^2)        # slope estimate
  ssr <- b1^2 * sum(x^2)             # regression sum of squares
  sse <- sum(y^2) - ssr              # error sum of squares
  mse <- sse / (length(x) - 1)       # mean squared error with n - 1 df
  msr <- ssr / 1                     # mean square for regression (one parameter)
  res.std.err <- sqrt(mse)           # residual standard error
  f.stat <- msr / mse                # F-statistic
  std.error <- sqrt(mse / sum(x^2))  # standard error of the slope
  r2 <- ssr / (sse + ssr)            # r-squared relative to the uncentered total SS
  adj.r2 <- r2 - (1 - r2) * (2 - 1) / (length(x) - 1)  # adjusted r-squared

  res <- data.frame(rbind(b1, res.std.err, f.stat, std.error, r2, adj.r2))
  rownames(res) <- c('b1', 'Residual Standard Error', 'F-Statistic', 'b1 Standard Error',
                     'r-squared', 'Adjusted r-squared')
  colnames(res) <- 'Estimates'
  print(format(res, scientific = FALSE, digits = 3))
}
rto.estimates(cars$speed, cars$dist)
##                         Estimates
## b1                          2.909
## Residual Standard Error    16.259
## F-Statistic               423.468
## b1 Standard Error           0.141
## r-squared                   0.896
## Adjusted r-squared          0.894
cars.lm2
##
## Call:
## lm(formula = dist ~ 0 + speed, data = cars)
##
## Coefficients:
## speed
## 2.909
Regression through the origin can be appropriate in many situations where it makes intuitive sense that \(y = 0\) when \(x = 0\), but the resulting model should be thoroughly examined before it is accepted. As seen earlier, the prior knowledge that stopping distance would obviously be 0 if a car were not moving did not lead to a better-fitting model. In cases where it may make sense to fit a model through the origin, it is recommended to test it using the criteria above and to compare it to a model with an intercept.
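To make that comparison routine, the workflow used in this post can be wrapped in a small helper. The function below is a minimal sketch (the name compare.origin and its output layout are my own, not from any package) that fits both models and reports the criteria discussed above:
# Fit a model with and without an intercept and report the comparison criteria
compare.origin <- function(formula, data) {
  fit.int <- lm(formula, data = data)    # model with an intercept
  fit.rto <- update(fit.int, . ~ . - 1)  # same model forced through the origin
  list(residual.std.errors = c(intercept = sigma(fit.int), origin = sigma(fit.rto)),
       intercept.p.value = coef(summary(fit.int))['(Intercept)', 'Pr(>|t|)'],
       anova.p.value = anova(fit.int, fit.rto)[2, 'Pr(>F)'])
}
compare.origin(dist ~ speed, data = cars)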
Eisenhauer, J. G. (2003). Regression through the Origin. Teaching Statistics, 25(3), 76–80. Retrieved from https://online.stat.psu.edu/~ajw13/stat501/SpecialTopics/Reg_thru_origin.pdf
Regression through the origin. (2013, November 21). Retrieved from http://statwiki.ucdavis.edu/Core/Regression_Analysis/Simple_linear_regression/Regression_through_the_origin