library(dplyr)      # %>% and filter()
library(ggplot2)
library(broom)      # augment(), tidy()
library(knitr)      # kable()
library(kableExtra) # kable_styling()

df <- iris %>% filter(Petal.Length > 2)
# str(df)
df %>%
  ggplot(aes(x = Petal.Length,
             y = Sepal.Length)) +
  geom_point(alpha = 0.3) +
  stat_smooth(method = "lm",
              fullrange = TRUE) +
  coord_cartesian(xlim = c(0, 10),
                  ylim = c(0, 10))
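# Fit a regression through the origin: the "- 1" in the formula drops the intercept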
m0 <- lm(Sepal.Length ~ Petal.Length - 1,
         data = df)
summary(m0)
##
## Call:
## lm(formula = Sepal.Length ~ Petal.Length - 1, data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.25837 -0.48045  0.05331  0.61564  1.32082 
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## Petal.Length  1.25973    0.01248   100.9   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6209 on 99 degrees of freedom
## Multiple R-squared: 0.9904, Adjusted R-squared: 0.9903
## F-statistic: 1.019e+04 on 1 and 99 DF, p-value: < 2.2e-16
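# Observed vs. fitted values; points on the y = x line are predicted exactly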
augment(m0) %>%
  ggplot(aes(x = Sepal.Length,
             y = .fitted)) +
  geom_point(alpha = .5) +
  coord_cartesian(xlim = c(0, 10),
                  ylim = c(0, 10)) +
  geom_abline(intercept = 0,
              slope = 1)
tidy(m0) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped",
                                      "condensed",
                                      "responsive"))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| Petal.Length | 1.259728 | 0.0124817 | 100.9259 | 0 |
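# Refit with the intercept included (the default)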
m1 <- lm(Sepal.Length ~ Petal.Length,
         data = df)
summary(m1)
##
## Call:
## lm(formula = Sepal.Length ~ Petal.Length, data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.09194 -0.26570  0.00761  0.21902  0.87502 
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.99871    0.22593   13.27   <2e-16 ***
## Petal.Length  0.66516    0.04542   14.64   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3731 on 98 degrees of freedom
## Multiple R-squared: 0.6864, Adjusted R-squared: 0.6832
## F-statistic: 214.5 on 1 and 98 DF, p-value: < 2.2e-16
augment(m1) %>%
  ggplot(aes(x = Sepal.Length,
             y = .fitted)) +
  geom_point(alpha = .5) +
  coord_cartesian(xlim = c(0, 10),
                  ylim = c(0, 10)) +
  geom_abline(intercept = 0,
              slope = 1)
tidy(m1) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped",
                                      "condensed",
                                      "responsive"))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 2.9987107 | 0.2259275 | 13.27289 | 0 |
| Petal.Length | 0.6651629 | 0.0454190 | 14.64503 | 0 |
Why does \(R^2\) go down when we add a significant intercept term?
Ans. It doesn't, really. \(R^2\) and adjusted \(R^2\) are almost always intended to be used with an intercept in the model. When you drop the intercept, you are actually calculating a different quantity, which is not directly comparable to the \(R^2\) of models that include an intercept.
See here for more.
It helps to recall what \(R^2\) is trying to measure. In the first case, it compares your current model to a reference model that includes only an intercept (i.e., a constant term). In the second case there is no intercept, so it makes little sense to compare against such a model. Instead, \(R^2_0\) is computed, which implicitly uses a noise-only reference model that always predicts zero.
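To see this concretely, here is a minimal sketch that reproduces both reported values by hand (the names `y`, `rss0`, and `rss1` are just local helpers, not part of the models above):

y    <- df$Sepal.Length
rss0 <- sum(residuals(m0)^2)  # residual sum of squares, no-intercept model
rss1 <- sum(residuals(m1)^2)  # residual sum of squares, intercept model

# With an intercept, the reference model predicts mean(y):
1 - rss1 / sum((y - mean(y))^2)  # matches summary(m1)$r.squared

# Without an intercept, the implicit reference model predicts 0:
1 - rss0 / sum(y^2)              # matches summary(m0)$r.squared

The no-intercept denominator \(\sum y^2\) is much larger than \(\sum (y - \bar{y})^2\), which is why \(R^2_0\) looks so impressive.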
In general, never drop the intercept unless you have a very good reason to.
Side note: given the above, it's a little insane that Python's statsmodels does not include an intercept by default.