Subset data for linear fit

# packages used below (skip if already loaded earlier in the document)
library(tidyverse)   # dplyr, ggplot2, %>%
library(broom)       # augment(), tidy(), glance()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

df <- iris %>% filter(Petal.Length > 2)
# str(df)

df %>% 
  ggplot(aes(x = Petal.Length, 
             y = Sepal.Length)) + 
  geom_point(alpha = 0.3) + 
  stat_smooth(method = lm, 
              fullrange = TRUE) + 
  coord_cartesian(xlim = c(0,10), 
                  ylim = c(0,10)) 

Model without intercept

m0 <- lm(Sepal.Length ~ Petal.Length - 1, 
         data = df)
summary(m0)
## 
## Call:
## lm(formula = Sepal.Length ~ Petal.Length - 1, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.25837 -0.48045  0.05331  0.61564  1.32082 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## Petal.Length  1.25973    0.01248   100.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6209 on 99 degrees of freedom
## Multiple R-squared:  0.9904, Adjusted R-squared:  0.9903 
## F-statistic: 1.019e+04 on 1 and 99 DF,  p-value: < 2.2e-16

augment(m0) %>% 
  ggplot(aes(x = Sepal.Length, 
             y = .fitted)) + 
  geom_point(alpha = .5) + 
  coord_cartesian(xlim = c(0,10), 
                  ylim = c(0,10)) + 
  geom_abline(intercept = 0, 
              slope = 1)

tidy(m0) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped",
                                      "condensed", 
                                      "responsive"))
term           estimate   std.error   statistic   p.value
Petal.Length   1.259728   0.0124817   100.9259    0

Model with intercept

m1 <- lm(Sepal.Length ~ Petal.Length, 
         data = df)
summary(m1)
## 
## Call:
## lm(formula = Sepal.Length ~ Petal.Length, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.09194 -0.26570  0.00761  0.21902  0.87502 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.99871    0.22593   13.27   <2e-16 ***
## Petal.Length  0.66516    0.04542   14.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3731 on 98 degrees of freedom
## Multiple R-squared:  0.6864, Adjusted R-squared:  0.6832 
## F-statistic: 214.5 on 1 and 98 DF,  p-value: < 2.2e-16

augment(m1) %>% 
  ggplot(aes(x = Sepal.Length, 
             y = .fitted)) + 
  geom_point(alpha = .5) + 
  coord_cartesian(xlim = c(0,10), 
                  ylim = c(0,10)) +
  geom_abline(intercept = 0, 
              slope = 1)

tidy(m1) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped",
                                      "condensed", 
                                      "responsive"))
term           estimate    std.error   statistic   p.value
(Intercept)    2.9987107   0.2259275   13.27289    0
Petal.Length   0.6651629   0.0454190   14.64503    0

Discussion

Why does \(R^2\) go down when adding a significant intercept term?

Ans. It doesn’t, really. \(R^2\) and adjusted \(R^2\) are intended to be used with an intercept in the model. When you drop the intercept, the software is actually calculating a different quantity, one that is not directly comparable to the \(R^2\) of a model that includes an intercept.

See here for more.

It helps to recall what \(R^2\) is trying to measure. In the first case (a model with an intercept), it compares your current model against a reference model that contains only an intercept (i.e. a constant term). In the second case there is no intercept, so it makes little sense to compare against such a reference. Instead, \(R^2_0\) is computed, which implicitly uses a reference model corresponding to noise only, i.e. one that predicts zero for every observation.
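
To make this concrete, here is a minimal sketch of the two definitions computed by hand from the fits above (it assumes df, m0 and m1 from the earlier chunks; the values should reproduce the two summary() outputs):

# With an intercept, the reference model predicts mean(y)
ss_tot  <- sum((df$Sepal.Length - mean(df$Sepal.Length))^2)
ss_res1 <- sum(residuals(m1)^2)
1 - ss_res1 / ss_tot    # ~0.686, as reported for m1

# Without an intercept, the reference model predicts 0,
# so the total sum of squares is taken about zero, not about the mean
ss_tot0 <- sum(df$Sepal.Length^2)
ss_res0 <- sum(residuals(m0)^2)
1 - ss_res0 / ss_tot0   # ~0.990, as reported for m0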

In general, never drop the intercept unless you have very good reason to.
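
If you do want to compare the two fits, use a measure that is computed the same way for both models. As a quick sketch (assuming broom and dplyr are loaded), glance() puts them side by side; here both the residual standard error (0.62 vs 0.37 above) and AIC should favour the model with an intercept despite its lower \(R^2\):

bind_rows(
  glance(m0) %>% mutate(model = "no intercept"),
  glance(m1) %>% mutate(model = "with intercept")
) %>% 
  select(model, r.squared, sigma, AIC)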

Side note: given the above, it’s a little insane that statsmodels in Python does not include an intercept by default.