Subset data for linear fit

# packages used below (skip if already loaded earlier in the document)
library(tidyverse)   # dplyr, ggplot2, %>%
library(broom)       # augment(), tidy(), glance()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

df <- iris %>% filter(Petal.Length > 2)
# str(df)

df %>% 
  ggplot(aes(x = Petal.Length, 
             y = Sepal.Length)) + 
  geom_point(alpha = 0.3) + 
  stat_smooth(method = lm, 
              fullrange = TRUE) + 
  coord_cartesian(xlim = c(0,10), 
                  ylim = c(0,10)) 

Model without intercept

m0 <- lm(Sepal.Length ~ Petal.Length - 1, 
         data = df)
summary(m0)
## 
## Call:
## lm(formula = Sepal.Length ~ Petal.Length - 1, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.25837 -0.48045  0.05331  0.61564  1.32082 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## Petal.Length  1.25973    0.01248   100.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6209 on 99 degrees of freedom
## Multiple R-squared:  0.9904, Adjusted R-squared:  0.9903 
## F-statistic: 1.019e+04 on 1 and 99 DF,  p-value: < 2.2e-16

augment(m0) %>% 
  ggplot(aes(x = Sepal.Length, 
             y = .fitted)) + 
  geom_point(alpha = .5) + 
  coord_cartesian(xlim = c(0,10), 
                  ylim = c(0,10)) + 
  geom_abline(intercept = 0, 
              slope = 1)

tidy(m0) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped",
                                      "condensed", 
                                      "responsive"))
term           estimate   std.error   statistic   p.value
Petal.Length   1.259728   0.0124817   100.9259    0

Model with intercept

m1 <- lm(Sepal.Length ~ Petal.Length, 
         data = df)
summary(m1)
## 
## Call:
## lm(formula = Sepal.Length ~ Petal.Length, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.09194 -0.26570  0.00761  0.21902  0.87502 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.99871    0.22593   13.27   <2e-16 ***
## Petal.Length  0.66516    0.04542   14.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3731 on 98 degrees of freedom
## Multiple R-squared:  0.6864, Adjusted R-squared:  0.6832 
## F-statistic: 214.5 on 1 and 98 DF,  p-value: < 2.2e-16

augment(m1) %>% 
  ggplot(aes(x = Sepal.Length, 
             y = .fitted)) + 
  geom_point(alpha = .5) + 
  coord_cartesian(xlim = c(0,10), 
                  ylim = c(0,10)) +
  geom_abline(intercept = 0, 
              slope = 1)

tidy(m1) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped",
                                      "condensed", 
                                      "responsive"))
term           estimate    std.error   statistic   p.value
(Intercept)    2.9987107   0.2259275   13.27289    0
Petal.Length   0.6651629   0.0454190   14.64503    0

Discussion

Why does \(R^2\) go down when adding a significant intercept term?

Ans. It doesn’t, really. \(R^2\) and adjusted \(R^2\) are intended to be used with an intercept in the model. When you drop the intercept, the software is actually calculating a different quantity, one that is not directly comparable to the \(R^2\) of a model that includes an intercept.

See here for more.

It helps to recall what \(R^2\) is trying to measure. In the first case (a model with an intercept), it compares your current model against a reference model that contains only an intercept (i.e. a constant term). In the second case there is no intercept, so it makes little sense to compare against such a reference. Instead, \(R^2_0\) is computed, which implicitly uses a reference model corresponding to noise only, i.e. one that predicts zero for every observation.
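
To make this concrete, here is a minimal sketch of the two definitions computed by hand from the fits above (it assumes df, m0 and m1 from the earlier chunks; the values should reproduce the two summary() outputs):

# With an intercept, the reference model predicts mean(y)
ss_tot  <- sum((df$Sepal.Length - mean(df$Sepal.Length))^2)
ss_res1 <- sum(residuals(m1)^2)
1 - ss_res1 / ss_tot    # ~0.686, as reported for m1

# Without an intercept, the reference model predicts 0,
# so the total sum of squares is taken about zero, not about the mean
ss_tot0 <- sum(df$Sepal.Length^2)
ss_res0 <- sum(residuals(m0)^2)
1 - ss_res0 / ss_tot0   # ~0.990, as reported for m0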

In general, never drop the intercept unless you have very good reason to.
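
If you do want to compare the two fits, use a measure that is computed the same way for both models. As a quick sketch (assuming broom and dplyr are loaded), glance() puts them side by side; here both the residual standard error (0.62 vs 0.37 above) and AIC should favour the model with an intercept despite its lower \(R^2\):

bind_rows(
  glance(m0) %>% mutate(model = "no intercept"),
  glance(m1) %>% mutate(model = "with intercept")
) %>% 
  select(model, r.squared, sigma, AIC)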

Side note: given the above, it’s a little insane that statsmodels in Python does not include an intercept by default.