This demonstrates the 0+ syntax in R's formula mini-language (the notation you use to define models with the lm function).
First, go get your data and check it out.
remotes::install_github("allisonhorst/palmerpenguins")
## Skipping install of 'palmerpenguins' from a github remote, the SHA1 (aee26b51) has not changed since last install.
## Use `force = TRUE` to force installation
library(palmerpenguins)
data(package = 'palmerpenguins')
data(penguins)
names(penguins)
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex"
levels(penguins$species)
## [1] "Adelie" "Chinstrap" "Gentoo"
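Without 0+, R treats one of the species as a reference group. By default this is the first factor level, which here is Adelie. The intercept is the reference group's predicted bill length, and each remaining coefficient is the difference between another species and the reference group.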
mod1 <- lm(bill_length_mm~species, data=penguins)
summary(mod1)
##
## Call:
## lm(formula = bill_length_mm ~ species, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9338 -2.2049 0.0086 2.0662 12.0951
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.7914 0.2409 161.05 <2e-16 ***
## speciesChinstrap 10.0424 0.4323 23.23 <2e-16 ***
## speciesGentoo 8.7135 0.3595 24.24 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.96 on 339 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7078, Adjusted R-squared: 0.7061
## F-statistic: 410.6 on 2 and 339 DF, p-value: < 2.2e-16
The predicted bill length for the Adelie species:
coef(mod1)[1]
## (Intercept)
## 38.79139
The predicted bill length for the Chinstrap species:
coef(mod1)[1] + coef(mod1)[2]
## (Intercept)
## 48.83382
The predicted bill length for the Gentoo species:
coef(mod1)[1] + coef(mod1)[3]
## (Intercept)
## 47.50488
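The same three numbers can also be obtained with predict(), which does the coefficient arithmetic for us. This is a minimal sketch, assuming the palmerpenguins package is installed; new_species is a name made up for this example.

```r
library(palmerpenguins)

# Refit the reference-group model so this block runs on its own
mod1 <- lm(bill_length_mm ~ species, data = penguins)

# One row per species; new_species is a hypothetical helper name
new_species <- data.frame(
  species = factor(levels(penguins$species), levels = levels(penguins$species))
)

# predict() adds the intercept and the matching species offset for each row
pred1 <- predict(mod1, newdata = new_species)
names(pred1) <- levels(penguins$species)
pred1
```

Labeling the result with the species names avoids the slightly misleading (Intercept) label that the manual sums carry over from coef(mod1)[1].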
With 0+, we drop the intercept, so there is no reference group. Each coefficient is simply the predicted bill length for that species.
mod2 <- lm(bill_length_mm~ 0+ species, data=penguins)
summary(mod2)
##
## Call:
## lm(formula = bill_length_mm ~ 0 + species, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9338 -2.2049 0.0086 2.0662 12.0951
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## speciesAdelie 38.7914 0.2409 161.0 <2e-16 ***
## speciesChinstrap 48.8338 0.3589 136.1 <2e-16 ***
## speciesGentoo 47.5049 0.2669 178.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.96 on 339 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.9956, Adjusted R-squared: 0.9955
## F-statistic: 2.538e+04 on 3 and 339 DF, p-value: < 2.2e-16
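One way to see what 0+ changes is to compare the columns R builds for each formula. The sketch below uses model.matrix() on the two right-hand sides; X_ref and X_all are names chosen for this example.

```r
library(palmerpenguins)

# Default coding: an intercept column plus one indicator per non-reference species
X_ref <- model.matrix(~ species, data = penguins)
colnames(X_ref)

# With 0 +: no intercept column, one indicator per species
X_all <- model.matrix(~ 0 + species, data = penguins)
colnames(X_all)

# First few rows of each matrix
head(X_ref, 3)
head(X_all, 3)
```

In the first matrix every penguin has a 1 in the intercept column, so Adelie is the species with no indicator of its own. In the second matrix each penguin has a 1 in exactly one species column.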
Here are the predicted bill lengths for each species:
coef(mod2)
## speciesAdelie speciesChinstrap speciesGentoo
## 38.79139 48.83382 47.50488
You can see the predictions end up being the same either way. The difference is only in how you want to talk about them. mod1 is particularly useful if the goal is to compare Chinstrap and Gentoo against Adelie. mod2 is particularly useful if we don’t care about the comparison (if we just care about each species’ bill length).
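A quick check of that claim, as a sketch that assumes the palmerpenguins package is installed. It refits both models, compares their fitted values, and also computes the per-species sample means (group_means is a name chosen for this example), since with species as the only predictor each prediction is just that species' mean bill length.

```r
library(palmerpenguins)

# Refit both parameterizations so this block runs on its own
mod1 <- lm(bill_length_mm ~ species, data = penguins)
mod2 <- lm(bill_length_mm ~ 0 + species, data = penguins)

# Compare the fitted value for every penguin under the two models
all.equal(fitted(mod1), fitted(mod2))

# Per-species sample means, ignoring the missing bill lengths
group_means <- tapply(penguins$bill_length_mm, penguins$species, mean, na.rm = TRUE)
group_means

# Coefficients of the no-intercept model, for comparison with the means
coef(mod2)
```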
It can be useful to say, “I am 5'9" and my brother is 6'.” Or it can be useful to say, “I am 5'9" and my brother is 3 inches taller than me.” The two statements say the same thing (you can figure out that my brother is 6 feet tall by adding 3 inches to 5 foot 9). The second statement (“I am 5'9" and my brother is 3 inches taller than me.”) uses me as a reference group. The first statement (“I am 5'9" and my brother is 6'.”) has no such reference group. Finally, we could make my brother the reference group by saying, “My brother is 6 feet tall and I’m 3 inches shorter than him.”
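The last version of the height example, where the reference group is switched, has a direct counterpart in R. One way to do it is relevel(), shown here as a sketch that makes Gentoo the reference species; penguins_gentoo and mod3 are names made up for this example, and the original penguins data is left untouched.

```r
library(palmerpenguins)

# Work on a copy so the original factor ordering is preserved
penguins_gentoo <- penguins

# Move Gentoo to the first factor level, which lm() uses as the reference group
penguins_gentoo$species <- relevel(penguins_gentoo$species, ref = "Gentoo")

mod3 <- lm(bill_length_mm ~ species, data = penguins_gentoo)

# The intercept is now the Gentoo prediction; the other two coefficients
# are differences from Gentoo
coef(mod3)
```

The predictions for each species are unchanged. Only the group that the other coefficients are measured against is different.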