Formula mini-language and reference groups

This demonstrates the 0+ syntax in R's formula mini-language. (The formula mini-language is what you use to specify models for functions such as lm.)

First, install the palmerpenguins package, load the penguins data, and take a look at it.

remotes::install_github("allisonhorst/palmerpenguins")

## Skipping install of 'palmerpenguins' from a github remote, the SHA1 (aee26b51) has not changed since last install.
##   Use `force = TRUE` to force installation

library(palmerpenguins)
data(package = 'palmerpenguins')

data(penguins)
names(penguins)

## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"

levels(penguins$species)

## [1] "Adelie"    "Chinstrap" "Gentoo"

Without 0+, lm chooses one of the species to be the reference group. By default this is the first factor level, here Adelie. The coefficients for the other species are differences from this reference group.

mod1 <- lm(bill_length_mm~species, data=penguins)
summary(mod1)

## 
## Call:
## lm(formula = bill_length_mm ~ species, data = penguins)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9338 -2.2049  0.0086  2.0662 12.0951 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       38.7914     0.2409  161.05   <2e-16 ***
## speciesChinstrap  10.0424     0.4323   23.23   <2e-16 ***
## speciesGentoo      8.7135     0.3595   24.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.96 on 339 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7078, Adjusted R-squared:  0.7061 
## F-statistic: 410.6 on 2 and 339 DF,  p-value: < 2.2e-16

The predicted bill length for the Adelie species:

coef(mod1)[1]

## (Intercept) 
##    38.79139

The predicted bill length for the Chinstrap species:

coef(mod1)[1] + coef(mod1)[2]

## (Intercept) 
##    48.83382

The predicted bill length for the Gentoo species:

coef(mod1)[1] + coef(mod1)[3]

## (Intercept) 
##    47.50488
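Adding coefficients by hand works, but the same numbers can also be obtained with predict. The sketch below is only an illustration: it uses a small made-up data frame called toy (not the penguins data) so that it runs without installing anything, and the bill lengths in it are invented.

```r
# Hypothetical toy data: three species, three invented bill lengths each.
toy <- data.frame(
  species = factor(rep(c("Adelie", "Chinstrap", "Gentoo"), each = 3)),
  bill_length_mm = c(38, 39, 40, 48, 49, 50, 46, 47, 48)
)

# Same kind of model as mod1: Adelie (the first level) is the reference group.
fit_ref <- lm(bill_length_mm ~ species, data = toy)

# One row per species, using the same factor levels as the fitted data.
new_species <- data.frame(
  species = factor(levels(toy$species), levels = levels(toy$species))
)

# predict adds the intercept and the right species coefficient for us.
pred <- predict(fit_ref, newdata = new_species)
names(pred) <- levels(toy$species)
pred
```

Each value in pred is the intercept plus that species' coefficient, which for this model is just the species' mean bill length in the toy data.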

With 0+, we have no reference group. We just get the predicted bill length for each species.

mod2 <- lm(bill_length_mm~ 0+ species, data=penguins)
summary(mod2)

## 
## Call:
## lm(formula = bill_length_mm ~ 0 + species, data = penguins)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9338 -2.2049  0.0086  2.0662 12.0951 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## speciesAdelie     38.7914     0.2409   161.0   <2e-16 ***
## speciesChinstrap  48.8338     0.3589   136.1   <2e-16 ***
## speciesGentoo     47.5049     0.2669   178.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.96 on 339 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.9956, Adjusted R-squared:  0.9955 
## F-statistic: 2.538e+04 on 3 and 339 DF,  p-value: < 2.2e-16
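One way to see what 0+ changes is to look at the design matrix that R builds from each formula. This is a minimal sketch on a made-up data frame called toy (invented values, not the penguins data); model.matrix is the base R function that turns a formula into the columns lm actually fits.

```r
# Hypothetical toy data: three species, three invented bill lengths each.
toy <- data.frame(
  species = factor(rep(c("Adelie", "Chinstrap", "Gentoo"), each = 3)),
  bill_length_mm = c(38, 39, 40, 48, 49, 50, 46, 47, 48)
)

# With an intercept: a column of 1s plus indicator columns for Chinstrap and
# Gentoo. Adelie rows are 0 in both indicators, so Adelie is the reference.
X_ref <- unique(model.matrix(~ species, data = toy))
X_ref

# With 0+: no intercept column, and one indicator column per species.
X_noref <- unique(model.matrix(~ 0 + species, data = toy))
X_noref
```

In the first matrix the Adelie row is picked up by the intercept alone. In the second, every species has its own column, so every coefficient is a species mean rather than a difference.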

Here are the predicted bill lengths for each species:

coef(mod2)

##    speciesAdelie speciesChinstrap    speciesGentoo 
##         38.79139         48.83382         47.50488

You can see that the predictions are the same either way; the two models differ only in how they express those predictions. (The R-squared and F-statistic differ because, without an intercept, R measures them against zero rather than against the overall mean.) mod1 is particularly useful if the goal is to compare Chinstrap and Gentoo against Adelie. mod2 is particularly useful if we don’t care about the comparison and just want each species’ bill length.
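The claim that both parameterizations give the same predictions can be checked directly. The following is a small sketch on a made-up data frame called toy (invented values, not the penguins data); fit_ref and fit_noref are illustrative names standing in for mod1 and mod2.

```r
# Hypothetical toy data: three species, three invented bill lengths each.
toy <- data.frame(
  species = factor(rep(c("Adelie", "Chinstrap", "Gentoo"), each = 3)),
  bill_length_mm = c(38, 39, 40, 48, 49, 50, 46, 47, 48)
)

fit_ref   <- lm(bill_length_mm ~ species, data = toy)      # reference group
fit_noref <- lm(bill_length_mm ~ 0 + species, data = toy)  # no reference group

# The fitted values agree observation by observation.
same_fits <- isTRUE(all.equal(fitted(fit_ref), fitted(fit_noref)))
same_fits

# Rebuild the no-reference coefficients from the reference-group ones:
# intercept, intercept + Chinstrap difference, intercept + Gentoo difference.
b <- coef(fit_ref)
rebuilt <- unname(c(b[1], b[1] + b[2], b[1] + b[3]))
same_coefs <- isTRUE(all.equal(rebuilt, unname(coef(fit_noref))))
same_coefs
```

all.equal is used instead of == because the two fits can differ by tiny floating-point amounts.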

It can be useful to say: “I am 5'9" and my brother is 6'.” Or it can be useful to say: “I am 5'9" and my brother is 3 inches taller than me.” The two statements say the same thing (you can figure out that my brother is 6 feet tall by adding 3 inches to 5'9"). The second statement uses me as the reference group. The first statement has no reference group. Finally, we could make my brother the reference group by saying: “My brother is 6 feet tall and I’m 3 inches shorter than him.”
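Making “my brother” the reference group has a direct analogue in R: reorder the factor so that a different level comes first. Here is a minimal sketch using base R's relevel on a made-up data frame called toy (invented values, not the penguins data); the column name species_gentoo_ref is only illustrative.

```r
# Hypothetical toy data: three species, three invented bill lengths each.
toy <- data.frame(
  species = factor(rep(c("Adelie", "Chinstrap", "Gentoo"), each = 3)),
  bill_length_mm = c(38, 39, 40, 48, 49, 50, 46, 47, 48)
)

# Move Gentoo to the front of the factor levels so it becomes the reference.
toy$species_gentoo_ref <- relevel(toy$species, ref = "Gentoo")
levels(toy$species_gentoo_ref)

# The intercept is now the Gentoo mean; the other two coefficients are
# differences from Gentoo rather than from Adelie.
fit_gentoo <- lm(bill_length_mm ~ species_gentoo_ref, data = toy)
coef(fit_gentoo)
```

The predictions for each species are unchanged; only the group that the others are compared against is different.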

Christina Knudson, PhD

7/11/2020