Model Terms, Formulas & Coefficients

Model Formulas and Coefficients

Let’s fit some models and work through the calculations “manually”. We’ll start by loading the “CPS85” dataset:

library(mosaic)
library(mosaicData)
data('CPS85')

Here is a simple model of “wage” as a function of “sex” and “age”:

mod = lm(wage ~ sex + age, data = CPS85)

We can have a look at the coefficients, and we get the following values.

coef(mod)
## (Intercept)        sexM         age 
##  4.65424744  2.27469091  0.08521512

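The values can be ‘plugged in’ to create a simple linear equation for the predicted wage. Here “sex” is an indicator that equals 1 for males and 0 for females, which is what the coefficient name “sexM” signals: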

4.65 + 2.274*sex + 0.0852*age

Now that we have this equation, we can start asking questions. What is the predicted wage for a female of age 40? A male of age 30? We can start by working through the arithmetic “by hand”:

4.65 + 2.274*0 + 0.0852*40  # for the 40-year old female
## [1] 8.058
4.65 + 2.274*1 + 0.0852*30  # for the 30-year old male
## [1] 9.48

We can also turn the model information into a mathematical function (thanks to the “makeFun” function in the “mosaic” package):

f = makeFun(mod)
f(sex = "F", age = 40) # for the 40-year old female
##        1 
## 8.062852
f(sex = "M", age = 30) # for the 30-year old male
##        1 
## 9.485392
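
The hand calculations differ from the “makeFun” results in the third decimal place because the coefficients were rounded before being plugged in. As a quick check, the sketch below repeats the arithmetic with the full-precision values copied from the coef(mod) output above. The names b, wage_f40 and wage_m30 are only illustrative.

```r
# Coefficients copied from the coef(mod) output above (full printed precision)
b <- c(intercept = 4.65424744, sexM = 2.27469091, age = 0.08521512)

# sexM is an indicator: 0 for females, 1 for males
wage_f40 <- b[["intercept"]] + b[["sexM"]] * 0 + b[["age"]] * 40
wage_m30 <- b[["intercept"]] + b[["sexM"]] * 1 + b[["age"]] * 30

round(c(female_40 = wage_f40, male_30 = wage_m30), 6)
```

These values should agree with the ones returned by f() to the printed precision.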

Coefficients of basic model designs

To interpret coefficients correctly, it helps to see how they relate to some basic model designs. Often, the interpretation for more complicated designs is not too different. To illustrate how basic model designs apply generally, we will use A to stand for a generic response variable, B to stand for a quantitative explanatory variable, and G for a categorical explanatory variable. As usual, 1 refers to the intercept. We will use the “Galton” and “CPS85” datasets in the examples:

data("Galton")
data("CPS85") # technically, this has already been loaded

Model A ~ 1

This is the simplest possible model, where there are no explanatory variables and the only model term is the intercept. In this case, the coefficient of the model is the mean of A.

coef(lm(height ~ 1, data=Galton)) # Galton height response
## (Intercept) 
##    66.76069
coef(lm(wage ~ 1, data=CPS85)) # CPS wage response
## (Intercept) 
##    9.024064
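
One way to confirm this is to compare each intercept with the sample mean computed directly. This is a minimal sketch in base R, assuming the “mosaicData” package is installed; the names int_height and int_wage are illustrative.

```r
library(mosaicData)  # provides the Galton and CPS85 data frames

# The intercept of an intercept-only model is the sample mean of the response
int_height <- coef(lm(height ~ 1, data = Galton))[["(Intercept)"]]
int_wage   <- coef(lm(wage ~ 1, data = CPS85))[["(Intercept)"]]

# Both comparisons should return TRUE
all.equal(int_height, mean(Galton$height))
all.equal(int_wage, mean(CPS85$wage))
```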

Model A ~ 1 + G

This is also a pretty simple model: a categorical variable G along with the intercept. The categorical variable can be thought of as splitting the data into groups, one for each level of the variable. The coefficients are closely related to the group-wise means, but they are expressed as differences rather than as the means themselves. The intercept is the mean of the reference group, and each remaining coefficient is the difference between its group’s mean and the reference group’s mean:

coef(lm(height ~ 1 + sex, data=Galton)) # Galton height response with sex categorical variable
## (Intercept)        sexM 
##   64.110162    5.118656
coef(lm(wage ~ sex - 1, data=CPS85)) # CPS data, intercept has been suppressed
##     sexF     sexM 
## 7.878857 9.994913

In the second example, we have suppressed the intercept (the “- 1” in the formula). Now there is no reference group, and each coefficient is simply the corresponding group mean, as a direct calculation confirms:

mean(wage~sex, data=CPS85)
##        F        M 
## 7.878857 9.994913
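
The two forms of the model carry the same information. As a small sketch using the CPS85 group means printed above (the variable names are illustrative), the coefficients of the version with an intercept can be recovered from the group means:

```r
# Group means of wage by sex, copied from the output above
means <- c(F = 7.878857, M = 9.994913)

# With an intercept, the reference group (F) supplies the intercept,
# and the sexM coefficient is the difference from that reference mean
intercept <- means[["F"]]
sexM <- means[["M"]] - means[["F"]]

c("(Intercept)" = intercept, sexM = sexM)
```

Fitting lm(wage ~ 1 + sex, data = CPS85) should reproduce these two values, up to rounding.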

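For categorical variables with more than two levels, things remain pretty much the same, although we get a lot more coefficients. By default R uses the first level of the factor as the reference group (here “clerical”), so that level gets no coefficient of its own and its mean appears as the intercept: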

coef(lm(wage ~ sector, data=CPS85)) # intercept (1) will be included by default
##   (Intercept)   sectorconst   sectormanag   sectormanuf   sectorother 
##     7.4225773     2.0794227     5.2814227     0.6134521     1.0780109 
##    sectorprof   sectorsales sectorservice 
##     4.5248513     0.1700543    -0.8851074
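
To turn these differences back into group means, add the intercept to each coefficient. The sketch below does this with the values copied from the output above. The label “reference” stands for whichever sector has no coefficient of its own, and the names b and sector_means are illustrative.

```r
# Coefficients copied from the output above
b <- c("(Intercept)" = 7.4225773, const = 2.0794227, manag = 5.2814227,
       manuf = 0.6134521, other = 1.0780109, prof = 4.5248513,
       sales = 0.1700543, service = -0.8851074)

# The reference group mean is the intercept; every other group mean is
# the intercept plus that group's coefficient
sector_means <- c(reference = b[["(Intercept)"]],
                  b[["(Intercept)"]] + b[-1])

round(sector_means, 3)
```

The result can be compared with mean(wage ~ sector, data = CPS85).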

Model A ~ 1 + B

This is the basic straight line relationship. The two coefficients in this case are the intercept and the slope of the line. The slope tells us what change in A corresponds to a one-unit change in B:

coef(lm(wage ~ 1 + educ, data=CPS85)) # educ = years of education (numeric)
## (Intercept)        educ 
##  -0.7459797   0.7504608
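
To see what the slope means in practice, the sketch below compares predictions one year of education apart, using the coefficients copied from the output above. The helper predict_wage is a hypothetical name introduced only for this illustration.

```r
# Coefficients copied from the output above
b0 <- -0.7459797  # intercept: predicted wage at educ = 0
b1 <-  0.7504608  # slope: change in wage per extra year of education

# Hypothetical helper that evaluates the fitted line by hand
predict_wage <- function(educ) b0 + b1 * educ

# One extra year of education changes the prediction by exactly the slope
predict_wage(13) - predict_wage(12)
```

The intercept is the predicted wage at zero years of education, which is far from typical values of “educ”, so it is better read as the position of the line than as a realistic wage.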

Model A ~ 1 + G + B

Now we are getting a bit more complicated. This model gives a straight line relationship between A and B, but allows for different lines for each level of the categorical variable G. The lines are all parallel and differ only in their intercepts:

coef(lm(wage ~ 1 + educ + sex, data=CPS85))
## (Intercept)        educ        sexM 
##  -1.9062255   0.7512834   2.1240567
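
Written out per group, the fitted model gives two lines that share a slope. A minimal sketch, using the coefficients copied from the output above (line_F and line_M are illustrative names):

```r
# Coefficients copied from the output above
b <- c(intercept = -1.9062255, educ = 0.7512834, sexM = 2.1240567)

# Same slope for both groups; only the intercept shifts for males
line_F <- c(intercept = b[["intercept"]], slope = b[["educ"]])
line_M <- c(intercept = b[["intercept"]] + b[["sexM"]], slope = b[["educ"]])

rbind(F = line_F, M = line_M)
```

The “sexM” coefficient is the vertical gap between the two lines at every value of “educ”.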

Model A ~ 1 + G + B + G:B

This is once again a straight line model, but this time the slopes of the lines may differ between groups. The interaction term between G and B measures how much each group’s slope differs from the slope of the reference group.

coef(lm(wage ~ 1 + educ + sex + educ:sex, data=CPS85))
## (Intercept)        educ        sexM   educ:sexM 
##  -3.2658785   0.8556754   4.3704499  -0.1725303

Here is where interpreting coefficients gets a bit more complicated. In the above example, an extra year of education is associated with an increase in wages of about 85.6 cents per hour for women. For men, however, the relationship is weaker: an extra year of education is associated with an increase in wages of only 0.856 - 0.173 = 0.683 dollars, or 68.3 cents, per hour.
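
The same bookkeeping can be done for both the intercepts and the slopes. The sketch below uses the coefficients copied from the output above; the name educ_sexM stands in for the “educ:sexM” coefficient, and line_F and line_M are illustrative.

```r
# Coefficients copied from the output above
b <- c(intercept = -3.2658785, educ = 0.8556754,
       sexM = 4.3704499, educ_sexM = -0.1725303)

# Females are the reference group: intercept and educ apply directly
line_F <- c(intercept = b[["intercept"]], slope = b[["educ"]])

# Males: sexM shifts the intercept, educ:sexM shifts the slope
line_M <- c(intercept = b[["intercept"]] + b[["sexM"]],
            slope = b[["educ"]] + b[["educ_sexM"]])

rbind(F = line_F, M = line_M)
```

Because the slopes differ, the gap between the two lines is no longer constant: the “sexM” coefficient on its own only describes the gap at educ = 0.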

Reference

This demo is based directly on material from ‘Statistical Modeling: A Fresh Approach (2nd Edition)’ by Daniel Kaplan.