Let’s fit some models and work through the calculations “manually”. We’ll start by loading the “CPS85” dataset:
library(mosaic)
library(mosaicData)
data('CPS85')
Here is a simple model of “wage” as a function of “sex” and “age”:
mod = lm(wage ~ sex + age, data = CPS85)
We can look at the fitted coefficients:
coef(mod)
## (Intercept) sexM age
## 4.65424744 2.27469091 0.08521512
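The (rounded) values can be ‘plugged in’ to create a simple linear equation. Because the coefficient is named “sexM”, sex enters the equation as 1 for males and 0 for females (the reference level):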
4.65 + 2.274*sex + 0.0852*age
Now that we have this equation, we can start asking questions. What is the predicted wage for a female of age 40? A male of age 30? We can start by working through the arithmetic “by hand”:
4.65 + 2.274*0 + 0.0852*40 # for the 40-year old female
## [1] 8.058
4.65 + 2.274*1 + 0.0852*30 # for the 30-year old male
## [1] 9.48
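We can also turn the model information into a mathematical function (thanks to the “makeFun” function in the “mosaic” package). The function uses the unrounded coefficients, so its results differ slightly from the hand calculations above: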
f = makeFun(mod)
f(sex = "F", age = 40) # for the 40-year old female
## 1
## 8.062852
f(sex = "M", age = 30) # for the 30-year old male
## 1
## 9.485392
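As a cross-check, the same predictions can be reproduced without “makeFun”. The sketch below repeats the hand arithmetic with the unrounded coefficients from coef() and compares it with base R's predict(). The object names (b, by_hand, new_people, from_predict) are illustrative choices, not part of the original demo.

```r
library(mosaicData)  # provides the CPS85 data
data("CPS85")

mod <- lm(wage ~ sex + age, data = CPS85)
b <- coef(mod)

# Same arithmetic as the hand calculation, but with unrounded coefficients
by_hand <- c(
  female_40 = unname(b["(Intercept)"] + b["sexM"] * 0 + b["age"] * 40),
  male_30   = unname(b["(Intercept)"] + b["sexM"] * 1 + b["age"] * 30)
)

# predict() evaluates the fitted model at new data points
new_people <- data.frame(
  sex = factor(c("F", "M"), levels = levels(CPS85$sex)),
  age = c(40, 30)
)
from_predict <- unname(predict(mod, newdata = new_people))

by_hand
from_predict
```

Both vectors should agree with the output of f() above, which suggests that the small gap to the hand calculation comes only from rounding the coefficients.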
To interpret coefficients correctly, it helps to see how they relate to some basic model designs. Often, the interpretation for more complicated designs is not too different. To illustrate how basic model designs apply generally, we will use A to stand for a generic response variable, B to stand for a quantitative explanatory variable, and G for a categorical explanatory variable. As usual, 1 refers to the intercept. We are going to start by using the “Galton” height and “CPS85” wage datasets in the examples:
data("Galton")
data("CPS85") # technically, this has already been loaded
The model A ~ 1 is the simplest possible model: there are no explanatory variables and the only model term is the intercept. In this case, the single coefficient of the model is the mean of A.
coef(lm(height ~ 1, data=Galton)) # Galton height response
## (Intercept)
## 66.76069
coef(lm(wage ~ 1, data=CPS85)) # CPS wage response
## (Intercept)
## 9.024064
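A quick way to convince yourself of this is to compare the intercept with the plain sample mean. This is a minimal sketch; the names fit_height, fit_wage and check are illustrative.

```r
library(mosaicData)  # provides Galton and CPS85
data("Galton")
data("CPS85")

fit_height <- lm(height ~ 1, data = Galton)
fit_wage   <- lm(wage ~ 1, data = CPS85)

# Intercept-only coefficient next to the sample mean of the response
check <- rbind(
  height = c(coefficient = unname(coef(fit_height)), mean = mean(Galton$height)),
  wage   = c(coefficient = unname(coef(fit_wage)),   mean = mean(CPS85$wage))
)
check
```

The two columns should match row by row.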
The model A ~ 1 + G is also pretty simple: a categorical variable G along with the intercept. The categorical variable can be thought of as splitting the data into groups, one for each level of the variable. The coefficients encode the group-wise means, but as differences rather than as one mean per group. The intercept is the reference group mean, and each remaining coefficient is the difference between its group’s mean and the reference group’s mean:
coef(lm(height ~ 1 + sex, data=Galton)) # Galton height response with sex categorical variable
## (Intercept) sexM
## 64.110162 5.118656
coef(lm(wage ~ sex - 1, data=CPS85)) # CPS data, intercept has been suppressed
## sexF sexM
## 7.878857 9.994913
In this second example, we have suppressed the intercept with “- 1”. Now there is no reference group, and each coefficient is simply the mean of its group:
mean(wage~sex, data=CPS85)
## F M
## 7.878857 9.994913
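The two parameterizations carry the same information. As a sketch (the names b, rebuilt and group_means are illustrative), the group means can be rebuilt from the with-intercept coefficients by adding the difference to the reference mean:

```r
library(mosaicData)  # provides the CPS85 data
data("CPS85")

# Fit with the intercept, so sexM is a difference from the reference group (F)
b <- coef(lm(wage ~ 1 + sex, data = CPS85))

# Reference mean, and reference mean plus the difference
rebuilt <- c(
  F = unname(b["(Intercept)"]),
  M = unname(b["(Intercept)"] + b["sexM"])
)

# Group means computed directly with base R
group_means <- tapply(CPS85$wage, CPS85$sex, mean)

rebuilt
group_means
```

Both results should agree with the intercept-free coefficients shown above.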
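For categorical variables with more than two levels, things remain pretty much the same, although we get a lot more coefficients. The one level that does not appear among the coefficients is the reference group, whose mean is the intercept: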
coef(lm(wage ~ sector, data=CPS85)) # intercept (1) will be included by default
## (Intercept) sectorconst sectormanag sectormanuf sectorother
## 7.4225773 2.0794227 5.2814227 0.6134521 1.0780109
## sectorprof sectorsales sectorservice
## 4.5248513 0.1700543 -0.8851074
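The same reconstruction works with many levels. The following sketch (with illustrative names fit, b, rebuilt and group_means) looks up the reference level and turns the coefficients back into one mean per sector:

```r
library(mosaicData)  # provides the CPS85 data
data("CPS85")

fit <- lm(wage ~ sector, data = CPS85)
b <- coef(fit)

# The first factor level is the reference group under R's default coding
levels(CPS85$sector)[1]

# Intercept alone for the reference group, intercept + difference for the rest
rebuilt <- unname(b[1] + c(0, b[-1]))
names(rebuilt) <- levels(CPS85$sector)

# Direct group means, in the same level order
group_means <- tapply(CPS85$wage, CPS85$sector, mean)

round(rbind(rebuilt, group_means), 3)
```

This assumes R's default treatment contrasts; a different contrast setting would change what the individual coefficients mean.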
The model A ~ 1 + B is the basic straight line relationship. The two coefficients in this case are the intercept and the slope of the line. The slope tells us what change in A corresponds to a one-unit change in B:
coef(lm(wage ~ 1 + educ, data=CPS85)) # educ = years of education (numeric)
## (Intercept) educ
## -0.7459797 0.7504608
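To see the slope interpretation in action, the sketch below predicts the wage at two education levels one year apart and takes the difference. The values 12 and 13 are arbitrary illustrative choices, as are the names fit, p and one_year_change.

```r
library(mosaicData)  # provides the CPS85 data
data("CPS85")

fit <- lm(wage ~ 1 + educ, data = CPS85)

# Predicted wage at 12 and at 13 years of education
p <- predict(fit, newdata = data.frame(educ = c(12, 13)))

# The change for one extra year equals the slope coefficient
one_year_change <- unname(diff(p))
one_year_change
coef(fit)["educ"]
```

Any other pair of values one unit apart would give the same difference, since the model is a straight line.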
Now we are getting a bit more complicated. The model A ~ 1 + B + G gives a straight line relationship between A and B, but allows for a different line for each level of the categorical variable G. The lines are all parallel and differ only in their intercepts:
coef(lm(wage ~ 1 + educ + sex, data=CPS85))
## (Intercept) educ sexM
## -1.9062255 0.7512834 2.1240567
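One way to check that the lines are parallel is to compare the predictions for men and women at several education levels: the gap should be the same everywhere and equal to the sexM coefficient. This is a sketch; educ_grid, pred_for and gap are hypothetical helper names, and the grid values are arbitrary.

```r
library(mosaicData)  # provides the CPS85 data
data("CPS85")

fit <- lm(wage ~ 1 + educ + sex, data = CPS85)

educ_grid <- c(8, 12, 16)  # arbitrary education levels to evaluate

# Predictions along the grid for one sex at a time
pred_for <- function(s) {
  new_data <- data.frame(
    educ = educ_grid,
    sex  = factor(s, levels = levels(CPS85$sex))
  )
  unname(predict(fit, newdata = new_data))
}

# Vertical distance between the two lines at each grid point
gap <- pred_for("M") - pred_for("F")
gap
coef(fit)["sexM"]
```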
The model A ~ 1 + B + G + G:B is once again a straight line model, but this time the slopes of the different lines may differ. The interaction term between G and B says how the slope for each group differs from the slope for the reference group.
coef(lm(wage ~ 1 + educ + sex + educ:sex, data=CPS85))
## (Intercept) educ sexM educ:sexM
## -3.2658785 0.8556754 4.3704499 -0.1725303
Here is where interpreting coefficients gets a bit more complicated. In the above example, the “educ” coefficient is the slope for the reference group (women): an extra year of education is associated with an increase in wages of about 85.6 cents per hour. For men, the relationship is weaker: the interaction coefficient adjusts the slope, so an increase in education of one year is associated with an increase in wages of only 0.856 - 0.173 = 0.683 dollars, or 68.3 cents, per hour.
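The per-group slopes can be computed directly from the coefficients. As a sketch (slopes, slope_F and slope_M are illustrative names), the code below also cross-checks them against separate straight-line fits for women and men, which should give the same slopes because the interaction model lets each group have its own intercept and slope.

```r
library(mosaicData)  # provides the CPS85 data
data("CPS85")

fit <- lm(wage ~ 1 + educ + sex + educ:sex, data = CPS85)
b <- coef(fit)

# Slope for the reference group (F), and slope adjusted by the interaction (M)
slopes <- c(
  F = unname(b["educ"]),
  M = unname(b["educ"] + b["educ:sexM"])
)
slopes

# Cross-check: fit a separate line within each group
slope_F <- unname(coef(lm(wage ~ educ, data = subset(CPS85, sex == "F")))["educ"])
slope_M <- unname(coef(lm(wage ~ educ, data = subset(CPS85, sex == "M")))["educ"])
c(F = slope_F, M = slope_M)
```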
This demo is based directly on material from ‘Statistical Modeling: A Fresh Approach (2nd Edition)’ by Daniel Kaplan.