Intro and data

This tutorial shows how to use visreg to visualize fitted regression models. Visualizing a model makes it much clearer how the predictors (independent variables) affect the response variable, which is often not obvious from summary statements (and certainly not from p-values alone).

The data in this tutorial come from a study in the 1970s on body dimensions of children (see this website and search for ‘Snyder’). The dataset includes many variables, but here I use only foot length, age, height, and gender.

The data are stored online, which we can read directly with read.csv.

anthro <- read.csv("https://bitbucket.org/remkoduursma/hiermanual/raw/ef664d86d4877b2c2e821806f9d95e0f5bf8c8e4/data/anthropometry.csv")

# Makes a nice table in markdown, showing the first few rows.
knitr::kable(head(anthro), align="l")
age        gender   foot_length   height
4.166667   female   163           103.3
4.250000   male     163           103.9
4.416667   female   171           111.2
3.833333   female   163            99.7
3.416667   female   167            99.7
3.666667   male     164            98.7

A simple plot of the data shows that foot length is similar for males and females when they are young, but starts to diverge at around age 12-14: feet keep growing for boys, but stop growing for girls.

palette(c("grey","cornflowerblue"))
with(anthro, plot(age, foot_length, pch=19, col=gender, cex=0.6))

One numeric variable

With just one variable in our model, visreg is straightforward to interpret: the plotted symbols are simply the data.

lm0 <- lm(foot_length ~ age, data=anthro)

# Simple plot with the regression line
# abline does not do confidence intervals, and the line extends beyond the data
with(anthro, plot(age, foot_length, pch=19, col="grey"))
abline(lm0)
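
For comparison, here is a sketch of doing this by hand: predict with interval="confidence" over the observed range of age, then draw the bands with lines (the 100-point age_seq grid is an arbitrary choice).

# Add confidence bands by hand, restricted to the observed age range
age_seq <- seq(min(anthro$age, na.rm=TRUE), max(anthro$age, na.rm=TRUE), length=100)
ci <- predict(lm0, newdata=data.frame(age=age_seq), interval="confidence")
lines(age_seq, ci[,"lwr"], lty=2)
lines(age_seq, ci[,"upr"], lty=2)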

# This is much better
library(visreg)

visreg(lm0, "age")

# You can confirm for yourself that the plotted symbols are the actual data,
# since there is only one term in the model.
# (not run)
with(anthro, points(age, foot_length, pch=16, cex=0.1, col="red"))

Back-transforming

If you fitted your model on a transformed scale, you can use the trans argument to back-transform the predictions. To undo a log transformation, we use exp.

lm3 <- lm(log(foot_length) ~ log(age), data=anthro)

visreg(lm3, "age", trans=exp, partial=TRUE)
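
As a quick check (the three ages below are arbitrary), the back-transformed curve is just exp() applied to predictions made on the log scale:

# exp() of the log-scale predictions returns foot length in the original units
exp(predict(lm3, newdata=data.frame(age=c(5, 10, 15))))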

Two numeric variables

Partial residuals

With two variables in the model, visreg will plot the predicted response against a variable of interest, while holding the other variable at some constant value. This can be very informative when used correctly, and utterly confusing at other times.

In addition, visreg will plot the partial residuals, which are well explained in this detailed post, from which we learn that:

Ultimately what this whole partial residual/termplot thing is doing is splitting the response value into different parts: an overall mean, a term that is due to the first predictor, and a term that is due to second predictor. And you have the residual. So when we create the partial residual for the first predictor, what we’re doing is adding the first predictor term and the residual. This sum accounts for the part of the response not explained by the other terms.

(Clay Ford)
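
To make this decomposition concrete, here is a minimal sketch for a simple additive model (lm_add below is used only for illustration). The partial residual for age is the residual plus the term due to age; visreg’s partial residuals are built on the same idea, although its exact construction can differ, particularly when interactions are present.

# Partial residuals by hand for an additive model (illustration only)
lm_add <- lm(foot_length ~ age + height, data=anthro)
tt <- predict(lm_add, type="terms")        # per-term contributions (centered)
pr_age <- residuals(lm_add) + tt[, "age"]  # partial residual for age
plot(model.frame(lm_add)$age, pr_age, pch=16, cex=0.4, col="grey",
     xlab="age", ylab="partial residual for age")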

Let’s look at an example.

Height and age

# Age and height as predictors, and the interaction
lm1 <- lm(foot_length ~ age * height, data=anthro)

# The p-values alone tell us almost nothing
library(car)
Anova(lm1)
## Anova Table (Type II tests)
## 
## Response: foot_length
##            Sum Sq   Df F value    Pr(>F)    
## age         19881    1  252.85 < 2.2e-16 ***
## height     599219    1 7620.92 < 2.2e-16 ***
## age:height  11348    1  144.33 < 2.2e-16 ***
## Residuals  300910 3827                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# The usual diagnostics are not very informative either
qqPlot(lm1)

# Residuals vs. fitted
residualPlot(lm1)

A naive visreg call using only ‘age’ as the x-axis variable gives a rather mysterious plot. It is produced by setting height to its median value in the dataset and plotting the model prediction, together with the partial residuals. Also note the warning from visreg: an interaction is present, so a plot of a ‘main effect’ alone may be misleading.

# And this is rather mysterious!
visreg(lm1, "age")
## Please note that you are attempting to plot a 'main effect' in a
## model that contains an interaction.  This is potentially
## misleading; you may wish to consider using the 'by' argument.
## 
## Conditions used in construction of plot
## height: 141.3

It may take some time to make sense of this plot. When we plotted foot length against age earlier, we saw a clear increase with age, and when we inspect the summary of lm1 (you should), the coefficient for age is positive - so surely foot length should not decrease with age? The difference is that here we plotted the age effect with height held at a constant value. This tells us that, at a given height, children who were older (i.e. took longer to reach that height, and were thus slower-growing) have smaller feet. That’s pretty neat, and visreg helped us understand this pattern in the data.
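
As a quick check of this interpretation (a sketch; the ages 4 to 18 are arbitrary), we can predict foot length across age while holding height at the value visreg used (141.3, as printed above). These points should fall on the line in the plot.

# Predictions at constant height reproduce the visreg line
p_check <- predict(lm1, newdata=data.frame(age=4:18, height=141.3))
points(4:18, p_check)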

With two variables in your model, it will usually make much more sense to use both variables in the plot. The first option is to condition the above plot on the second numeric variable (in our case, height).

The lines in the following plot are fitted values from the model with height set to three constant values, chosen automatically by visreg (we can also set them ourselves, as I show further below).

visreg(lm1, "age", by="height", overlay=TRUE)

# Demonstrate that the lines are predictions from the model, at constant height:
p <- predict(lm1, newdata=data.frame(age=2:10, height=103.1))
points(2:10, p)

Note that the partial residuals in the plot are split between the three groups. Each observation is assigned to whichever of the three conditioning heights it is closest to (e.g. an observation with a height of 115cm is assigned to the first group, since it is closest to 103.1cm). (Note: as far as I know the documentation of visreg does not explain this, but it can be worked out, with some difficulty, from the code of the subsetV function in the package.)
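
A small sketch of that assignment rule. The three conditioning heights below are placeholders (103.1 is the lowest one, as used in the predict() check above; replace the other two with the values shown in the visreg legend).

# Nearest-value assignment of observations to the three conditioning heights
h <- na.omit(anthro$height)
heights_cond <- c(103.1, 140, 170)   # placeholder values
group_index <- apply(abs(outer(h, heights_cond, "-")), 1, which.min)
table(group_index)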

In this particular example, it may be more intuitive to make the opposite plot: foot length versus height, at three ages.

visreg(lm1, "height", by="age", overlay=TRUE)

Three-dimensional visreg

When you fit a linear regression model with two numeric variables, you are in effect fitting a surface to the data, as opposed to a line when you have one variable (for an additive model this surface is a plane; with the interaction in lm1 it is a gently twisted plane). This can be visualized with visreg2d, although perhaps not very effectively for this example.

visreg2d(lm1, "height", "age", plot.type="persp")
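
As with the lines earlier, we can convince ourselves that this surface is simply the model prediction over a grid of age and height. The grid ranges below are chosen by eye and are only meant as a sketch.

# Build the same surface by hand with predict() over a grid
ages <- seq(4, 18, length=25)
heights <- seq(95, 185, length=25)
z <- outer(heights, ages, function(h, a)
  predict(lm1, newdata=data.frame(age=a, height=h)))
persp(heights, ages, z, xlab="height", ylab="age", zlab="foot_length",
      theta=30, ticktype="detailed")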

Using rgl, you get a nice 3D plot you can rotate (try this yourself).

visreg2d(lm1, "height", "age", plot.type="rgl")

Multiple factor variables

This example shows the behaviour of visreg when you have two factor variables and one numeric variable in your model.

# Make an 'age class' with cut (result is a factor with three levels)
anthro$ageClass <- cut(anthro$age, c(0,7,12,20))


lm2 <- lm(foot_length ~ height*ageClass*gender, data=anthro)

visreg(lm2, "ageClass", by="gender", overlay=TRUE)

So where did height go, and why did visreg not complain? It is important to know how visreg made this plot: it sets height to the median value in the dataset, and then computes the fitted values for each combination of ageClass and gender. We can reproduce this with predict, like so:

# These values will be the ones used in the left-most group in the previous plot
predict(lm2, data.frame(height=median(anthro$height, na.rm=TRUE), 
                        gender=c("male","female"), 
                        ageClass="(0,7]"),
        interval="confidence")
##        fit      lwr      upr
## 1 221.1568 218.7091 223.6044
## 2 219.3327 216.8606 221.8049

The median height can be a reasonable choice, but we can condition on any value of the other variables in our model, using the cond argument:

visreg(lm2, "ageClass", by="gender", overlay=TRUE,
       cond=list(height=160))
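
As before, these fitted values can be reproduced with predict, now with height set to 160:

# These values should match the left-most group in the plot above
predict(lm2, data.frame(height=160,
                        gender=c("male","female"),
                        ageClass="(0,7]"),
        interval="confidence")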

Quadratic models

This example shows that visreg has no trouble with quadratic (or higher-order) terms in a linear model:

lm4 <- lm(foot_length ~ age*gender + I(age^2) + I(age^2):gender,
          data=anthro)

visreg(lm4, "age", by="gender", overlay=TRUE)
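
As an aside, the same model can be written more compactly: gender interacting with both the linear and quadratic age terms expands to the same set of terms, so the fitted values should be identical (lm4b below is only an illustration of this equivalence).

# An equivalent, more compact formula for the same model
lm4b <- lm(foot_length ~ gender * (age + I(age^2)), data=anthro)
all.equal(fitted(lm4), fitted(lm4b))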

Adjusting the plot

Changing the colours of the symbols and so on is possible, but note that these options are documented in ?plot.visreg (the help file ?visreg points us there). For example, we can adjust the symbols, lines, and confidence regions:

ugly_cols <- c("green","red")
visreg(lm4, "age", by="gender", overlay=TRUE,
       legend=FALSE, rug=FALSE,
       line.par=list(col=ugly_cols),
       points.par=list(col=ugly_cols),
       fill.par=list(col="lightgrey"))

Using visreg with other models

We can use visreg with many models beyond lm, including loess and gam (shown below), but also glm, lmer, lme, and others.

loess

A loess smoother is easy to fit, but we cannot add a factor variable (e.g. gender) to this model.

# span (the smoothness) has to be set by the user
loes1 <- loess(foot_length ~ age, data=anthro, span=0.5)
visreg(loes1, "age")

GAM

A generalized additive model (GAM) is a semi-parametric model that can fit splines, include factor variables, and choose the appropriate amount of smoothness with an automatic algorithm. We can use visreg here as well:

library(mgcv)
gam1 <- gam(foot_length ~ s(age, by=gender), data=anthro)

visreg(gam1, "age", by="gender", overlay=TRUE)

Alternatively, we can use the built-in plot method for a gam, with a simple trick (par(new=TRUE)) to show both smooths in the same plot. The drawback of this plot is that the model intercept is excluded (you can see this from coef(gam1)), so the y-axis is not on the scale of foot length.

plot(gam1, select=1, shade=TRUE, ylim=c(-80,80), rug=FALSE,
     xlab="Age (yrs)", ylab="Foot length (mm)")
par(new=TRUE)
plot(gam1, select=2, shade=TRUE, ylim=c(-80,80), rug=FALSE, ann=FALSE)
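
One possible workaround is the shift argument of plot.gam, which adds a constant to each smooth before plotting; adding the model intercept puts the curve on (approximately) the scale of foot length. A sketch, shown for the first smooth only (overlaying both would also require matching ylim values):

# Shift the smooth by the model intercept to plot on the response scale
b0 <- coef(gam1)["(Intercept)"]
plot(gam1, select=1, shade=TRUE, shift=b0, rug=FALSE,
     xlab="Age (yrs)", ylab="Foot length (mm)")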

Segmented regression

As a bonus topic, the data seem to lend themselves very well to a breakpoint analysis. Using the segmented package, we can fit this type of model, which gives a direct estimate of the age at which the growth rate of the feet changes, separately for boys and girls.

library(segmented)

# First step is to fit a linear model. I do this for both genders separately.
lm_male <- lm(foot_length ~ age, data=anthro, subset=gender=="male")
lm_female <- lm(foot_length ~ age, data=anthro, subset=gender=="female")

# Second step is to let segmented decide where the optimal breakpoint is.
# If you are looking for just one breakpoint, you don't have to state how many to look for.
seg_male <- segmented(lm_male, seg.Z = ~age)
seg_female <- segmented(lm_female, seg.Z = ~age)

# Finally we can use the built-in plot method
# See ?plot.segmented for options.
palette(c("grey","cornflowerblue"))
with(anthro, plot(age, foot_length, pch=16, col=gender, cex=0.8))

plot(seg_male, add=TRUE, col="black", rug=FALSE)
plot(seg_female, add=TRUE, col="black", rug=FALSE)

Finally, the estimated breakpoints and their confidence intervals can be found with the confint method for segmented fits, as in the following example. This gives the nice short summary “Girls’ feet stop growing at age 12”.

confint(seg_male)
## $age
##   Est. CI(95%).l CI(95%).u
##  14.17     13.72     14.61
confint(seg_female)
## $age
##   Est. CI(95%).l CI(95%).u
##  12.01     11.78     12.25
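
The slopes of the segments before and after the breakpoint can be extracted with slope() from the segmented package; for girls, the slope after the breakpoint should be close to zero, consistent with the summary above.

# Slopes of the two segments for each gender (see ?slope in segmented)
slope(seg_female)
slope(seg_male)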