The Data

Recall the iris dataset with measurements about 150 iris flowers:

iris

The reader will recall that Species, a categorical variable, has three levels: setosa, versicolor, and virginica. We consider two models for predicting Sepal.Length (chosen only because it’s conveniently the first variable in the data frame).

First Model: Predicting With a Categorical Variable

Here’s a model predicting Sepal.Length based only upon a single categorical variable: Species. What does the model do? Simple: it predicts the mean for each species! Take a look at the model coefficients:

speciesMod <- lm(data=iris, 
                 Sepal.Length~Species)

pander(summary(speciesMod)) 
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.006 0.0728 68.76 1.134e-113
Speciesversicolor 0.93 0.103 9.033 8.77e-16
Speciesvirginica 1.582 0.103 15.37 2.215e-32
Fitting linear model: Sepal.Length ~ Species
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
150 0.5148 0.6187 0.6135

Note a seeming deficiency in the model: only versicolor and virginica have coefficients. What happened to setosa?

In our “dummy variable” model, one level of the category is absorbed into the intercept as a baseline (here, setosa is our baseline, chosen only by alphabetical order). Other levels (in this case versicolor and virginica) are expressed in binary (which is to say either \(x =\) 0 or \(x =\) 1) relative to the baseline.

Specifically: the mean for setosa Sepal.Length is 5.006cm, and since the coefficient for versicolor is 0.9300 we know that the mean Sepal.Length for versicolor irises is 0.9300cm larger than than the mean for setosas: 5.9306cm. A similar adjustment is made for virginica irises, which have a mean Sepal.Lengthof 6.588cm

All together, the model:

\[\hat{y}=5.006+0.930 x_1 + 1.582 x_2\] Here, \(x_1\) is a binary “dummy variable” that indicates versicolor irises and \(x_2\) is a binary variable that indicates virginica plants.

Those are the coefficients of the model, but what does it look like?

iris %>% ggplot(aes(Sepal.Width, Sepal.Length, color=Species))+
  geom_point()+
  geom_abline(intercept = c(5.0060, 5.0060+.9300, 5.0060+1.5820),
              slope=0, color=c("red", "green3", "blue"))

The picture is clear: regressing on a categorical variable simply predicts the mean for each category!

Regressing on Two Variables: Categorical and Quantitative

Ok, what if we use an additional quantitative variable? For example, what if we predict Sepal.Length based on both Species and Sepal.Length?

speciesAndSepalWidthMod <- lm(data=iris,
                              Sepal.Length~Sepal.Width+Species)

summary(speciesAndSepalWidthMod) %>% pander(justify="center")
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.251 0.3698 6.089 9.568e-09
Sepal.Width 0.8036 0.1063 7.557 4.187e-12
Speciesversicolor 1.459 0.1121 13.01 3.478e-26
Speciesvirginica 1.947 0.1 19.47 2.094e-42
Fitting linear model: Sepal.Length ~ Sepal.Width + Species
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
150 0.438 0.7259 0.7203

Observe that Sepal.Width is a numerical variable, and its coefficient is a slope. All other coefficients merely affect the intercept. Take a look:

qplot(data=iris, 
      x=Sepal.Width, y=Sepal.Length, 
      color=Species)+
  geom_abline(intercept = c(2.2514,2.2514+1.4587,2.2514+1.9486),
              slope=.8036, color=c("red", "green3", "blue"))

Using Multiple Variables

Just for fun, we consider a model for Sepal.Length using every other variable in the iris data frame.

finalModel <- lm(data=iris, Sepal.Length~.)
summary(finalModel) %>% pander
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.171 0.2798 7.76 1.43e-12
Sepal.Width 0.4959 0.08607 5.761 4.868e-08
Petal.Length 0.8292 0.06853 12.1 1.074e-23
Petal.Width -0.3152 0.1512 -2.084 0.03889
Speciesversicolor -0.7236 0.2402 -3.013 0.00306
Speciesvirginica -1.023 0.3337 -3.067 0.002584
Fitting linear model: Sepal.Length ~ .
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
150 0.3068 0.8673 0.8627

Here, our model is

\[ \hat{y} = 2.171 - 0.7236x_1 -1.023x_2+0.4959x_3+0.8292x_4-0.3152x_5 \] where \(x_1\) and \(x_2\) are dummy variables for versicolor and virginica, and \(x_3\), \(x_4\), and \(x_5\) are the quantitative measurements for Sepal.Width, Petal.Length, and Petal.Width.

Since a four-dimensional visualization of the model is inconvenient, we defer to a diagnostic plot:

autoplot(finalModel)

We see that 86.27% of the total variation in Sepal.Length can be explained by the relationship with the other variables Sepal.Width, Petal.Length, Petal.Width, and Species.

Observe that our residuals don’t exhibit any “bad behavior”, in particular: