Oftentimes, we expect that some X will influence some Y in a different way depending on some third variable, Z. If this is the theoretical expectation, we need to explicitly model the interaction between X and Z in our regression.
This R tutorial will cover the following:
We will use a basic model to demonstrate how to estimate, interpret, and graph interactive results. We work from a built-in R file containing information on home selling prices in Jacksonville, FL, along with a variety of variables about each home. Our dependent variable in this model will be home selling price, and we will use 4 independent variables: number of bedrooms (Beds), number of bathrooms (Baths), square footage (Size), and whether the home is a new build (New).
In reality, there are many additional factors that would influence home selling price but for example purposes let’s assume this is an appropriately specified model.
It is important to think about your theoretical expectations for the results prior to estimation. One easy way to do this, especially when Z is dichotomous like we have here (new build = 1; older home = 0), is to think about how X influences Y at the two different levels of Z. For instance, assume we are interested in how the size of the home influences selling price. It is possible that the impact of square footage on selling price is the same for new builds and older homes. However, it is also possible that the size of the home has a different effect on selling price for new builds than for older homes.
With a non-interactive regression model, we would theoretically be arguing that whether a house is a new build or not will not change the relationship between the size of the home and the selling price. However, with an interactive regression model, we would theoretically be arguing that the influence of the size of the home on selling price will be different for new or old builds. For the interactive argument, we have to explicitly include an interaction term in our regression model to test that specific theory.
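In equation form, the two arguments can be sketched as follows. The variable names match the model estimated below; the β coefficients and the error term ε are notation introduced here for illustration only.

```latex
\begin{align}
\text{Non-interactive:}\quad \text{Price} &= \beta_0 + \beta_1\,\text{Beds} + \beta_2\,\text{Baths} + \beta_3\,\text{Size} + \beta_4\,\text{New} + \varepsilon \\
\text{Interactive:}\quad \text{Price} &= \beta_0 + \beta_1\,\text{Beds} + \beta_2\,\text{Baths} + \beta_3\,\text{Size} + \beta_4\,\text{New} + \beta_5\,(\text{Size}\times\text{New}) + \varepsilon \\
\frac{\partial\,\text{Price}}{\partial\,\text{Size}} &= \beta_3 + \beta_5\,\text{New} \quad \text{(interactive model)}
\end{align}
```

In the interactive model, the slope on Size is β3 for older homes (New = 0) and β3 + β5 for new builds (New = 1). The non-interactive model is the special case where β5 is fixed at zero, so the slope is the same for both groups.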
Below, you see two models. The first model is a non-interactive linear regression while the second model is an interactive model. The only difference between those two lines of code is the + between Size and New in the non-interactive model and the * in the interactive one. By changing the plus sign to *, we are telling R that we want to estimate an interactive regression between those two independent variables.
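As a side note on the formula syntax, `*` in an R formula is shorthand for both main effects plus their interaction, which can also be written out with `:`. The sketch below checks this on simulated data; the data frame `sim` and its coefficients are made up for illustration and are not the tutorial's housing data.

```r
# Simulated data for illustration only (not the tutorial's housing data)
set.seed(42)
n <- 200
sim <- data.frame(
  Size = runif(n, 700, 4000),   # square footage
  New  = rbinom(n, 1, 0.3)      # 1 = new build, 0 = older home
)
sim$Price <- 20000 + 100 * sim$Size + 5000 * sim$New +
  50 * sim$Size * sim$New + rnorm(n, sd = 30000)

# Shorthand: Size * New expands to both main effects plus the interaction
fit_star <- lm(Price ~ Size * New, data = sim)

# Longhand: main effects and the interaction term written out explicitly
fit_colon <- lm(Price ~ Size + New + Size:New, data = sim)

# Both formulas produce the same terms and the same estimates
names(coef(fit_star))
same_fit <- isTRUE(all.equal(coef(fit_star), coef(fit_colon)))
same_fit
```

Writing only `Size:New` without the main effects would fit a different model, which is one reason the `*` shorthand is usually the safer choice.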
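library(stargazer)

# Non-interactive model
lm1 <- lm(Price ~ Beds + Baths + Size + New, data = data)
# Interactive model: Size*New adds the Size:New interaction term
lmx <- lm(Price ~ Beds + Baths + Size*New, data = data)
stargazer(lm1, lmx, type = "text", digits = 3)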
##
## =================================================================
##                                 Dependent variable:
##                     ---------------------------------------------
##                                        Price
##                              (1)                    (2)
## -----------------------------------------------------------------
## Beds                      -8,202.384             -5,079.731
##                          (10,449.840)           (10,155.700)
##
## Baths                     5,273.777              7,632.521
##                          (13,080.170)           (12,663.010)
##
## Size                      118.119***             103.258***
##                            (12.323)               (13.036)
##
## New                     54,562.380***           -80,071.280
##                          (19,214.890)           (51,615.070)
##
## Size:New                                         61.730***
##                                                   (22.083)
##
## Constant                 -28,849.220            -19,806.980
##                          (27,261.160)           (26,531.010)
##
## -----------------------------------------------------------------
## Observations                 100                    100
## R2                          0.725                  0.746
## Adjusted R2                 0.713                  0.732
## Residual Std. Error 54,253.210 (df = 95)   52,406.220 (df = 94)
## F Statistic         62.472*** (df = 4; 95) 55.126*** (df = 5; 94)
## =================================================================
## Note:                                 *p<0.1; **p<0.05; ***p<0.01
In the stargazer table above, we immediately notice a few things. First, notice how in the non-interactive model (column 1) both the new build and size variables are significant, whereas in the interactive model (column 2) the new build variable is no longer significant. From the non-interactive model, we would conclude that the average impact of the size of the home on selling price is an increase of roughly $118 for each additional square foot. We would also conclude that new homes, holding size and the number of bedrooms and bathrooms constant, sell for on average $54,562 more than older builds.
Next, look at the interactive model in column 2. We see the same non-significance on the number of bedrooms and bathrooms as in the non-interactive model, but we also see significance on the size of the home variable and on the interaction between size and new build status. The significant coefficient on the interaction term indicates that the influence of the size of the home on selling price differs depending on whether the home is a new build.
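To make the interaction coefficient concrete, the short sketch below computes the slope on Size separately for older homes and new builds. It uses only the point estimates copied from column 2 of the table; the object names are chosen here for illustration.

```r
# Coefficients copied from column (2) of the regression table above
b_size     <- 103.258     # Size
b_new      <- -80071.280  # New
b_size_new <- 61.730      # Size:New

# Slope of Size for older homes (New = 0) and for new builds (New = 1)
slope_old <- b_size + b_size_new * 0
slope_new <- b_size + b_size_new * 1

c(old = slope_old, new = slope_new)
```

Under these estimates, each additional square foot is associated with roughly $103 more in selling price for an older home and roughly $165 more for a new build. Note also that in an interactive model the coefficient on New (here `b_new`) is the new-versus-old difference for a hypothetical home with Size = 0, so its non-significance is not by itself evidence that new builds do not sell for more.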
With interactive models, it is important to plot the results to better understand how the Z variable is influencing the relationship between your X and Y variables. We will use the ggpredict function from the ggeffects package to easily plot the results from our regression.
Before we plot the interactive results, we will look at the non-interactive results. In the ggpredict code, we must specify the name of the model we want to plot, lm1, and the specific independent variables we want to include as well. Here that would be the Size variable and the New variable since these are the two variables included in the interaction term. We explicitly specify the values we want our Size variable to take on. Here we start at a 700 square foot home and increase in steps of 200 square feet up to 4,000.
Following this, we use ggplot to graph the predicted home selling price at each specified size level, with a separate line for each value of the new build variable. Note, we typically want to include confidence intervals on these plots but here they are removed to simplify the results visualization.
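For readers curious about what a prediction grid like this involves, here is a rough base-R analogue. It is a sketch under stated assumptions rather than a description of ggpredict's internals: the data frame `sim` is simulated stand-in data, and holding Beds and Baths at their sample means is an assumption about how the non-focal variables are fixed.

```r
# Simulated stand-in data (hypothetical; not the tutorial's housing data)
set.seed(1)
n <- 100
sim <- data.frame(
  Beds  = sample(2:5, n, replace = TRUE),
  Baths = sample(1:3, n, replace = TRUE),
  Size  = runif(n, 700, 4000),
  New   = rbinom(n, 1, 0.3)
)
sim$Price <- -20000 + 110 * sim$Size + 50000 * sim$New + rnorm(n, sd = 50000)

fit <- lm(Price ~ Beds + Baths + Size + New, data = sim)

# Grid of focal values: Size in steps of 200 starting at 700, for New = 0 and 1.
# Non-focal predictors are held at their sample means (an assumption).
grid <- expand.grid(
  Size  = seq(700, 4000, by = 200),
  New   = c(0, 1),
  Beds  = mean(sim$Beds),
  Baths = mean(sim$Baths)
)
grid$predicted <- predict(fit, newdata = grid)

head(grid)
```

Each row of `grid` is one point on one of the two plotted lines: a size, a new-build status, and the model's predicted price with the other variables held fixed.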
library(ggeffects)
library(ggplot2)

non_int <- ggpredict(lm1, terms = c("Size [700:4000, by=200]", "New")) # Manually set break points
non_int

# One line per value of New; confidence intervals are omitted here
p1 <- ggplot(non_int, aes(x = x, y = predicted, color = factor(group), group = factor(group))) +
  geom_line() +
  labs(title = "Predicted Home Selling Price",
       x = "Square Footage",
       y = "Selling Price",
       color = "New Home") +
  scale_color_manual(values = c("0" = "blue", "1" = "red"),
                     labels = c("0" = "Old", "1" = "New")) +
  theme_minimal()
p1
Notice how the two lines run parallel to each other. This is because this is a non-interactive model. We are theoretically saying with this model that the relationship between the square footage of a home and its selling price is the same for new builds as it is for old builds. The only difference between the two in this model is the intercept: new builds, on average and holding everything else in the model constant, sell for about $54K more than old builds.
What we know from the significant coefficient on the interaction term in the interactive model is that the relationship between the size of a home and its selling price is statistically different for new builds versus old builds. Now let’s graph the interactive model results.
We use the exact same code as above to do this except we specify the interactive model, lmx, instead of the non-interactive model name.
interactive <- ggpredict(lmx, terms = c("Size [700:4000, by=200]", "New"))
interactive

# Lines only, matching the non-interactive plot above
p2 <- ggplot(interactive, aes(x = x, y = predicted, color = factor(group), group = factor(group))) +
  geom_line() +
  labs(title = "Predicted Home Selling Price X with New Home Build or Not",
       x = "Square Footage",
       y = "Predicted Selling Price",
       color = "New Home or Not") +
  scale_color_manual(values = c("0" = "blue", "1" = "red"),
                     labels = c("0" = "Old", "1" = "New")) +
  theme_minimal()

# Same plot with confidence bands added via geom_ribbon()
ggplot(interactive, aes(x = x, y = predicted, color = factor(group), group = factor(group))) +
  geom_line() +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.05) +
  labs(title = "Predicted Home Selling Price X with New Home Build or Not with Confidence Intervals",
       x = "Square Footage",
       y = "Predicted Selling Price",
       color = "New Home or Not") +
  scale_color_manual(values = c("0" = "blue", "1" = "red"),
                     labels = c("0" = "Old", "1" = "New")) +
  theme_minimal()

p2
Notice in the predicted home selling price plot here that the slopes of the two lines are different. This is because of the significant interaction term in the interactive model. At smaller square footage, there is no clear difference in selling price between new and older homes. However, as the size of the home increases, new builds, on average and holding everything else constant, sell for significantly more money than older homes.
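The widening gap can also be worked out directly from the point estimates in column 2 of the regression table. The sketch below is illustrative only: the helper `gap_at` and the chosen sizes are introduced here, and it ignores uncertainty, which is what the confidence bands in the plot convey.

```r
# Coefficients copied from column (2) of the regression table
b_new      <- -80071.280  # New
b_size_new <- 61.730      # Size:New

# Predicted new-minus-old price gap at a given size, other variables held fixed
gap_at <- function(size) b_new + b_size_new * size

sizes <- c(1000, 2000, 3000, 4000)
data.frame(Size = sizes, Gap = gap_at(sizes))

# Size at which the two fitted lines cross (predicted gap equals zero)
crossover <- -b_new / b_size_new
crossover
```

By these estimates the two fitted lines cross at roughly 1,300 square feet, and the predicted premium for a new build grows by about $62 for every additional square foot beyond that point.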
With this data, leaving out the interaction term would misrepresent the true relationship between square footage and home selling price because it would assume that the influence of size is the same regardless of whether the home is a new build.
To finish, let’s look at the two plots side-by-side. Here we use the patchwork package to combine the two plots we created above.
library(patchwork)

# Place the non-interactive and interactive plots side by side
combined_plot <- p1 + p2 + plot_layout(ncol = 2)
combined_plot
In this tutorial, we reviewed a basic interactive regression model, how to interpret the raw regression results, and how to graph the model. Generally, anytime you have a significant or borderline significant interaction term, you should graph the results to better understand what is occurring. Here, we had clear visual evidence, as well as the significant interaction term, that the relationship between home size and home selling price varies based on whether the home is new or not.