Interaction Effects in Regression Analysis

Oftentimes, we expect that some X will influence some Y in a different way depending on some third variable, Z. If this is the theoretical expectation, we need to explicitly model the interaction between X and Z in our regression.

This R tutorial uses a basic model to demonstrate how to estimate, interpret, and graph interactive results. We work with a built-in R file containing information on home selling prices in Jacksonville, FL, along with a variety of variables about each home. Our dependent variable is home selling price, and we use four independent variables:

  1. Number of bedrooms
  2. Number of bathrooms
  3. Size of home in square feet
  4. New build or not

In reality, many additional factors would influence home selling price, but for example purposes let’s assume this is an appropriately specified model.

It is important to think about your theoretical expectations for the results prior to estimation. One easy way to do this, especially when Z is dichotomous like we have here (new build = 1; older home = 0), is to think about how X influences Y at the two different levels of Z. For instance, assume we are interested in how the size of the home influences selling price. It is possible that the impact of square footage on selling price is the same for new builds and older homes. However, it is also possible to imagine a world where the size of the home has a different effect on selling price for new builds than for older homes.

With a non-interactive regression model, we would theoretically be arguing that whether a house is a new build or not will not change the relationship between the size of the home and the selling price. However, with an interactive regression model, we would theoretically be arguing that the influence of the size of the home on selling price will be different for new or old builds. For the interactive argument, we have to explicitly include an interaction term in our regression model to test that specific theory.
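The two arguments can be written out as equations. The block below is a sketch in generic notation: the beta symbols and the error term are labels assumed here for illustration and do not appear in the original text. The first line is the non-interactive model, the second adds the product term, and the third shows why the effect of Size in the interactive model depends on New.

```latex
\begin{align*}
\text{Price} &= \beta_0 + \beta_1\,\text{Beds} + \beta_2\,\text{Baths} + \beta_3\,\text{Size} + \beta_4\,\text{New} + \varepsilon \\
\text{Price} &= \beta_0 + \beta_1\,\text{Beds} + \beta_2\,\text{Baths} + \beta_3\,\text{Size} + \beta_4\,\text{New} + \beta_5\,(\text{Size} \times \text{New}) + \varepsilon \\
\frac{\partial\,\text{Price}}{\partial\,\text{Size}} &= \beta_3 + \beta_5\,\text{New} \quad \text{(interactive model)}
\end{align*}
```

For an older home (New = 0) the slope on Size is \beta_3, and for a new build (New = 1) it is \beta_3 + \beta_5. In the non-interactive model the slope is \beta_3 for both groups.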

Below, you see two models. The first is a non-interactive linear regression, while the second is an interactive model. The only difference between those two lines of code is the + before New in the non-interactive model and the * in the interactive one. By changing the plus sign to *, we are telling R that we want to estimate an interaction between those two independent variables, along with each variable on its own.
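To see what the * operator expands to, one can inspect the terms R builds from a formula. This is a minimal base-R sketch: it reuses only the variable names from the models in this tutorial, needs no data set, and the object names (f_star, f_long, and so on) are made up for illustration.

```r
# Formula using the * operator, as in the interactive model
f_star <- Price ~ Beds + Baths + Size * New

# The same model with both main effects and the product term written out
f_long <- Price ~ Beds + Baths + Size + New + Size:New

# Term labels R derives from each formula
labels_star <- attr(terms(f_star), "term.labels")
labels_long <- attr(terms(f_long), "term.labels")

print(labels_star)
print(identical(labels_star, labels_long))
```

Both formulas produce the same set of terms, which is why the interactive model's output contains separate rows for Size, New, and Size:New.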

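library(stargazer)  # formats regression tables
# Non-interactive model: Size and New enter additively
lm1 <- lm(Price ~ Beds + Baths + Size + New, data = data)
# Interactive model: Size*New adds the Size:New product term
lmx <- lm(Price ~ Beds + Baths + Size*New, data = data)
stargazer(lm1, lmx, type = "text", digits = 3)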
## 
## =================================================================
##                                  Dependent variable:             
##                     ---------------------------------------------
##                                         Price                    
##                              (1)                    (2)          
## -----------------------------------------------------------------
## Beds                      -8,202.384             -5,079.731      
##                          (10,449.840)           (10,155.700)     
##                                                                  
## Baths                     5,273.777              7,632.521       
##                          (13,080.170)           (12,663.010)     
##                                                                  
## Size                      118.119***             103.258***      
##                            (12.323)               (13.036)       
##                                                                  
## New                     54,562.380***           -80,071.280      
##                          (19,214.890)           (51,615.070)     
##                                                                  
## Size:New                                         61.730***       
##                                                   (22.083)       
##                                                                  
## Constant                 -28,849.220            -19,806.980      
##                          (27,261.160)           (26,531.010)     
##                                                                  
## -----------------------------------------------------------------
## Observations                 100                    100          
## R2                          0.725                  0.746         
## Adjusted R2                 0.713                  0.732         
## Residual Std. Error  54,253.210 (df = 95)   52,406.220 (df = 94) 
## F Statistic         62.472*** (df = 4; 95) 55.126*** (df = 5; 94)
## =================================================================
## Note:                                 *p<0.1; **p<0.05; ***p<0.01

In the stargazer table above, we immediately notice a few things. First, notice that the new build variable is significant in the non-interactive model but non-significant in the interactive model, while the size variable is significant in both. From the non-interactive model in column 1, we would conclude that the average impact of home size on selling price is an increase of roughly $118 for each additional square foot. We would also conclude that new homes, holding size and the number of bedrooms and bathrooms constant, sell for on average $54,562 more than older builds.

Next, look at the interactive model in column 2. We see the same non-significance on the number of bedrooms and bathrooms as in the non-interactive model, but we also see significance on the size of the home variable and on the interaction between size and new build. The significant coefficient on the interaction term indicates that the influence of home size on selling price differs depending on whether the home is a new build.
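The interactive coefficients are easier to read once they are combined into a slope for each group. The sketch below copies the point estimates from column 2 of the table and does the arithmetic by hand; the variable names are made up for illustration, and nothing here uses the fitted model object. Note that in the interactive model the Size coefficient is the slope for older homes only, and the New coefficient is the new-build gap at a size of zero square feet, which is outside the range of real homes.

```r
# Point estimates copied from column (2) of the stargazer table
b_size     <- 103.258     # Size: slope when New = 0
b_new      <- -80071.280  # New: new-vs-old gap when Size = 0
b_size_new <- 61.730      # Size:New: extra slope for new builds

# Implied price change per additional square foot in each group
slope_old <- b_size
slope_new <- b_size + b_size_new

print(c(slope_old = slope_old, slope_new = slope_new))
```

Under these estimates, each additional square foot is associated with roughly $103 more for an older home and roughly $165 more for a new build.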

Interactive Model Plots

With interactive models, it is important to plot the results to better understand how the Z variable is influencing the relationship between your X and Y variables. We will use the ggpredict function from the ggeffects package to easily plot the results from our regression.

Before we plot the interactive results, we will look at the non-interactive results. In the ggpredict code, we must specify the name of the model we want to plot, lm1, and the specific independent variables we want to include. Here those are the Size variable and the New variable, since these are the two variables included in the interaction term. We explicitly specify the values we want our Size variable to take on: we start at a 700 square foot home and step toward 4,000 square feet in increments of 200.
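To preview which sizes such a grid would contain, the same start, end, and step can be passed to base R's seq. This is only a sketch: it assumes the bracketed term specification is evaluated like seq(700, 4000, by = 200), which is an assumption about ggpredict and not something confirmed in this tutorial. The name size_grid is made up for illustration.

```r
# Base-R equivalent of a grid from 700 to 4000 in steps of 200
# (assumed to mirror the ggpredict term specification)
size_grid <- seq(700, 4000, by = 200)

print(size_grid)
print(length(size_grid))  # number of grid points
print(max(size_grid))     # largest size actually reached
```

Because 4,000 is not a whole number of 200-unit steps away from 700, a grid built this way stops at the last value that does not exceed 4,000.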

Following this, we use ggplot to graph the predicted home selling price at each specified size level, with a separate line for each value of the new build variable. Note that we typically want to include confidence intervals on these plots, but here they are removed to simplify the visualization.

library(ggeffects)  # ggpredict()
library(ggplot2)    # ggplot() and related functions

non_int <- ggpredict(lm1, terms = c("Size [700:4000, by=200]", "New")) # Manually set break points

# One line per value of New; no confidence bands in this plot
p1 <- ggplot(non_int, aes(x = x, y = predicted, color = factor(group), group = factor(group))) +
  geom_line() +
  labs(title = "Predicted Home Selling Price",
       x = "Square Footage",
       y = "Selling Price",
       color = "New Home") +
  scale_color_manual(values = c("0" = "blue", "1" = "red"),
                     labels = c("0" = "Old", "1" = "New")) +
  theme_minimal()

p1

Notice how the two lines run parallel to each other. This is because this is a non-interactive model. With this model, we are theoretically saying that the relationship between the square footage of a home and its selling price is the same for new builds as it is for old builds. The only difference between the two in this model is the intercept, since new builds on average, holding everything else in the model constant, sell for about $54K more than old builds.

What we know from the significant coefficient on the interaction term in the interactive model is that the relationship between the size of a home and its selling price is statistically different for new builds versus old builds. Now let’s graph the interactive model results.

We use the exact same code as above to do this except we specify the interactive model, lmx, instead of the non-interactive model name.

interactive<-ggpredict(lmx, terms=c("Size [700:4000, by=200]", "New"))

# p2: interactive model without confidence bands (reused in the side-by-side plot)
p2 <- ggplot(interactive, aes(x = x, y = predicted, color = factor(group), group = factor(group))) +
  geom_line() +
  labs(title = "Predicted Home Selling Price by New Build Status",
       x = "Square Footage",
       y = "Predicted Selling Price",
       color = "New Home or Not") +
  scale_color_manual(values = c("0" = "blue", "1" = "red"),
                     labels = c("0" = "Old", "1" = "New")) +
  theme_minimal()

# Same plot with confidence bands added via geom_ribbon (printed, not stored)
ggplot(interactive, aes(x = x, y = predicted, color = factor(group), group = factor(group))) +
  geom_line() +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.05) +
  labs(title = "Predicted Home Selling Price by New Build Status with Confidence Intervals",
       x = "Square Footage",
       y = "Predicted Selling Price",
       color = "New Home or Not") +
  scale_color_manual(values = c("0" = "blue", "1" = "red"),
                     labels = c("0" = "Old", "1" = "New")) +
  theme_minimal()

p2

Notice in the predicted home selling price plot that the slopes of the two lines are different. This reflects the significant interaction term from the interactive model. At smaller square footages, there is no clear difference in predicted selling price between new and older homes. However, as the size of the home increases, new builds, on average and holding everything else constant, sell for significantly more than older homes.
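The widening gap between the two lines can also be worked out directly from the reported coefficients. The sketch below copies the New and Size:New point estimates from column 2 of the table; the function name gap and the example sizes are made up for illustration. It gives point estimates only. Judging whether a gap is statistically distinguishable from zero at a given size requires the coefficient covariance matrix from the fitted model, which the table does not report.

```r
# Point estimates copied from column (2) of the stargazer table
b_new      <- -80071.280  # New
b_size_new <- 61.730      # Size:New

# Predicted new-minus-old price gap at a given size.
# Beds and Baths cancel out because they are held at the same values.
gap <- function(size) b_new + b_size_new * size

sizes <- c(700, 1500, 2500, 4000)  # hypothetical example sizes
print(data.frame(size = sizes, gap = round(gap(sizes))))

# Size at which the two predicted lines cross (gap of zero)
crossover <- -b_new / b_size_new
print(round(crossover))
```

Under these estimates the lines cross near 1,297 square feet: below that size the point estimate for new builds is lower than for older homes, and above it the predicted premium for new builds grows by about $62 per additional square foot.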

With these data, leaving out the interaction term would misrepresent the true relationship between square footage and home selling price, because it would assume that the influence of size is the same regardless of whether the home is a new build.

To finish, let’s look at the two plots side-by-side. Here we use the patchwork package to combine the two plots we created above.

library(patchwork)  # lets ggplot objects be combined with +
combined_plot <- p1 + p2 + plot_layout(ncol = 2)
combined_plot

Discussion

In this tutorial, we estimated a basic interactive regression model, interpreted the raw regression results, and graphed the model. Generally, anytime you have a significant or borderline significant interaction term, you should graph the results to better understand what is occurring. Here, we had clear visual evidence, as well as the significant interaction term, that the relationship between home size and home selling price varies depending on whether the home is new.