1. Interaction term

Recall that for a linear model with multiple predictors, the modeling function is:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n\]

The coefficients \(\beta_1, \beta_2, \cdots, \beta_n\) represent the increase (or decrease) in \(y\) for each unit increase in \(x_1, x_2, \cdots, x_n\), respectively, holding all other predictors constant.

Quite obviously, this assumes that the effect of each predictor is separable and that there is no interaction between them. In other words, the values of \(x_2, x_3, \cdots, x_n\) have nothing to do with \(\beta_1\). This may not be true in real situations.
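Formally, the marginal effect of \(x_1\) under this model is

\[ \frac{\partial y}{\partial x_1} = \beta_1\] a constant that does not involve \(x_2, x_3, \cdots, x_n\) at all.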

To study this effect, we will analyze an advertising data set from the textbook Introduction to Statistical Learning. The data set can be downloaded from: https://www.statlearning.com/s/Advertising.csv


The Advertising data set

The data set has four variables, where sales is the target variable, measured in thousands of units sold for a particular product in 200 different markets. The three predictors TV, radio, and newspaper give the advertising expenditure (in thousands of dollars) in each of the three media for the different markets. (The CSV file also contains a row-index column, which shows up as ...1 below.)

If we plot sales against each of the predictors, we obtain the following graphs:
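A minimal sketch of how such plots can be produced (assuming the tidyverse is loaded and the advertising data frame has been read in, as in the code chunk further below):

advertising %>%
  pivot_longer(c(TV, radio, newspaper), names_to = "medium", values_to = "budget") %>%
  ggplot(aes(budget, sales)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~ medium, scales = "free_x")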

So we see that, on average, sales increase with all three variables. Now let’s model this problem, and we may get some different insights.

First, let’s do a simple linear regression with each of the three variables.

library(tidyverse)
advertising <- read_csv("~/Documents/Fei Tian/Course_DAS422_Exploratory_Data_Analysis_and_Visualization_Spring2023/Datasets/Advertising.csv")
glimpse(advertising)
## Rows: 200
## Columns: 5
## $ ...1      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ TV        <dbl> 230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6, 199.…
## $ radio     <dbl> 37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6, 2.1, 2.6, 5.…
## $ newspaper <dbl> 69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6, 1.0, 21.2, 2…
## $ sales     <dbl> 22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.6, 8.6…
attach(advertising)
lm.fit.tv <- lm(sales ~ TV)
summary(lm.fit.tv)
## 
## Call:
## lm(formula = sales ~ TV)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3860 -1.9545 -0.1913  2.0671  7.2124 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.032594   0.457843   15.36   <2e-16 ***
## TV          0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16

To interpret the result, we see that on average, an additional thousand dollars spent on TV advertising results in 0.047 thousand, or about 47, additional units sold. The p-value is very low, indicating that we can safely reject the null hypothesis that TV advertising has no effect (\(\beta_1 = 0\)).
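As a quick sanity check of this interpretation, we can compare the predictions for two TV budgets that differ by one thousand dollars; the difference equals the TV coefficient:

predict(lm.fit.tv, newdata = data.frame(TV = c(100, 101)))
# the two predicted values differ by about 0.0475 thousand units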

Similarly, let’s now do this for the other two variables:

lm.fit.radio <- lm(sales ~ radio)
summary(lm.fit.radio)
## 
## Call:
## lm(formula = sales ~ radio)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.7305  -2.1324   0.7707   2.7775   8.1810 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.31164    0.56290  16.542   <2e-16 ***
## radio        0.20250    0.02041   9.921   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.275 on 198 degrees of freedom
## Multiple R-squared:  0.332,  Adjusted R-squared:  0.3287 
## F-statistic: 98.42 on 1 and 198 DF,  p-value: < 2.2e-16
lm.fit.news <- lm(sales ~ newspaper)
summary(lm.fit.news)
## 
## Call:
## lm(formula = sales ~ newspaper)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2272  -3.3873  -0.8392   3.5059  12.7751 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.35141    0.62142   19.88  < 2e-16 ***
## newspaper    0.05469    0.01658    3.30  0.00115 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.092 on 198 degrees of freedom
## Multiple R-squared:  0.05212,    Adjusted R-squared:  0.04733 
## F-statistic: 10.89 on 1 and 198 DF,  p-value: 0.001148

So we see that each additional thousand dollars spent on radio and newspaper advertising results in about 203 and 55 more units sold, respectively. It seems that radio is the most effective advertising medium by a large margin, while TV and newspaper have similar effects.

Now let’s do the multiple linear regression, that is, study the effects of all three variables together.

lm.fit.all <- lm(sales ~ TV + radio + newspaper)
summary(lm.fit.all)
## 
## Call:
## lm(formula = sales ~ TV + radio + newspaper)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8277 -0.8908  0.2418  1.1893  2.8292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## radio        0.188530   0.008611  21.893   <2e-16 ***
## newspaper   -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

This result is different from those of the simple linear regressions. Now the coefficient of newspaper is very small, and its p-value suggests that there is no significant association between sales and newspaper once TV and radio are accounted for.

Does it make sense for the multiple regression to suggest no relationship between sales and newspaper while the simple linear regression implies the opposite? In fact it does. Consider the correlation matrix for the three predictor variables and response variable:

cor(advertising)
##                  ...1         TV       radio   newspaper       sales
## ...1       1.00000000 0.01771469 -0.11068044 -0.15494414 -0.05161625
## TV         0.01771469 1.00000000  0.05480866  0.05664787  0.78222442
## radio     -0.11068044 0.05480866  1.00000000  0.35410375  0.57622257
## newspaper -0.15494414 0.05664787  0.35410375  1.00000000  0.22829903
## sales     -0.05161625 0.78222442  0.57622257  0.22829903  1.00000000

Notice that the correlation between radio and newspaper is 0.35. This indicates that markets with high newspaper advertising tend to also have high radio advertising.

Now suppose that the multiple regression is correct and newspaper advertising is not associated with sales, but radio advertising is associated with sales. Then in markets where we spend more on radio our sales will tend to be higher, and as our correlation matrix shows, we also tend to spend more on newspaper advertising in those same markets.

Hence, in a simple linear regression which only examines sales versus newspaper, we will observe that higher values of newspaper tend to be associated with higher values of sales, even though newspaper advertising is not directly associated with sales. Newspaper advertising is thus a surrogate for radio advertising; newspaper gets “credit” for the association between radio and sales.
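One way to see this directly is to regress sales on newspaper while also including radio in the model; we would expect the newspaper coefficient to shrink towards zero once radio is accounted for (the object name lm.fit.news.radio below is just illustrative):

lm.fit.news.radio <- lm(sales ~ newspaper + radio)
summary(lm.fit.news.radio)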

There is rich theory and information behind the analysis above. Here we only give a brief introduction through this example. For more details, you may refer to the textbook Introduction to Statistical Learning.

It is also worthwhile to note that regression models only capture association, not causation. So there is no guarantee that if we increase expenditure in radio advertising, there will be more sales observed.


Interaction effect

Now let’s go back to the interaction effect. The statistical model for multiple linear regression is:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \epsilon\] where \(\epsilon\) is assumed to incorporate all factors related to \(y\) that are missing from the data set, so we cannot model them using \(\beta_1, \beta_2, \cdots, \beta_n\). This term is called the irreducible error, since no model built from the available predictors can predict or eliminate it.

In linear regression theory, \(\epsilon\) is assumed to satisfy

  • \(\epsilon\) should be normally distributed
  • \(\epsilon\) should be independent of all predictors
  • \(\epsilon\) should have a constant variance

In real problems, any of these assumptions can be violated. In particular, violations of the second one can sometimes be resolved by introducing interaction terms and nonlinear terms, as shown below.
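These assumptions can be checked informally with residual diagnostics; a minimal sketch using the full model lm.fit.all fitted above:

plot(lm.fit.all, which = 2)  # normal Q-Q plot of residuals: points close to the line support normality
plot(lm.fit.all, which = 1)  # residuals vs fitted: a trend suggests dependence on the predictors; a funnel shape suggests non-constant variance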

We already know that the association between sales and newspaper is weak once we account for radio. Therefore, let’s remove newspaper from our model.

lm.fit.tvradio <- lm(sales ~ TV + radio)

Now let’s check the residual plot to see whether there is any non-linear effect or heterogeneous variance:

plot(lm.fit.tvradio, which = 1)

So we see a relatively clear non-linear trend here, which suggests a potential non-linear relationship.

Now let’s see whether the residual depends on \(TV \times radio\).

advertising <- advertising %>%
  mutate(res_tvradio = residuals(lm.fit.tvradio))

ggplot(advertising, mapping = aes(TV*radio, res_tvradio)) +
  geom_point() + 
  geom_smooth(method = "lm", color = "red")

Here we first add the residuals of the last fit to the data set, and then plot them against \(TV \times radio\). We see a pretty obvious increasing trend here, so this term should indeed be included in the fit.


Synergy effect

The product of two variables, such as \(TV \times radio\), is called an interaction term. Now the regression model becomes:

\[ sales = \beta_0 + \beta_1 \times TV + \beta_2 \times radio + \beta_{12} \times TV \times radio\] What is the meaning of this model? It simply means that when there is a unit increase in TV advertising expenditure, the increase in sales is no longer a constant but depends on the value of radio as well, as we can see by rearranging:

\[ sales = \beta_0 + (\beta_1 + \beta_{12} \times radio) \times TV + \beta_2 \times radio\] Symmetrically, a unit increase in radio advertising expenditure also results in different changes in sales, depending on the value of TV.

\[ sales = \beta_0 + \beta_1 \times TV + (\beta_2 + \beta_{12} \times TV) \times radio\]

Such an effect is commonly seen in real life and is called a synergy effect. That is, if the company does TV and radio advertising at the same time, each medium helps the other become more effective.


Adding interaction term in R

In R, adding an interaction term can be done in a few ways:

lm(target ~ var1 + var2 + var1:var2)
lm(target ~ var1 * var2)
lm(target ~ (var1 + var2)^2)

All three ways of writing the formula give the same model:

\[ target = \beta_0 + \beta_1 \times var1 + \beta_2 \times var2 + \beta_{12} \times var1 \times var2\] Note that in an R formula, ~ (var1 + var2)^2 is not the same as \(var1^2 + var2^2 + 2 \times var1 \times var2\), nor does var1 * var2 mean the arithmetic product \(var1 \times var2\).
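If in doubt, we can verify that all three formulas expand to the same design matrix:

head(model.matrix(sales ~ TV + radio + TV:radio, data = advertising))
head(model.matrix(sales ~ TV * radio, data = advertising))
head(model.matrix(sales ~ (TV + radio)^2, data = advertising))

All three produce the columns (Intercept), TV, radio, and TV:radio.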

Now let’s do the regression using the new model:

lm.fit.interact <- lm(sales ~ (TV + radio)^2)
plot(lm.fit.interact, which = 1)

We see that there is still some nonlinear effect left, but the magnitude of the residuals becomes smaller overall, so this is a better fit than the previous one (though a better in-sample fit is not always desirable, due to the potential for overfitting).
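Since the model without the interaction term (lm.fit.tvradio) is nested in this one, the two fits can also be compared formally with an F-test; a minimal sketch:

anova(lm.fit.tvradio, lm.fit.interact)  # tests whether the interaction term significantly improves the fit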

Let’s look at the fit summary:

coef(lm.fit.interact)
## (Intercept)          TV       radio    TV:radio 
## 6.750220203 0.019101074 0.028860340 0.001086495
summary(lm.fit.interact)
## 
## Call:
## lm(formula = sales ~ (TV + radio)^2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3366 -0.4028  0.1831  0.5948  1.5246 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.750e+00  2.479e-01  27.233   <2e-16 ***
## TV          1.910e-02  1.504e-03  12.699   <2e-16 ***
## radio       2.886e-02  8.905e-03   3.241   0.0014 ** 
## TV:radio    1.086e-03  5.242e-05  20.727   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared:  0.9678, Adjusted R-squared:  0.9673 
## F-statistic:  1963 on 3 and 196 DF,  p-value: < 2.2e-16

We see all three coefficients have a low p-value, suggesting that we should reject the null hypothesis that they are zero. So all three terms have a non-negligible effect on sales in this model.
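To make the synergy concrete, the fitted coefficients can be used to evaluate the marginal effect of TV, \(\beta_1 + \beta_{12} \times radio\), at a couple of radio levels (the values 0 and 50 below are arbitrary illustrative choices):

b <- coef(lm.fit.interact)
b["TV"] + b["TV:radio"] * 0   # about 0.019 thousand (19 units) per extra thousand dollars of TV when radio = 0
b["TV"] + b["TV:radio"] * 50  # about 0.073 thousand (73 units) when radio = 50: the synergy effect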


Polynomial regression

Next, let’s see whether we can improve the model to account for the remaining nonlinear effect in the residual plot. Let’s see which variable accounts for it:

plot(advertising$TV, lm.fit.interact$residuals)

plot(advertising$radio, lm.fit.interact$residuals)

plot(advertising$radio * advertising$TV, lm.fit.interact$residuals)

So it seems that the residual is a quadratic function of TV. Then let’s throw the \(TV^2\) term into the model.

lm.fit.interact2 <- lm(sales ~ (TV + radio)^2 + I(TV^2))
summary(lm.fit.interact2)
## 
## Call:
## lm(formula = sales ~ (TV + radio)^2 + I(TV^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9949 -0.2969 -0.0066  0.3798  1.1686 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.137e+00  1.927e-01  26.663  < 2e-16 ***
## TV           5.092e-02  2.232e-03  22.810  < 2e-16 ***
## radio        3.516e-02  5.901e-03   5.959 1.17e-08 ***
## I(TV^2)     -1.097e-04  6.893e-06 -15.920  < 2e-16 ***
## TV:radio     1.077e-03  3.466e-05  31.061  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6238 on 195 degrees of freedom
## Multiple R-squared:  0.986,  Adjusted R-squared:  0.9857 
## F-statistic:  3432 on 4 and 195 DF,  p-value: < 2.2e-16

To throw in a quadratic term, we use I(TV^2). The expression inside I() is interpreted arithmetically, just as written, rather than as formula syntax.
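This matters because ^ has a special meaning in a formula: without I(), sales ~ TV^2 expands to just sales ~ TV. A quick check:

coef(lm(sales ~ TV^2))  # identical to sales ~ TV: the formula operator ^ does not square TV
coef(lm(sales ~ TV))

Now let’s look at our residual plots.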

plot(advertising$TV, lm.fit.interact2$residuals)

plot(advertising$radio, lm.fit.interact2$residuals)

plot(advertising$radio * advertising$TV, lm.fit.interact2$residuals)

There is still some visible nonlinear relationship between the residuals and TV. So we can keep throwing more power terms of TV into the model, such as \(TV^3\), \(TV^4\), \(TV^5\), until this association disappears.

lm.fit.interact5 <- lm(sales ~ (TV + radio)^2 + I(TV^2) + I(TV^3) + I(TV^4) + I(TV^5))
summary(lm.fit.interact5)
## 
## Call:
## lm(formula = sales ~ (TV + radio)^2 + I(TV^2) + I(TV^3) + I(TV^4) + 
##     I(TV^5))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.96970 -0.17890  0.00147  0.20770  1.03572 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.698e+00  2.070e-01  13.034  < 2e-16 ***
## TV           2.015e-01  1.271e-02  15.856  < 2e-16 ***
## radio        4.302e-02  3.906e-03  11.014  < 2e-16 ***
## I(TV^2)     -2.621e-03  2.649e-04  -9.893  < 2e-16 ***
## I(TV^3)      1.743e-05  2.258e-06   7.719 6.27e-13 ***
## I(TV^4)     -5.473e-08  8.381e-09  -6.530 5.74e-10 ***
## I(TV^5)      6.467e-11  1.124e-11   5.752 3.43e-08 ***
## TV:radio     1.036e-03  2.288e-05  45.272  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4087 on 192 degrees of freedom
## Multiple R-squared:  0.9941, Adjusted R-squared:  0.9939 
## F-statistic:  4605 on 7 and 192 DF,  p-value: < 2.2e-16
plot(lm.fit.interact5, which = 1)

plot(lm.fit.interact5, which = 2)

In principle, these models are still linear models in the sense that the target variable is still a linear function of the predictors; the predictors themselves, however, are now nonlinear functions of the original predictors. When those functions are power terms, the technique is called polynomial regression.
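For a single predictor \(x\), a degree-\(d\) polynomial regression takes the form

\[ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_d x^d + \epsilon\] which is still linear in the coefficients \(\beta_0, \beta_1, \cdots, \beta_d\).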

Now we can visualise the goodness of fit against the original data. Since only two underlying variables are involved (the other predictors are simply functions of TV and radio), it is still possible to visualise the data and our fitted surface in 3D. Here we use the function plotPlane from the package rockchalk.

library(rockchalk)

par(mfrow=c(1,2), mar=c(1,1,1,1))

plotPlane(lm.fit.interact5, "TV", "radio", pch=16, col=rgb(0,0,1,0.1), drawArrows=TRUE, alength=0, acol="red", alty=1,alwd=1, theta=315, phi=10)

plotPlane(lm.fit.interact5, "TV", "radio", pch=16, col=rgb(0,0,1,0.1), drawArrows=TRUE, alength=0, acol="red", alty=1,alwd=1, theta=315, phi=-10)

So we see that our fit is already quite accurate and the error is pretty small overall.


Lab Exercise: Understand the following code and see what it does

lm.fit.tvpoly <- lm(sales ~ poly(TV, 5))
plot(TV, sales)
ix <- sort(TV, index.return=T)$ix
lines(TV[ix], predict(lm.fit.tvpoly)[ix], col = "red", lwd = 2)