Lecture 7: Regression 2

Olesya Volchenko and Anna Shirokanova

April 17, 2023

In the previous lecture you learned that…

  1. Linear regression estimates the relationships between a set of predictors and a continuous outcome variable

  2. Regression coefficients quantify the direction and strength of these relationships

  3. The validity of regression coefficients relies on a number of assumptions (linearity, normally distributed and homoscedastic residuals, etc.)

A brief recap: nuts and bolts of regression modelling

Today you will learn:

Many predictors

The major advantage of regression modelling is its ability to account for many predictors.

How many predictors can we add to one regression model?

What is a good model?

Ways to compare your models

Model comparison with anova()

Let’s say we have two models: one contains only x1 as a predictor, and the other contains both x1 and x2.

Model comparison will then show us whether adding x2 as a predictor improves the model.

m1 <- lm(y ~ x1, data = data)
m2 <- lm(y ~ x1 + x2, data = data)
tab_model(m1, m2, show.ci = FALSE) # tab_model() comes from the sjPlot package
Predictors           m1: y                    m2: y
                     Estimates   p            Estimates   p
(Intercept)          -0.09       0.522        -0.14       0.127
x1                    0.79      <0.001         0.96      <0.001
x2                                             1.05      <0.001
Observations          100                      100
R2 / R2 adjusted      0.210 / 0.202            0.679 / 0.672
anova(m1, m2)
## Analysis of Variance Table
## 
## Model 1: y ~ x1
## Model 2: y ~ x1 + x2
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1     98 208.624                                  
## 2     97  84.722  1     123.9 141.86 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Read more on comparing models with anova here: https://bookdown.org/ndphillips/YaRrr/comparing-regression-models-with-anova.html

Nested models

anova() is only helpful when you are comparing nested models.

Models are considered nested if the simpler model can be obtained by removing predictors from the more complex model.

For example,

Model 1: y = a + b

Model 2: y = a + b + c

Model 3: y = a + c

Models 1 and 2 are nested, Models 3 and 2 are nested, Models 1 and 3 are not nested.
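A minimal sketch in R (assuming a data frame dat with outcome y and predictors a, b, and c):

m1 <- lm(y ~ a + b, data = dat)      # Model 1
m2 <- lm(y ~ a + b + c, data = dat)  # Model 2
m3 <- lm(y ~ a + c, data = dat)      # Model 3
anova(m1, m2)  # valid comparison: Model 1 is nested in Model 2
anova(m3, m2)  # valid comparison: Model 3 is nested in Model 2
# anova(m1, m3) is not meaningful: neither model is nested in the other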

An illustration of model comparison


The general idea of control variables

In experiments we can randomize and control conditions.

But this is not possible in observational studies. Therefore we need to control for a set of variables (usually socio-demographics) in order to take those uncontrolled differences into account and to be able to tell whether the predictor of interest is indeed related to the outcome.

Example: happiness and Internet use
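A hypothetical sketch of what this could look like in R (the data frame and variable names below are made up for illustration):

m_raw  <- lm(happiness ~ internet_use, data = survey)                          # predictor of interest only
m_ctrl <- lm(happiness ~ internet_use + age + gender + income, data = survey)  # plus socio-demographic controls
tab_model(m_raw, m_ctrl, show.ci = FALSE)  # does the internet_use coefficient hold up after controls?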

What controlling for a variable looks like

Why do we need control variables? An example

Source: Blavatskyy, P. (2021). Obesity of politicians and corruption in post‐Soviet countries. Economics of Transition and Institutional Change, 29(2), 343-356.

Any questions so far?

Additivity is an assumption of linear regression

This is all good if you observe parallel slopes

Recall the ‘parallel slopes’ model from the Correlation and Regression DataCamp course:

https://newonlinecourses.science.psu.edu/stat502/node/187/

But what if your hypothesis states that the relationship differs between groups? For example:

The general idea of moderation

Theory will tell you which predictor is X (the focal predictor) and which is Z (the moderator)

Three coefficients to estimate
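In equation form (with intercept b0 and error term e), the moderation model is

\[y = b_0 + b_1 X + b_2 Z + b_3 (X \cdot Z) + e\]

The three coefficients of interest are b1 (the effect of X when Z = 0), b2 (the effect of Z when X = 0), and b3, the interaction, which tells us how the effect of X changes as Z changes.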

Interaction effect in simple words

How to check for interactions in a linear model?

Example 1: Does the relationship between Length and Width depend on Species? (iris data)

M_add <- lm(Y ~ X + Z, data = data) # additive model
M_int <- lm(Y ~ X * Z, data = data) # interaction model
anova(M_add, M_int)                 # compare the two nested models

Predictors                            Petal Length (interaction model)   Petal Length (additive model)
                                      Estimates   p                      Estimates   p
(Intercept)                            1.18       0.019                  -0.18       0.596
Sepal Width                            0.08       0.576                   0.48      <0.001
Species [versicolor]                   0.75       0.284                   3.11      <0.001
Species [virginica]                    2.33       0.001                   4.31      <0.001
Sepal Width × Species [versicolor]     0.76       0.001
Sepal Width × Species [virginica]      0.60       0.008
Observations                           150                                150
R2 / R2 adjusted                       0.954 / 0.952                      0.950 / 0.949
## Analysis of Variance Table
## 
## Model 1: Petal.Length ~ Sepal.Width * Species
## Model 2: Petal.Length ~ Sepal.Width + Species
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1    144 21.376                                
## 2    146 23.335 -2   -1.9587 6.5973 0.001814 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let’s explore interactions with a simulated dataset

##            y          x cond
## 1  3.5416709 -0.6264538    a
## 2  0.7706567  0.1836433    a
## 3  5.7647315 -0.8356286    a
## 4 -8.3073118  1.5952808    a
## 5 -3.9327744  0.3295078    a
## 6  6.6000035 -0.8204684    a
##        y                  x            cond   
##  Min.   :-11.8029   Min.   :-2.21470   a:100  
##  1st Qu.: -0.7279   1st Qu.:-0.61383   b:100  
##  Median :  3.8151   Median :-0.04937          
##  Mean   :  2.2873   Mean   : 0.03554          
##  3rd Qu.:  5.5384   3rd Qu.: 0.61300          
##  Max.   :  9.7033   Max.   : 2.40162
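For reference, data with this structure could be simulated roughly as follows (an assumed sketch; the slopes and intercepts are read off the models fitted below, not the original simulation code):

set.seed(1)
x <- rnorm(200)
cond <- factor(rep(c("a", "b"), each = 100))
# assumed data-generating process: slope of about -5 in group a,
# intercept of about 5 and slope of about -1 in group b
y <- ifelse(cond == "a", -5 * x, 5 - 1 * x) + rnorm(200)
df <- data.frame(y, x, cond)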

Let’s start with an additive model

## 
## Call:
## lm(formula = y ~ x + cond, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3161 -1.4148 -0.1941  1.3910  4.8887 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.1997     0.2091  -0.955    0.341    
## x            -2.8938     0.1595 -18.144   <2e-16 ***
## condb         5.1797     0.2956  17.521   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.084 on 197 degrees of freedom
## Multiple R-squared:  0.7781, Adjusted R-squared:  0.7759 
## F-statistic: 345.4 on 2 and 197 DF,  p-value: < 2.2e-16

Now, let’s estimate an interaction between cond and x

## 
## Call:
## lm(formula = y ~ x * cond, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.91411 -0.55945 -0.05044  0.64814  2.64157 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.02737    0.10250   0.267     0.79    
## x           -4.97883    0.11384 -43.734   <2e-16 ***
## condb        5.02195    0.14448  34.760   <2e-16 ***
## x:condb      3.91835    0.15607  25.107   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.017 on 196 degrees of freedom
## Multiple R-squared:  0.9474, Adjusted R-squared:  0.9466 
## F-statistic:  1176 on 3 and 196 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Model 1: y ~ x + cond
## Model 2: y ~ x * cond
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    197 855.42                                  
## 2    196 202.89  1    652.52 630.36 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
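Before looking at the plot, we can recover the group-specific lines from the interaction model’s output above (rounded coefficients):

b <- c(intercept = 0.03, x = -4.98, condb = 5.02, x_condb = 3.92)  # rounded estimates from the model above
b["intercept"]               # intercept in group a:  0.03
b["x"]                       # slope of x in group a: -4.98
b["intercept"] + b["condb"]  # intercept in group b:  5.05
b["x"] + b["x_condb"]        # slope of x in group b: -1.06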

Now, let’s see on a plot where all the coefficients are


Now, your turn to draw an interaction

Draw an interaction plot for the following equation: \[y = 2 + 1 \cdot x - 3 \cdot cond(b) + 1 \cdot x \cdot cond(b)\] Can you tell whether the relationship between x and y is stronger in group a or in group b?


Use a red line for condition a and a blue one for condition b.
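One way to sketch it in base R (the line for group a has intercept 2 and slope 1; the line for group b has intercept 2 - 3 = -1 and slope 1 + 1 = 2):

x <- seq(-3, 3, length.out = 100)
plot(x, 2 + 1 * x, type = "l", col = "red", ylim = c(-8, 8), ylab = "y")  # group a
lines(x, (2 - 3) + (1 + 1) * x, col = "blue")                             # group b
legend("topleft", legend = c("cond a", "cond b"), col = c("red", "blue"), lty = 1)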

Example 2: Interaction: continuous by continuous variable

Problem: Predict aggression by hours of playing violent video games and lack of empathy:

Interaction: continuous by continuous variable: Interpretation


Interaction: continuous by continuous variable: R code

library(foreign)
df <- read.spss("Video Games.sav", to.data.frame = TRUE, use.value.labels = TRUE)
lmfit1 <- lm(Aggression ~ Vid_Games, data = df)
lmfit2 <- lm(Aggression ~ Vid_Games + CaUnTs, data = df)
lmfit3 <- lm(Aggression ~ Vid_Games * CaUnTs, data = df)
anova(lmfit1, lmfit2, lmfit3)
## Analysis of Variance Table
## 
## Model 1: Aggression ~ Vid_Games
## Model 2: Aggression ~ Vid_Games + CaUnTs
## Model 3: Aggression ~ Vid_Games * CaUnTs
##   Res.Df   RSS Df Sum of Sq       F    Pr(>F)    
## 1    440 68789                                   
## 2    439 45088  1   23700.2 238.129 < 2.2e-16 ***
## 3    438 43593  1    1495.7  15.028 0.0001221 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tab_model(lmfit1, lmfit2, lmfit3)
Predictors           lmfit1: Aggression                lmfit2: Aggression                lmfit3: Aggression
                     Estimates  CI              p      Estimates  CI              p      Estimates  CI               p
(Intercept)          34.84      30.99 – 38.70   <0.001 21.76      18.21 – 25.32   <0.001 33.12      26.38 – 39.86    <0.001
Vid Games            0.24       0.07 – 0.41     0.006  0.19       0.05 – 0.32     0.007  -0.33      -0.63 – -0.04    0.027
CaUnTs                                                 0.76       0.66 – 0.86     <0.001 0.17       -0.15 – 0.49     0.295
Vid Games × CaUnTs                                                                       0.03       0.01 – 0.04      <0.001
Observations         442                               442                               442
R2 / R2 adjusted     0.017 / 0.015                     0.356 / 0.353                     0.377 / 0.373
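The interaction coefficient means that the slope of Vid Games is not a single number but depends on CaUnTs (lack of empathy). Using the rounded lmfit3 estimates:

\[\text{slope of Vid Games} = -0.33 + 0.03 \cdot CaUnTs\]

So the video-game slope is negative at low CaUnTs and turns positive once CaUnTs exceeds roughly 0.33 / 0.03 ≈ 11. This conditional slope is exactly what interplot() draws below.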
library(interplot)
library(ggplot2) # for xlab() and ylab()
interplot(m = lmfit3, var1 = "Vid_Games", var2 = "CaUnTs")+ 
    xlab("CaUnTs") +
    ylab("Estimated Coefficient for Vid_Games")

plot_model(lmfit3, type = "int") # plot_model() comes from the sjPlot package

Can I add an interaction if both main effects are insignificant?

Yes, if you have a theory behind it.

Let’s take a look at one example

##            y          x cond
## 1 -18.792843 -17.976378    a
## 2  -7.790157  -6.435724    a
## 3  -4.874391   4.676773    a
## 4  -5.218673   0.128643    a
## 5   9.837180  -2.170496    a
## 6   1.052459  13.151498    a
## 
## Call:
## lm(formula = y ~ x + cond, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.675 -10.606   0.136  10.384  30.461 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  2.57415    1.38290   1.861   0.0642 .
## x           -0.04272    0.10024  -0.426   0.6705  
## condb       -1.91426    1.95613  -0.979   0.3290  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.83 on 197 degrees of freedom
## Multiple R-squared:  0.005658,   Adjusted R-squared:  -0.004436 
## F-statistic: 0.5605 on 2 and 197 DF,  p-value: 0.5718

## 
## Call:
## lm(formula = y ~ x * cond, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -27.4897  -6.4649   0.3047   6.7771  25.2179 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.4345     0.9759   2.495   0.0134 *  
## x             1.0021     0.1023   9.792   <2e-16 ***
## condb        -2.0668     1.3804  -1.497   0.1360    
## x:condb      -2.0008     0.1416 -14.128   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.758 on 196 degrees of freedom
## Multiple R-squared:  0.5073, Adjusted R-squared:  0.4998 
## F-statistic: 67.28 on 3 and 196 DF,  p-value: < 2.2e-16
Predictors         y
                   Estimates   CI              p
(Intercept)        2.43        0.51 – 4.36     0.013
x                  1.00        0.80 – 1.20     <0.001
cond [b]           -2.07       -4.79 – 0.66    0.136
x × cond [b]       -2.00       -2.28 – -1.72   <0.001
Observations       200
R2 / R2 adjusted   0.507 / 0.500
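Reading the coefficients: the slope of x is 1.00 in group a and 1.00 - 2.00 = -1.00 in group b. The two group lines cross, so when the interaction is left out the slopes average out to roughly zero, which is why neither main effect looked significant in the additive model. A quick way to visualize this (assuming the interaction model above was saved as m_int; the object name is assumed):

plot_model(m_int, type = "int") # from sjPlot: plots the crossing group lines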

Summary

Where to practise all that?

Questions?