Lecture 7: Regression 2

Olesya Volchenko and Anna Shirokanova

April 17, 2023

In the previous lecture you learned that…

  1. Linear regression estimates the relationships between a set of predictors and a continuous outcome variable

  2. Regression coefficients quantify the direction and strength of these relationships

  3. The validity of regression coefficients relies on a number of assumptions (linearity, normally distributed and homoscedastic residuals, etc.)

A brief recap: nuts and bolts of regression modelling

Today you will learn:

Many predictors

The major advantage of regression modelling is its ability to account for many predictors.

How many predictors can we add to one regression model?

What is a good model?

Ways to compare your models

Model comparison with anova()

Let’s say we have two models: one contains only x1 as a predictor, and the other contains both x1 and x2.

Model comparison will then show us whether adding x2 as a predictor improves the model.

m1 <- lm(y ~ x1, data = data)
m2 <- lm(y ~ x1 + x2, data = data)
tab_model(m1, m2, show.ci = FALSE) # tab_model() comes from the sjPlot package
Predictors           m1: y                    m2: y
                     Estimates   p            Estimates   p
(Intercept)          -0.09       0.522        -0.14       0.127
x1                    0.79      <0.001         0.96      <0.001
x2                                             1.05      <0.001
Observations          100                      100
R2 / R2 adjusted      0.210 / 0.202            0.679 / 0.672
anova(m1, m2)
## Analysis of Variance Table
## 
## Model 1: y ~ x1
## Model 2: y ~ x1 + x2
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1     98 208.624                                  
## 2     97  84.722  1     123.9 141.86 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Read more on comparing models with anova here: https://bookdown.org/ndphillips/YaRrr/comparing-regression-models-with-anova.html

Nested models

anova() is only helpful when you are comparing nested models.

Models are considered nested if the simpler model can be obtained by removing predictors from the more complex model.

For example,

Model 1: y = a + b

Model 2: y = a + b + c

Model 3: y = a + c

Models 1 and 2 are nested, Models 3 and 2 are nested, Models 1 and 3 are not nested.
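A minimal sketch in R (assuming a data frame dat with outcome y and predictors a, b, and c):

m1 <- lm(y ~ a + b, data = dat)      # Model 1
m2 <- lm(y ~ a + b + c, data = dat)  # Model 2
m3 <- lm(y ~ a + c, data = dat)      # Model 3
anova(m1, m2)  # valid comparison: Model 1 is nested in Model 2
anova(m3, m2)  # valid comparison: Model 3 is nested in Model 2
# anova(m1, m3) is not meaningful: neither model is nested in the other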

An illustration of model comparison


The general idea of control variables

In experiments we can randomize and control conditions.

But this is not possible in observational studies. Therefore we need to control for a set of variables (usually socio-demographics) in order to take those uncontrolled differences into account and to be able to tell whether the predictor of interest is indeed related to the outcome.

Example: happiness and Internet use
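A hypothetical sketch of what this could look like in R (the data frame and variable names below are made up for illustration):

m_raw  <- lm(happiness ~ internet_use, data = survey)                          # predictor of interest only
m_ctrl <- lm(happiness ~ internet_use + age + gender + income, data = survey)  # plus socio-demographic controls
tab_model(m_raw, m_ctrl, show.ci = FALSE)  # does the internet_use coefficient hold up after controls?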

What controlling for a variable looks like

Why do we need control variables? An example

Source: Blavatskyy, P. (2021). Obesity of politicians and corruption in post‐Soviet countries. Economics of Transition and Institutional Change, 29(2), 343-356.

Any questions so far?

Additivity is an assumption of linear regression

This is all good if you observe parallel slopes

Recall the ‘parallel slopes’ model from the Correlation and Regression DataCamp course:

https://newonlinecourses.science.psu.edu/stat502/node/187/

But what if your hypothesis states that the relationship differs between groups? For example:

The general idea of moderation

Theory will tell you which predictor is X (the focal predictor) and which is Z (the moderator)

Three coefficients to estimate
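In equation form (with intercept b0 and error term e), the moderation model is

\[y = b_0 + b_1 X + b_2 Z + b_3 (X \cdot Z) + e\]

The three coefficients of interest are b1 (the effect of X when Z = 0), b2 (the effect of Z when X = 0), and b3, the interaction, which tells us how the effect of X changes as Z changes.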

Interaction effect in simple words

How to check for interactions in a linear model?

Example 1: Does the relationship between Length and Width depend on Species? (iris data)

M_add <- lm(Y ~ X + Z, data = data) # additive model
M_int <- lm(Y ~ X * Z, data = data) # interaction model
anova(M_add, M_int)                 # compare the two nested models

Predictors                            Petal Length (interaction model)   Petal Length (additive model)
                                      Estimates   p                      Estimates   p
(Intercept)                            1.18       0.019                  -0.18       0.596
Sepal Width                            0.08       0.576                   0.48      <0.001
Species [versicolor]                   0.75       0.284                   3.11      <0.001
Species [virginica]                    2.33       0.001                   4.31      <0.001
Sepal Width × Species [versicolor]     0.76       0.001
Sepal Width × Species [virginica]      0.60       0.008
Observations                           150                                150
R2 / R2 adjusted                       0.954 / 0.952                      0.950 / 0.949
## Analysis of Variance Table
## 
## Model 1: Petal.Length ~ Sepal.Width * Species
## Model 2: Petal.Length ~ Sepal.Width + Species
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1    144 21.376                                
## 2    146 23.335 -2   -1.9587 6.5973 0.001814 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let’s explore interactions with a simulated dataset

##            y          x cond
## 1  3.5416709 -0.6264538    a
## 2  0.7706567  0.1836433    a
## 3  5.7647315 -0.8356286    a
## 4 -8.3073118  1.5952808    a
## 5 -3.9327744  0.3295078    a
## 6  6.6000035 -0.8204684    a
##        y                  x            cond   
##  Min.   :-11.8029   Min.   :-2.21470   a:100  
##  1st Qu.: -0.7279   1st Qu.:-0.61383   b:100  
##  Median :  3.8151   Median :-0.04937          
##  Mean   :  2.2873   Mean   : 0.03554          
##  3rd Qu.:  5.5384   3rd Qu.: 0.61300          
##  Max.   :  9.7033   Max.   : 2.40162
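For reference, data with this structure could be simulated roughly as follows (an assumed sketch; the slopes and intercepts are read off the models fitted below, not the original simulation code):

set.seed(1)
x <- rnorm(200)
cond <- factor(rep(c("a", "b"), each = 100))
# assumed data-generating process: slope of about -5 in group a,
# intercept of about 5 and slope of about -1 in group b
y <- ifelse(cond == "a", -5 * x, 5 - 1 * x) + rnorm(200)
df <- data.frame(y, x, cond)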

Let’s start with an additive model

## 
## Call:
## lm(formula = y ~ x + cond, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3161 -1.4148 -0.1941  1.3910  4.8887 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.1997     0.2091  -0.955    0.341    
## x            -2.8938     0.1595 -18.144   <2e-16 ***
## condb         5.1797     0.2956  17.521   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.084 on 197 degrees of freedom
## Multiple R-squared:  0.7781, Adjusted R-squared:  0.7759 
## F-statistic: 345.4 on 2 and 197 DF,  p-value: < 2.2e-16

Now, let’s estimate an interaction between cond and x

## 
## Call:
## lm(formula = y ~ x * cond, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.91411 -0.55945 -0.05044  0.64814  2.64157 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.02737    0.10250   0.267     0.79    
## x           -4.97883    0.11384 -43.734   <2e-16 ***
## condb        5.02195    0.14448  34.760   <2e-16 ***
## x:condb      3.91835    0.15607  25.107   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.017 on 196 degrees of freedom
## Multiple R-squared:  0.9474, Adjusted R-squared:  0.9466 
## F-statistic:  1176 on 3 and 196 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Model 1: y ~ x + cond
## Model 2: y ~ x * cond
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    197 855.42                                  
## 2    196 202.89  1    652.52 630.36 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
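Before looking at the plot, we can recover the group-specific lines from the interaction model’s output above (rounded coefficients):

b <- c(intercept = 0.03, x = -4.98, condb = 5.02, x_condb = 3.92)  # rounded estimates from the model above
b["intercept"]               # intercept in group a:  0.03
b["x"]                       # slope of x in group a: -4.98
b["intercept"] + b["condb"]  # intercept in group b:  5.05
b["x"] + b["x_condb"]        # slope of x in group b: -1.06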

Now, let’s see on a plot where all the coefficients are


Now, your turn to draw an interaction

Draw an interaction plot for the following equation: \[y = 2 + 1 \cdot x - 3 \cdot cond(b) + 1 \cdot x \cdot cond(b)\] Can you tell whether the relationship between x and y is stronger in group a or in group b?


Use a red line for condition a and a blue one for condition b.
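One way to sketch it in base R (the line for group a has intercept 2 and slope 1; the line for group b has intercept 2 - 3 = -1 and slope 1 + 1 = 2):

x <- seq(-3, 3, length.out = 100)
plot(x, 2 + 1 * x, type = "l", col = "red", ylim = c(-8, 8), ylab = "y")  # group a
lines(x, (2 - 3) + (1 + 1) * x, col = "blue")                             # group b
legend("topleft", legend = c("cond a", "cond b"), col = c("red", "blue"), lty = 1)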

Example 2: Interaction: continuous by continuous variable

Problem: Predict aggression by hours of playing violent video games and lack of empathy:

Interaction: continuous by continuous variable: Interpretation


Interaction: continuous by continuous variable: R code

library(foreign)
df <- read.spss("Video Games.sav", to.data.frame = TRUE, use.value.labels = TRUE)
lmfit1 <- lm(Aggression ~ Vid_Games, data = df)
lmfit2 <- lm(Aggression ~ Vid_Games + CaUnTs, data = df)
lmfit3 <- lm(Aggression ~ Vid_Games * CaUnTs, data = df)
anova(lmfit1, lmfit2, lmfit3)
## Analysis of Variance Table
## 
## Model 1: Aggression ~ Vid_Games
## Model 2: Aggression ~ Vid_Games + CaUnTs
## Model 3: Aggression ~ Vid_Games * CaUnTs
##   Res.Df   RSS Df Sum of Sq       F    Pr(>F)    
## 1    440 68789                                   
## 2    439 45088  1   23700.2 238.129 < 2.2e-16 ***
## 3    438 43593  1    1495.7  15.028 0.0001221 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tab_model(lmfit1, lmfit2, lmfit3)
Predictors           lmfit1: Aggression                lmfit2: Aggression                lmfit3: Aggression
                     Estimates  CI              p      Estimates  CI              p      Estimates  CI               p
(Intercept)          34.84      30.99 – 38.70   <0.001 21.76      18.21 – 25.32   <0.001 33.12      26.38 – 39.86    <0.001
Vid Games            0.24       0.07 – 0.41     0.006  0.19       0.05 – 0.32     0.007  -0.33      -0.63 – -0.04    0.027
CaUnTs                                                 0.76       0.66 – 0.86     <0.001 0.17       -0.15 – 0.49     0.295
Vid Games × CaUnTs                                                                       0.03       0.01 – 0.04      <0.001
Observations         442                               442                               442
R2 / R2 adjusted     0.017 / 0.015                     0.356 / 0.353                     0.377 / 0.373
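The interaction coefficient means that the slope of Vid Games is not a single number but depends on CaUnTs (lack of empathy). Using the rounded lmfit3 estimates:

\[\text{slope of Vid Games} = -0.33 + 0.03 \cdot CaUnTs\]

So the video-game slope is negative at low CaUnTs and turns positive once CaUnTs exceeds roughly 0.33 / 0.03 ≈ 11. This conditional slope is exactly what interplot() draws below.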
library(interplot)
library(ggplot2) # for xlab() and ylab()
interplot(m = lmfit3, var1 = "Vid_Games", var2 = "CaUnTs")+ 
    xlab("CaUnTs") +
    ylab("Estimated Coefficient for Vid_Games")

plot_model(lmfit3, type = "int") # plot_model() comes from the sjPlot package

Can I add an interaction if both main effects are insignificant?

Yes, if you have a theory behind it.

Let’s take a look at one example

##            y          x cond
## 1 -18.792843 -17.976378    a
## 2  -7.790157  -6.435724    a
## 3  -4.874391   4.676773    a
## 4  -5.218673   0.128643    a
## 5   9.837180  -2.170496    a
## 6   1.052459  13.151498    a
## 
## Call:
## lm(formula = y ~ x + cond, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.675 -10.606   0.136  10.384  30.461 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  2.57415    1.38290   1.861   0.0642 .
## x           -0.04272    0.10024  -0.426   0.6705  
## condb       -1.91426    1.95613  -0.979   0.3290  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.83 on 197 degrees of freedom
## Multiple R-squared:  0.005658,   Adjusted R-squared:  -0.004436 
## F-statistic: 0.5605 on 2 and 197 DF,  p-value: 0.5718

## 
## Call:
## lm(formula = y ~ x * cond, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -27.4897  -6.4649   0.3047   6.7771  25.2179 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.4345     0.9759   2.495   0.0134 *  
## x             1.0021     0.1023   9.792   <2e-16 ***
## condb        -2.0668     1.3804  -1.497   0.1360    
## x:condb      -2.0008     0.1416 -14.128   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.758 on 196 degrees of freedom
## Multiple R-squared:  0.5073, Adjusted R-squared:  0.4998 
## F-statistic: 67.28 on 3 and 196 DF,  p-value: < 2.2e-16
Predictors         y
                   Estimates   CI              p
(Intercept)        2.43        0.51 – 4.36     0.013
x                  1.00        0.80 – 1.20     <0.001
cond [b]           -2.07       -4.79 – 0.66    0.136
x × cond [b]       -2.00       -2.28 – -1.72   <0.001
Observations       200
R2 / R2 adjusted   0.507 / 0.500
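Reading the coefficients: the slope of x is 1.00 in group a and 1.00 - 2.00 = -1.00 in group b. The two group lines cross, so when the interaction is left out the slopes average out to roughly zero, which is why neither main effect looked significant in the additive model. A quick way to visualize this (assuming the interaction model above was saved as m_int; the object name is assumed):

plot_model(m_int, type = "int") # from sjPlot: plots the crossing group lines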

Summary

Where to practise all that?

Questions?