Olesya Volchenko and Anna Shirokanova
April 17, 2023
Linear regression serves to estimate the relationships between a set of predictors and a continuous outcome variable.
Regression coefficients quantify the direction and strength of these relationships.
The validity of regression coefficients relies on a number of assumptions (linearity, normally distributed and homoscedastic residuals, etc.).
The major advantage of regression modelling is its ability to account for many predictors.
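For instance, fitting a model with several predictors is a single lm() call. A minimal sketch (the data frame df and the variable names here are hypothetical):

fit <- lm(y ~ x1 + x2 + x3, data = df) # three predictors at once
summary(fit)                           # coefficients, p-values, R-squared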
How many predictors can we add to one regression model?
Let’s say we have two models: one contains only x1 as a predictor, and the other contains both x1 and x2.
Comparing these models will then show us whether adding x2 as a predictor improves the fit.
| Predictors | Model 1 (y): Estimates | p | Model 2 (y): Estimates | p |
|---|---|---|---|---|
| (Intercept) | -0.09 | 0.522 | -0.14 | 0.127 |
| x1 | 0.79 | <0.001 | 0.96 | <0.001 |
| x2 |  |  | 1.05 | <0.001 |
| Observations | 100 |  | 100 |  |
| R² / R² adjusted | 0.210 / 0.202 |  | 0.679 / 0.672 |  |
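The table above and the F-test below can be reproduced along these lines (a sketch assuming the simulated variables y, x1, and x2 sit in a data frame df; tab_model() from the sjPlot package is one way to get such a side-by-side table):

m1 <- lm(y ~ x1, data = df)      # Model 1: x1 only
m2 <- lm(y ~ x1 + x2, data = df) # Model 2: adds x2
sjPlot::tab_model(m1, m2)        # side-by-side coefficient table
anova(m1, m2)                    # F-test: does x2 improve the fit?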
## Analysis of Variance Table
##
## Model 1: y ~ x1
## Model 2: y ~ x1 + x2
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 98 208.624
## 2 97 84.722 1 123.9 141.86 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Read more on comparing models with anova here: https://bookdown.org/ndphillips/YaRrr/comparing-regression-models-with-anova.html
anova() is only helpful when you are comparing nested models.
Models are considered nested if the simpler model can be obtained by removing predictors from the more complex model.
For example,
Model 1: y = a + b
Model 2: y = a + b + c
Model 3: y = a + c
Models 1 and 2 are nested, Models 3 and 2 are nested, Models 1 and 3 are not nested.
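For non-nested models such as Models 1 and 3, the anova() F-test does not apply; one common alternative is an information criterion such as AIC. A minimal sketch, reusing the placeholder predictors a, b, and c:

fit1 <- lm(y ~ a + b, data = df) # Model 1
fit3 <- lm(y ~ a + c, data = df) # Model 3: not nested in Model 1
AIC(fit1, fit3)                  # the lower AIC suggests the better-fitting model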
In experiments, we can randomize and control conditions; this is not true for observational studies.
Therefore, we need to control for a set of variables (usually socio-demographics) to take those uncontrolled differences into account and to tell whether the predictor of interest is indeed related to the outcome.
Example: happiness and Internet use
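In practice, such controls are simply added as further predictors. A hypothetical sketch for the happiness example (all variable and data names are assumed):

# predictor of interest: internet_use; socio-demographic controls: age, gender, education
m_happy <- lm(happiness ~ internet_use + age + gender + education, data = survey_df)
summary(m_happy)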
Source: Blavatskyy, P. (2021). Obesity of politicians and corruption in post‐Soviet countries. Economics of Transition and Institutional Change, 29(2), 343-356.
When adding new predictors to the model, we imply that each has an independent relationship with the outcome.
In practice, however, the coefficient of X on Y can depend on the level of Z.
Z is called a ‘moderator’ variable, while the situation is called a ‘moderated relationship.’
In experimental research, it is also called the interaction effect.
How to check: fit both models and choose the better one: y ~ x + z or y ~ x * z.
Recall the ‘parallel slopes’ model from the Correlation and Regression DataCamp course.
Theory will tell you which predictor is X and which is Z.
In the additive model with a binary Z, there are three coefficients to estimate: the intercept, the slope of X, and the shift for Z.
Example 1: Does the relationship between Petal Length and Sepal Width depend on Species? (iris data)
M_add <- lm(Petal.Length ~ Sepal.Width + Species, data = iris) # additive model
M_int <- lm(Petal.Length ~ Sepal.Width * Species, data = iris) # interaction model
anova(M_int, M_add)                                            # compare the nested models
Dependent variable: Petal Length.

| Predictors | Interaction model: Estimates | p | Additive model: Estimates | p |
|---|---|---|---|---|
| (Intercept) | 1.18 | 0.019 | -0.18 | 0.596 |
| Sepal Width | 0.08 | 0.576 | 0.48 | <0.001 |
| Species [versicolor] | 0.75 | 0.284 | 3.11 | <0.001 |
| Species [virginica] | 2.33 | 0.001 | 4.31 | <0.001 |
| Sepal Width × Species [versicolor] | 0.76 | 0.001 |  |  |
| Sepal Width × Species [virginica] | 0.60 | 0.008 |  |  |
| Observations | 150 |  | 150 |  |
| R² / R² adjusted | 0.954 / 0.952 |  | 0.950 / 0.949 |  |
## Analysis of Variance Table
##
## Model 1: Petal.Length ~ Sepal.Width * Species
## Model 2: Petal.Length ~ Sepal.Width + Species
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 144 21.376
## 2 146 23.335 -2 -1.9587 6.5973 0.001814 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
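The moderation examples that follow use simulated data. The head() and summary() output below is consistent with a simulation along these lines (a sketch: the coefficients and the noise level are assumptions reverse-engineered from the fits further down, though set.seed(1) does reproduce the printed x values):

set.seed(1)
n <- 200
x <- rnorm(n)                                # continuous predictor
cond <- factor(rep(c("a", "b"), each = n/2)) # grouping variable, 100 cases each
# assumed data-generating process: slope -5 in group a, -1 in group b
y <- -5 * x + 5 * (cond == "b") + 4 * x * (cond == "b") + rnorm(n)
df <- data.frame(y, x, cond)
head(df)
summary(df)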
## y x cond
## 1 3.5416709 -0.6264538 a
## 2 0.7706567 0.1836433 a
## 3 5.7647315 -0.8356286 a
## 4 -8.3073118 1.5952808 a
## 5 -3.9327744 0.3295078 a
## 6 6.6000035 -0.8204684 a
## y x cond
## Min. :-11.8029 Min. :-2.21470 a:100
## 1st Qu.: -0.7279 1st Qu.:-0.61383 b:100
## Median : 3.8151 Median :-0.04937
## Mean : 2.2873 Mean : 0.03554
## 3rd Qu.: 5.5384 3rd Qu.: 0.61300
## Max. : 9.7033 Max. : 2.40162
Let’s start with an additive model
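A minimal call (we name the object m_add; the formula is echoed in the Call line of the output):

m_add <- lm(y ~ x + cond, data = df) # parallel slopes: one slope for x, separate intercepts by cond
summary(m_add)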
##
## Call:
## lm(formula = y ~ x + cond, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3161 -1.4148 -0.1941 1.3910 4.8887
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1997 0.2091 -0.955 0.341
## x -2.8938 0.1595 -18.144 <2e-16 ***
## condb 5.1797 0.2956 17.521 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.084 on 197 degrees of freedom
## Multiple R-squared: 0.7781, Adjusted R-squared: 0.7759
## F-statistic: 345.4 on 2 and 197 DF, p-value: < 2.2e-16
Now, let’s estimate an interaction between cond and x
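Again, the call behind the output:

m_int <- lm(y ~ x * cond, data = df) # x * cond expands to x + cond + x:cond
summary(m_int)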
##
## Call:
## lm(formula = y ~ x * cond, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.91411 -0.55945 -0.05044 0.64814 2.64157
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.02737 0.10250 0.267 0.79
## x -4.97883 0.11384 -43.734 <2e-16 ***
## condb 5.02195 0.14448 34.760 <2e-16 ***
## x:condb 3.91835 0.15607 25.107 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.017 on 196 degrees of freedom
## Multiple R-squared: 0.9474, Adjusted R-squared: 0.9466
## F-statistic: 1176 on 3 and 196 DF, p-value: < 2.2e-16
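As before, the two nested models can be compared with an F-test:

anova(m_add, m_int) # does adding x:cond significantly reduce the residual sum of squares?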
## Analysis of Variance Table
##
## Model 1: y ~ x + cond
## Model 2: y ~ x * cond
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 197 855.42
## 2 196 202.89 1 652.52 630.36 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Draw an interaction plot for the following equation: \[y = 2 + 1 \cdot x - 3 \cdot \text{cond}_b + 1 \cdot x \cdot \text{cond}_b\]
Can you tell whether the relationship between x and y is stronger in group a or in group b?
Use a red line for condition a and a blue one for condition b.
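One way to draw the plot by hand is with geom_abline(), reading the intercepts and slopes off the equation: group a has intercept 2 and slope 1; group b has intercept 2 - 3 = -1 and slope 1 + 1 = 2 (a sketch; the axis limits are arbitrary):

library(ggplot2)
ggplot() +
  geom_abline(intercept = 2, slope = 1, colour = "red") +   # condition a: y = 2 + x
  geom_abline(intercept = -1, slope = 2, colour = "blue") + # condition b: y = -1 + 2x
  xlim(-3, 3) + ylim(-8, 8) +
  labs(x = "x", y = "y")

The steeper blue line already answers the exercise: the relationship is stronger in group b.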
Problem: predict aggression from hours of playing violent video games and lack of empathy (callous-unemotional traits, CaUnTs):

library(foreign) # for read.spss()
df <- read.spss("Video Games.sav", to.data.frame = TRUE, use.value.labels = TRUE)
lmfit1 <- lm(Aggression ~ Vid_Games, data = df)          # video game hours only
lmfit2 <- lm(Aggression ~ Vid_Games + CaUnTs, data = df) # adds callous-unemotional traits
lmfit3 <- lm(Aggression ~ Vid_Games * CaUnTs, data = df) # adds the interaction
anova(lmfit1, lmfit2, lmfit3)

## Analysis of Variance Table
##
## Model 1: Aggression ~ Vid_Games
## Model 2: Aggression ~ Vid_Games + CaUnTs
## Model 3: Aggression ~ Vid_Games * CaUnTs
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 440 68789
## 2 439 45088 1 23700.2 238.129 < 2.2e-16 ***
## 3 438 43593 1 1495.7 15.028 0.0001221 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Dependent variable: Aggression (all three models).

| Predictors | Model 1: Estimates | CI | p | Model 2: Estimates | CI | p | Model 3: Estimates | CI | p |
|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 34.84 | 30.99 – 38.70 | <0.001 | 21.76 | 18.21 – 25.32 | <0.001 | 33.12 | 26.38 – 39.86 | <0.001 |
| Vid Games | 0.24 | 0.07 – 0.41 | 0.006 | 0.19 | 0.05 – 0.32 | 0.007 | -0.33 | -0.63 – -0.04 | 0.027 |
| CaUnTs |  |  |  | 0.76 | 0.66 – 0.86 | <0.001 | 0.17 | -0.15 – 0.49 | 0.295 |
| Vid Games × CaUnTs |  |  |  |  |  |  | 0.03 | 0.01 – 0.04 | <0.001 |
| Observations | 442 |  |  | 442 |  |  | 442 |  |  |
| R² / R² adjusted | 0.017 / 0.015 |  |  | 0.356 / 0.353 |  |  | 0.377 / 0.373 |  |  |
library(interplot)
interplot(m = lmfit3, var1 = "Vid_Games", var2 = "CaUnTs") +
  xlab("CaUnTs") +
  ylab("Estimated Coefficient for Vid_Games")

Yes, if you have a theory behind it.
Let’s take a look at one example
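Here the slope of x is +1 in group a and -1 in group b, so the two group-level effects cancel out in an additive model. The data shown below could have been generated along these lines (a sketch; the seed, the distribution of x, and the noise level are assumptions read off the fits that follow):

n <- 200
x <- rnorm(n, mean = 0, sd = 10)             # widely spread predictor
cond <- factor(rep(c("a", "b"), each = n/2)) # grouping variable
# assumed data-generating process: slope +1 in group a, -1 in group b
y <- 2 + x - 2 * (cond == "b") - 2 * x * (cond == "b") + rnorm(n, sd = 10)
df <- data.frame(y, x, cond)
head(df)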
## y x cond
## 1 -18.792843 -17.976378 a
## 2 -7.790157 -6.435724 a
## 3 -4.874391 4.676773 a
## 4 -5.218673 0.128643 a
## 5 9.837180 -2.170496 a
## 6 1.052459 13.151498 a
##
## Call:
## lm(formula = y ~ x + cond, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.675 -10.606 0.136 10.384 30.461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.57415 1.38290 1.861 0.0642 .
## x -0.04272 0.10024 -0.426 0.6705
## condb -1.91426 1.95613 -0.979 0.3290
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.83 on 197 degrees of freedom
## Multiple R-squared: 0.005658, Adjusted R-squared: -0.004436
## F-statistic: 0.5605 on 2 and 197 DF, p-value: 0.5718
##
## Call:
## lm(formula = y ~ x * cond, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.4897 -6.4649 0.3047 6.7771 25.2179
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4345 0.9759 2.495 0.0134 *
## x 1.0021 0.1023 9.792 <2e-16 ***
## condb -2.0668 1.3804 -1.497 0.1360
## x:condb -2.0008 0.1416 -14.128 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.758 on 196 degrees of freedom
## Multiple R-squared: 0.5073, Adjusted R-squared: 0.4998
## F-statistic: 67.28 on 3 and 196 DF, p-value: < 2.2e-16
Dependent variable: y.

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 2.43 | 0.51 – 4.36 | 0.013 |
| x | 1.00 | 0.80 – 1.20 | <0.001 |
| cond [b] | -2.07 | -4.79 – 0.66 | 0.136 |
| x × cond [b] | -2.00 | -2.28 – -1.72 | <0.001 |
| Observations | 200 |  |  |
| R² / R² adjusted | 0.507 / 0.500 |  |  |