Rmarkdown: Multiple Regression II

Introduction

Welcome to this R demo session! Today, we’ll delve deeper into regression models that include interaction terms. We’ll be using the readwrite2.csv dataset that contains students’ reading and writing test scores, as well as their proficiency levels. Our main question is: does the relationship between reading and writing scores differ across proficiency levels?

# Setting Up the Environment

# Before we start, let's load the necessary packages and the dataset we'll be working with.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
data_rw <- read.csv("readwrite2.csv")

Exploratory Data Analysis (Recap)

Last week, we noticed that the slopes between reading and writing scores are different across different proficiency levels. Here’s how you can visualize this trend:

ggplot(data_rw, aes(x=reading, y=writing, color=as.factor(level))) +
  geom_point(position = position_jitter(), alpha = .3, size = 2) +
  geom_smooth(method='lm') +
  labs(title="Reading vs Writing Scores by Proficiency Level")

## `geom_smooth()` using formula = 'y ~ x'

Interaction with One Continuous Variable and One Dichotomous Variable

To begin, let’s explore a regression model that includes a dichotomous variable, specifically focusing on proficiency levels 1 and 2 (Novice and Intermediate Low). Below is the regression model without any interaction terms.

data_rw1 <- data_rw %>%
  filter(level == 1 | level == 2) %>%
  mutate(level = case_when(level == 1 ~ 0,
                           TRUE ~ 1))

# Build the regression model without interaction
output1 <- lm(writing ~ reading + level, data = data_rw1)
summary(output1)

## 
## Call:
## lm(formula = writing ~ reading + level, data = data_rw1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7614 -1.5082 -0.0017  1.2750  5.5986 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.02763    0.44791  26.853  < 2e-16 ***
## reading     -0.12662    0.03197  -3.961 8.42e-05 ***
## level        3.27309    0.20205  16.199  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.137 on 565 degrees of freedom
## Multiple R-squared:  0.3253, Adjusted R-squared:  0.3229 
## F-statistic: 136.2 on 2 and 565 DF,  p-value: < 2.2e-16

Now let’s add the interaction.

output2 <- lm(writing ~ reading + level + reading*level, data = data_rw1)
summary(output2)

## 
## Call:
## lm(formula = writing ~ reading + level + reading * level, data = data_rw1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3056 -1.3777 -0.0165  1.4884  5.4787 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   10.18536    0.62295  16.350  < 2e-16 ***
## reading        0.01202    0.04568   0.263    0.792    
## level          7.12242    0.93966   7.580 1.43e-13 ***
## reading:level -0.26445    0.06309  -4.192 3.21e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.106 on 564 degrees of freedom
## Multiple R-squared:  0.3457, Adjusted R-squared:  0.3422 
## F-statistic: 99.34 on 3 and 564 DF,  p-value: < 2.2e-16

There is a shortcut- if you put in two variables with a * , lm interprets that as including all the main effects and the interaction. If there were three variables, such as reading * level * gender, all main effects and all 2-way interactions and the 3-way interaction would be included automatically.

output3 <- lm(writing ~ reading*level, data = data_rw1)
summary(output3)

## 
## Call:
## lm(formula = writing ~ reading * level, data = data_rw1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3056 -1.3777 -0.0165  1.4884  5.4787 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   10.18536    0.62295  16.350  < 2e-16 ***
## reading        0.01202    0.04568   0.263    0.792    
## level          7.12242    0.93966   7.580 1.43e-13 ***
## reading:level -0.26445    0.06309  -4.192 3.21e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.106 on 564 degrees of freedom
## Multiple R-squared:  0.3457, Adjusted R-squared:  0.3422 
## F-statistic: 99.34 on 3 and 564 DF,  p-value: < 2.2e-16

If you want the interaction but not the main effect, use a colon :. But including an interaction without a main effect usually makes no sense.

output4 <- lm(writing ~ reading:level, data = data_rw1)
summary(output4)

## 
## Call:
## lm(formula = writing ~ reading:level, data = data_rw1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7379 -1.6108  0.2003  1.3892  6.5130 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   10.61081    0.14820   71.60   <2e-16 ***
## reading:level  0.15635    0.01181   13.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.272 on 566 degrees of freedom
## Multiple R-squared:  0.2363, Adjusted R-squared:  0.235 
## F-statistic: 175.2 on 1 and 566 DF,  p-value: < 2.2e-16

We have the results, but what do they mean? Start with a plot with the interaction

ggplot(data_rw1, aes(x=reading, y=writing, color=as.factor(level))) +
  geom_point(position = position_jitter(), alpha = .3, size = 2) +
  geom_smooth(method='lm') +
  labs(title="Reading vs Writing Scores by Proficiency Level")

## `geom_smooth()` using formula = 'y ~ x'

Interpretation of the Model Output

Main Effects and Interactions

There is a significant main effect of reading scores (reading) on writing scores; higher values of reading are associated with higher values of writing.
There is a significant main effect of proficiency level (level); moving from one level to the next is associated with a substantial increase in writing scores.
There is a significant interaction effect between reading and proficiency level (reading:level).

Detailed Interpretation

The interaction term (reading:level) suggests that the relationship between reading and writing scores is not consistent across different proficiency levels. Specifically, the effect of reading scores on writing scores varies depending on the proficiency level of the individual.

For a more concrete understanding, let’s plug in some numbers:

Coefficient for the Categorical Variable:

The coefficient for the categorical variable (7.122) should be interpreted as the difference in writing scores between students at two different proficiency levels, specifically when their reading scores are zero.

For Proficiency Level = 0:

The main effect of reading is represented by its coefficient, 0.012 . In this case, for students whose proficiency level = 0, a one-unit increase in reading score is associated with a 0.012-unit increase in writing score.

For Proficiency Level = 1:

To understand the effect of reading when level is 1, we need to consider both the main effect of reading and the interaction term (reading:level = -0.264). So, the overall effect is 0.012+(-0.264) = -0.252. For individuals at proficiency level 1, a one-unit difference in reading is associated with a -0.252-unit difference in writing.

Thus, the model suggests that the impact of reading on writing is different for students at different proficiency level. While for novice students, the effect of reading scores on writing scores is negligible (0.012), for intermediate low students, the effect is negative (-0.252).

Interaction with Two Continuous Variables

Interactions between two continuous predictors are harder to understand but quite common and important. Let’s add an additional continuous variable, motivation to the dataset. Note that the motivation scores have been standardized.

existing_error_term <- residuals(lm(writing ~ reading, data=data_rw1))

data_rw1$motivation <- scale((data_rw1$writing - 0.3 * data_rw1$reading) / (0.2 * data_rw1$reading))

output5 <- lm(writing ~ reading * motivation, data = data_rw1)
summary(output5)

## 
## Call:
## lm(formula = writing ~ reading * motivation, data = data_rw1)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -5.049e-13  1.700e-16  5.700e-16  1.240e-15  1.808e-13 
## 
## Coefficients:
##                     Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)        2.408e-14  6.330e-15 3.805e+00 0.000157 ***
## reading            8.564e-01  4.277e-16 2.002e+15  < 2e-16 ***
## motivation         5.505e-15  1.769e-15 3.112e+00 0.001952 ** 
## reading:motivation 3.310e-01  1.766e-16 1.874e+15  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.296e-14 on 564 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.419e+30 on 3 and 564 DF,  p-value: < 2.2e-16

Interpretation of the Model Output

Main Effects and Interactions

The main effect of reading (reading) on writing is not significant, with a higher reading score associated with a higher writing score indicated by the positive coefficient.
The main effect of motivation (motivation) on writing is not significant at the conventional levels.
A significant interaction effect exists between reading and motivation.

Detailed Interpretation

In the context of interaction effects, it’s often useful to understand how the impact of one variable changes at different levels of another. Considering the effect at the mean and one standard deviation above and below the mean of the moderating variable (motivation in this case) provides a nuanced understanding. This approach gives us a sense of how reading impacts writing for individuals who are at average levels of motivation, as well as for those who are exceptionally motivated or less so.

At Mean Level of Motivation

For motivation at its mean value, which is zero due to standardization, the interaction term will have no effect (motivation * reading = 0). Therefore, the regression coefficient for reading is its main effect, 0.856. This means that when motivation is at the mean, a one-unit increase in reading is associated with a 0.856-unit increase in writing.

Interpreting at Specific Values: Mean + 1SD and Mean - 1SD

For motivation at one standard deviation above the mean (which would be 1 in a standardized variable), the effect of the interaction term would be 0.331. Therefore, the overall effect of reading on writing is 0.856+0.331=1.187. At one standard deviation above the mean in motivation, a one-unit increase in reading would result in a 1.187-unit increase in writing.

Conversely, for motivation at one standard deviation below the mean (which would be -1 in a standardized variable), the interaction term would subtract from the main effect, 0.856−0.331=0.525 At one standard deviation below the mean in motivation, a one-unit increase in reading would result in a 0.525-unit increase in writing.

The key point in all of this: the effect of reading scores on writing scores depends on the level of motivation This is the standard way to understand an interaction.

Interactions can work the other way too. Instead of the effect of reading scores on writing scores depending on the level of motivation, we can consider the effect of motivation on writing scores depending on the level of reading scores This is completely mathematically equivalent. Pick the one that offers the best scientific interpretation.

The package interactions offers a lot of ways to better understand interactions.

# install.packages("interactions")
library(interactions)

interact_plot(output5,pred = reading,modx = motivation)

# pred: The name of the predictor variable involved in the interaction.

# modx: The name of the moderator variable involved in the interaction. 

interact_plot(output5,pred = reading,modx = motivation,plot.points = TRUE)

Looking at specific values of one variable in the context of an interaction is a form of what is called simple slopes analysis. The sim_slopes function in the interactions package does this automatically.

sim_slopes(output5,pred = reading,modx = motivation,johnson_neyman = FALSE) # omit the Johnson-Neyman regions of significance.

## Warning in summary.lm(object, ...): essentially perfect fit: summary may be
## unreliable

## Warning in summary.lm(model): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(model): essentially perfect fit: summary may be
## unreliable

## SIMPLE SLOPES ANALYSIS
## 
## Slope of reading when motivation = -1.00000e+00 (- 1 SD): 
## 
##   Est.   S.E.                t val.      p
## ------ ------ --------------------- ------
##   0.53   0.00   6433019635787841.00   0.00
## 
## Slope of reading when motivation = -1.07074e-15 (Mean): 
## 
##   Est.   S.E.                t val.      p
## ------ ------ --------------------- ------
##   0.86   0.00   1957938854142108.00   0.00
## 
## Slope of reading when motivation =  1.00000e+00 (+ 1 SD): 
## 
##   Est.   S.E.                t val.      p
## ------ ------ --------------------- ------
##   1.19   0.00   1210467956874302.00   0.00

You may observe that the mean and SD for motivation are exactly the same as what you had above.

The sim_slopes function also calculated the statistical significance of the coefficient on reading scores for specific values of motivation Is that awesome? Maybe. Maybe not. You can decide if it’s a valuable way to help you interpret an interaction.

The default in the sim_slopes function is exactly what we did: mean, mean+1SD, mean-1SD. We can also control specific values of motivation to use.

sim_slopes(output5,pred = reading,modx = motivation,
           johnson_neyman = TRUE)

## Warning in summary.lm(object, ...): essentially perfect fit: summary may be
## unreliable

## Warning in summary.lm(model): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(model): essentially perfect fit: summary may be
## unreliable

## JOHNSON-NEYMAN INTERVAL
## 
## When motivation is OUTSIDE the interval [-2.59, -2.59], the slope of
## reading is p < .05.
## 
## Note: The range of observed values of motivation is [-1.58, 14.03]
## 
## SIMPLE SLOPES ANALYSIS
## 
## Slope of reading when motivation = -1.00000e+00 (- 1 SD): 
## 
##   Est.   S.E.                t val.      p
## ------ ------ --------------------- ------
##   0.53   0.00   6433019635787841.00   0.00
## 
## Slope of reading when motivation = -1.07074e-15 (Mean): 
## 
##   Est.   S.E.                t val.      p
## ------ ------ --------------------- ------
##   0.86   0.00   1957938854142108.00   0.00
## 
## Slope of reading when motivation =  1.00000e+00 (+ 1 SD): 
## 
##   Est.   S.E.                t val.      p
## ------ ------ --------------------- ------
##   1.19   0.00   1210467956874302.00   0.00

If the interaction term is not significant, you can also ask for Johnson-Neyman interval, which are the values for movitation for which the regression coefficient of reading scores (reading) is significant (in our case this is not relevant because the interaction term is already significant).

I don’t love this analysis because it puts too much emphasis on statistical significance, but here it is.

Rmarkdown: Multiple Regression II

Wenyue Ma

Introduction

Exploratory Data Analysis (Recap)

Interaction with One Continuous Variable and One Dichotomous Variable

Interpretation of the Model Output

Interaction with Two Continuous Variables

Interpretation of the Model Output