Online Tutorial: UCLAstats
Readings Recommended by Lee for this script:
Set working directory.
setwd("/Users/Danielle/Desktop/working_directory/lab_3")
Load the following libraries. If a package is not yet installed, install it first with, for example, install.packages("package_name")
library("car")
library("data.table")
library("tidyverse")
library("wesanderson")
library("jtools")
library("ggplot2")
library("ggiraphExtra")
library("emmeans")
library("corrgram")
library("Matrix")
In last week’s lab, we explored linear models that contained main effects. In other words, we focused on models that evaluated the distinct impact of two predictor variables on the response. However, it may be the case that the magnitude of the effect of a predictor variable (x1) on the response y is modified by a second predictor variable (x2; some grouping factor). For example, a large difference in y may be observed between two levels of x2 at lower values of x1, but that difference disappears at higher values of x1. Interaction terms explore whether the effect of one predictor on a response variable (Y~X1) changes across levels of a second predictor variable (X2). Visually, interactions can be illustrated in many ways, but examples frequently draw on non-parallel lines in x-y plots (e.g. intersecting lines).
Fig.1 Different Illustrations of an Interaction
What exactly are we trying to evaluate when we ask “Is there an interaction?”
With categorical predictor variables, as is the case with ANOVA, we are asking: “How does one factor affect the ‘performance’ of the other factor in explaining variance in the response y?” Review:
Similarly, in linear regression where at least one predictor is continuous, we are asking: “How does the effect of the continuous predictor variable on the response differ across levels of the other [continuous or categorical] predictor?”.
Note: in this lab we will focus on effect sizes when interpreting the outputs, not null hypothesis testing (e.g. p-values/significance testing). The first section focuses on interpreting the coef output, in other words, understanding the effect sizes as they relate to linear models with interactions. In the second section, you will manipulate interactions through simulations.
Note: This section covers ONLY multiplicative interactions. We will explore non-multiplicative interactions using simulated data in Section 2.
We will use the Salaries data set for the example. Salaries comes with the car/carData packages (loaded above) and has the salary of professors of different rank (assistant professor, associate professor and professor), from different disciplines (two levels: A and B), with binary gender data (male, female), among other variables. Below we explore whether sex modifies the relationship between salary and discipline. In other words, maybe males earn higher salaries in one discipline and females in the other.
mod<-lm(salary~ discipline* sex, data= Salaries)
coef(lm(salary~ discipline* sex, data= Salaries))
## (Intercept) disciplineB sexMale disciplineB:sexMale
## 89064.94 22169.58 21635.04 -14109.19
ggplot(Salaries,aes(x=discipline,y=salary,fill=sex))+
geom_boxplot()
Remember, coef returns the effect sizes (aka differences in means). The interpretation of coef is as follows.
Intercept or β0: the mean salary for females in discipline A. Females in discipline A make on average $89,064.94 (refer to eq. A below).
disciplineB or β1: the effect (or the change in mean salary) for females in discipline B. On average, females make $22,169.58 more in discipline B than they do in discipline A. Mean salary for females in discipline B is $111,234.52 (89,064.94 + 22,169.58) (refer to eq. B below).
sexMale or β2: the effect (or the change in mean salary) for males in discipline A. Males in discipline A make on average $21,635.04 more than females in discipline A. Average salary for males in discipline A is $110,699.98 (89,064.94 + 21,635.04) (refer to eq. C below).
disciplineB:sexMale or β3: the change in the discipline effect if sex is male, OR the change in the sex effect if the individual teaches in discipline B. The mean salary for males in discipline B can be calculated by adding all the coefficients together: $118,760.37 (89,064.94 + 22,169.58 + 21,635.04 - 14,109.19) (refer to eq. D below).
If those interpretations make perfect sense to you, skip ahead to the next section. I like to remind myself what is happening under the hood to get those outputs, and I find it a helpful exercise, especially if you are newer to stats, to write out the mathematical formula. The way R codes a model can sometimes detach us from the mathematical model.
First a quick review of model components:
Now we will create a regression equation that solves for each of the conclusions above.
Calculating Intercept/ Salary for females in discipline A:
Remember, using the lm() function with two categorical variables is the same as performing a 2x2 ANOVA. With categorical variables, R dummy codes the levels, and the structure of the regression formula follows from that coding. Here are the dummy codes relevant to our model above:
discipline: A = 0; B = 1
sex: female = 0; male = 1
How do I know that this is the way R dummy codes these variables? By default, R treats the first level of a factor (alphabetical, unless you reorder the levels) as the reference and codes it 0; the other level gets 1. Also, the coef headings disciplineB and sexMale tell you those two levels were assigned dummy code = 1 for each of those variables.
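If you want to see the dummy coding directly, R will show it to you (a quick sketch; it assumes the Salaries factors are stored with these default level orderings):
# Inspect how R dummy codes the factors: the reference level gets 0
levels(Salaries$discipline)   # "A" "B"  -> A is the reference
levels(Salaries$sex)          # "Female" "Male" -> Female is the reference
contrasts(Salaries$discipline)
contrasts(Salaries$sex)
# The design matrix shows the 0/1 columns the model actually uses
head(model.matrix(~ discipline * sex, data = Salaries))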
Let’s input the coefficients, and dummy code accordingly. If we are interested in Salary for females in discipline A then this would imply the dummy code for discipline=0 and sexM=0. All terms cancel with the exception of the intercept. Yay! It makes sense the intercept represents the mean salary for females in discipline A.
Calculating salary for females in discipline B:
If we are interested in Salary for females in discipline B then this would imply the dummy code for discipline=1 and sexM=0. Only the sex term and interaction term cancel this time. Hence, we add the effect of discipline to the intercept.
Calculating salary for males in discipline A:
Calculating salary for males in discipline B:
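Having worked through the four cases by hand, here is a quick numeric check in R (a minimal sketch; it assumes mod from above is still in memory):
cf <- coef(mod)
cf["(Intercept)"]                          # females, discipline A
cf["(Intercept)"] + cf["disciplineB"]      # females, discipline B
cf["(Intercept)"] + cf["sexMale"]          # males, discipline A
sum(cf)                                    # males, discipline B (all coefficients added)
# Because the model is a fully crossed 2x2 with an interaction, these should
# match the observed group means
aggregate(salary ~ discipline + sex, data = Salaries, FUN = mean)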
In summary, the effect of gender on salary is modified by the discipline to which the professors belong. Male professors generally make higher salaries than female professors (ugh), and this disparity is greater in discipline A.
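Because emmeans is already loaded, one way to probe this categorical interaction is to ask for the sex contrast within each discipline (a hedged sketch, not part of the original script):
# Estimated marginal means for each sex within each discipline
emmeans(mod, ~ sex | discipline)
# Pairwise sex difference within each discipline; the gap should be larger in discipline A
pairs(emmeans(mod, ~ sex | discipline))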
Get data.
autompg = read.table(
"http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data",
quote = "\"",
comment.char = "",
stringsAsFactors = FALSE)
# give the dataframe headers
colnames(autompg) = c("mpg", "cyl", "disp", "hp", "wt", "acc", "year", "origin", "name")
# remove missing data, which is stored as "?"
autompg = subset(autompg, autompg$hp != "?")
# remove the plymouth reliant, as it causes some issues
autompg = subset(autompg, autompg$name != "plymouth reliant")
# give the dataset row names, based on the engine, year and name
rownames(autompg) = paste(autompg$cyl, "cylinder", autompg$year, autompg$name)
# remove the variable for name
autompg = subset(autompg, select = c("mpg", "cyl", "disp", "hp", "wt", "acc", "year", "origin"))
# change horsepower from character to numeric
autompg$hp = as.numeric(autompg$hp)
# create a dummy variable for foreign vs domestic cars. domestic = 1.
autompg$domestic = as.numeric(autompg$origin == 1)
# remove 3 and 5 cylinder cars (which are very rare.)
autompg = autompg[autompg$cyl != 5,]
autompg = autompg[autompg$cyl != 3,]
# the following line would verify the remaining cylinder possibilities are 4, 6, 8
#unique(autompg$cyl)
# change cyl to a factor variable
autompg$cyl = as.factor(autompg$cyl)
In autompg:
mpg: miles per gallon (continuous dependent variable)
disp: engine displacement (continuous predictor variable)
domestic: domestic/foreign (categorical predictor variable)
There are two ways to code the same interaction in R using the lm() function. One uses the `*` operator and the other the `:` operator.
mod1<- lm(mpg ~ disp * domestic, data = autompg)
mod2<- lm(mpg ~ disp + domestic + disp:domestic, data = autompg)
#Yay! outputs yield same results-just as we'd expect.
summary(mod1)
##
## Call:
## lm(formula = mpg ~ disp * domestic, data = autompg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.8332 -2.8956 -0.8332 2.2828 18.7749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.05484 1.80582 25.504 < 2e-16 ***
## disp -0.15692 0.01668 -9.407 < 2e-16 ***
## domestic -12.57547 1.95644 -6.428 3.90e-10 ***
## disp:domestic 0.10252 0.01692 6.060 3.29e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.308 on 379 degrees of freedom
## Multiple R-squared: 0.7011, Adjusted R-squared: 0.6987
## F-statistic: 296.3 on 3 and 379 DF, p-value: < 2.2e-16
summary(mod2)
##
## Call:
## lm(formula = mpg ~ disp + domestic + disp:domestic, data = autompg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.8332 -2.8956 -0.8332 2.2828 18.7749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.05484 1.80582 25.504 < 2e-16 ***
## disp -0.15692 0.01668 -9.407 < 2e-16 ***
## domestic -12.57547 1.95644 -6.428 3.90e-10 ***
## disp:domestic 0.10252 0.01692 6.060 3.29e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.308 on 379 degrees of freedom
## Multiple R-squared: 0.7011, Adjusted R-squared: 0.6987
## F-statistic: 296.3 on 3 and 379 DF, p-value: < 2.2e-16
Let’s look at this visually:
#Define intercepts.
int_for = coef(mod1)[1]
int_dom = coef(mod1)[1] + coef(mod1)[3] #Consider the formula above that we worked out by hand. The calculation for int_dom should make sense now.
#Define slopes
slope_for = coef(mod1)[2]
slope_dom = coef(mod1)[2] + coef(mod1)[4]#Consider the formula above that we worked out by hand. The calculation for slope_dom should make sense now.
#plot the coefficients and slopes using base R functions
plot(mpg ~ disp, data = autompg, col = domestic + 1, pch = domestic + 1, xlim = c(0, 500), ylim = c(0, 50))
abline(int_for, slope_for, col = 1, lty = 1, lwd = 2) # line for foreign cars
abline(int_dom, slope_dom, col = 2, lty = 2, lwd = 2) # line for domestic cars
legend("topright", c("Foreign", "Domestic"), pch = c(1, 2), col = c(1, 2))
Again, since we are emphasizing effect sizes (not significance testing), let’s just look at the coef output.
coef(mod1)
## (Intercept) disp domestic disp:domestic
## 46.0548423 -0.1569239 -12.5754714 0.1025184
Interpreting Model Outputs:
Remember from the ANOVA example: we cannot interpret a main effect without reference to a given level of the other variable. disp = -0.16 is not interpreted as an overall effect on mpg. Instead, to interpret disp = -0.16 we must take into account the two levels of domestic.
Intercept: mean mpg of foreign cars with 0 disp.
disp: effect of disp on mpg for foreign cars.
domestic: the effect of being a domestic car on mean mpg.
disp:domestic: change in the disp effect on mpg if the car is domestic, OR the change in the domestic effect on mpg per unit of disp.
Means and slopes for each group:
Mean mpg of foreign cars with 0 disp = 46.05 mpg.
For foreign cars, average fuel efficiency (mpg) decreases by 0.16 for every one unit increase in disp.
Mean mpg of domestic cars when disp = 0 is 33.48 mpg.
For domestic cars, average fuel efficiency (mpg) decreases by about 0.05 for every one unit change in engine power (disp).
Skip ahead if you can interpret the coef output easily and those interpretations make perfect sense. We will now write out the model like before and show how we came to the conclusions above.
Let’s interpret coefficients by implementing dummy coding.
domestic: 0 = foreign; 1 = domestic
Let’s first consider the distinct effect of engine displacement disp on mpg for foreign cars. To do so, we dummy code domestic with 0. Solving the equation reduces it to mpg = 46.05 + (-0.16*disp). We can now interpret the coefficient estimate for disp in the summary output above as: in foreign cars, the estimated change in average fuel efficiency (mpg) is a reduction of 0.16 for every one unit change in engine power. The mean mpg for foreign cars with no displacement is 46.05.
Now let’s solve for the distinct effect of engine displacement disp on mpg for domestic cars. To do so, we dummy code domestic with 1. Solving the equation, you will see we end up doing the following: β1 + β3 = -0.05 and β0 + β2 = 33.48. We can now interpret the coefficient estimate for disp in the summary output above as: in domestic cars, the estimated change in average fuel efficiency (mpg) is a reduction of about 0.05 for every one unit change in engine power. We also see that the intercept for domestic cars is 33.48.
knitr::include_graphics("model_formula_5.jpeg")
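Before moving on, a quick sanity check (a sketch; it simply refits within each group): fitting separate regressions to the foreign and domestic subsets should reproduce the intercepts and slopes we assembled from the interaction model.
# Foreign cars: expect roughly 46.05 and -0.157
coef(lm(mpg ~ disp, data = subset(autompg, domestic == 0)))
# Domestic cars: expect roughly 33.48 and -0.054
coef(lm(mpg ~ disp, data = subset(autompg, domestic == 1)))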
Summary: Foreign cars have a stronger negative relationship between engine power and miles per gallon than domestic cars. At lower values of displacement (engine power), foreign cars have greater fuel efficiency than domestic cars. However, as displacement increases (above roughly 120), domestic cars have greater fuel efficiency.
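If you want that crossover point exactly, you can solve for where the two fitted lines meet (a small sketch; crossover is a name introduced here):
cf <- coef(mod1)
# The lines cross where the domestic offset is cancelled out by the slope difference
crossover <- -cf["domestic"] / cf["disp:domestic"]
crossover   # roughly 123 displacement units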
Review: Video 1
Check for Understanding
Try fitting and interpreting a model where disp interacts with cyl: lm(mpg~disp*cyl).
Now let’s consider two continuous predictor variables (disp and hp, engine displacement and horsepower) and their effect on miles per gallon.
# hp= horsepower
mod3 <- lm(mpg~ disp * hp, data = autompg)
coef(mod3)
## (Intercept) disp hp disp:hp
## 52.4081997848 -0.1001737655 -0.2198199720 0.0005658269
As we did before, let’s write out the equation (eq. a). First, we can rearrange terms by factoring out x1 (eq. b). Once we input the coef (eq. c), we see that for a one unit increase in x1 (disp), the mean of Y (mpg) changes by the term β1+β3x2. This indicates that the change in mean Y for each unit increase in x1 will differ across values of x2 (hp) (eq. d).
Let’s solve for x2 (eq. e-g). For a one unit increase in x2 (hp), the mean of Y (mpg) increases by the term β2+β3x1 (eq. g). This indicates that the increase in mean Y for each unit increase in x2, will differ across values of x1 (disp).
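Since eq. a-g appear as images in the original document, here is one plain-text reconstruction of the rearrangement described above (the equation lettering is approximate):
Y = β0 + β1*x1 + β2*x2 + β3*x1*x2                          (eq. a)
Y = β0 + (β1 + β3*x2)*x1 + β2*x2                           (eq. b, factor out x1)
mpg = 52.41 + (-0.100 + 0.00057*hp)*disp + (-0.220)*hp     (eq. c-d)
Y = β0 + β1*x1 + (β2 + β3*x1)*x2                           (eq. e-f, factor out x2)
mpg = 52.41 + (-0.100)*disp + (-0.220 + 0.00057*disp)*hp   (eq. g)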
To plot this we can bin the data; you will see an example of this later in this script using simulated data where values are grouped by quantiles. For now, let’s interpret the coef.
coef(mod3)
## (Intercept) disp hp disp:hp
## 52.4081997848 -0.1001737655 -0.2198199720 0.0005658269
Intercept: mean mpg for a car with 0 disp and 0 hp.
disp: change in mpg for an increase of 1 disp, for a car with 0 hp. When hp increases to any value other than zero, the influence of disp on mpg increases by 0.0005 for every level of hp.
hp: change in mpg for an increase of 1 hp, for a car with 0 disp. When disp takes on values above zero, the influence of hp on mpg increases by 0.0005 for every level of disp.
disp:hp: the change in the effect of disp on mpg for each unit increase in hp (or vice versa).
Need Review? Video 1
You may be asking by now: surely there’s an automated way to get those slopes for interactions? Of course there is! But our approach above hopefully encouraged a better understanding. Use emtrends(mod1, ~ domestic, var="disp") for a quick output of the slopes for each level of the modifier variable.
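For example (a hedged sketch; the at= argument is added here because domestic is stored as a 0/1 numeric rather than a factor, and the hp values are arbitrary picks):
# Slope of disp for foreign (0) and domestic (1) cars
emtrends(mod1, ~ domestic, var = "disp", at = list(domestic = c(0, 1)))
# For the continuous-by-continuous model, probe the disp slope at a few hp values
emtrends(mod3, ~ hp, var = "disp", at = list(hp = c(75, 100, 150)))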
This is a function that will come in handy later. It takes a variable in data and breaks into a specified number of quantiles n_breaks. It adds a column break_num to data which is the quantile to which each observation belongs.
split_data <- function(data, n_breaks, variable) {
  # quantile breakpoints from 0 to 1
  breaks <- seq(0, 1, by = 1/n_breaks)
  data_output <- data[ , variable]
  l <- list()
  for (i in 1:length(breaks)) {
    if (i == n_breaks + 1) {
      # the last breakpoint (1) has no interval above it, so skip it
    } else {
      # rows whose value of `variable` falls between consecutive quantiles
      # (note: the strict > means the single minimum observation is dropped)
      temp <- subset(data, data_output > quantile(data_output, breaks[i]) & data_output <= quantile(data_output, breaks[i+1]))
      temp$break_num <- i
      l[[i]] <- temp
    }
  }
  # stack the quantile subsets back into one data.table
  fin_dat <- rbindlist(l)
  return(fin_dat)
}
Main Effects Model - No Interactions
Before we examine interactions, let’s start with a dataset with no interactions, only additive effects. The code below creates that data frame. It is important to remember that we are simulating data; therefore, we are defining what predicts the response variable (y). This is in contrast to when you have data in hand and fit a model to find what predicts y. Here we define y as y <- 5*x1 + -2 * x2 + rnorm(1000, 0, 2). Which is to say, we are simulating y to depend on two main effects, x1 and x2, with an error term (noise) that is normally distributed. Notice that by making our errors normally distributed we are “forcing” our model to meet the assumptions of linear models.
set.seed(335)
# simulate two uncorrelated predictor variables: x1 and x2. If we wanted to simulate correlated variables, we # would do something like this: x2 <- x1 + rnorm(1000, 0, 0.2)
x1 <- rnorm(1000, 0, 1)
x2 <- rnorm(1000, 0, 1)
#generate a response variable that is predicted by x1 and x2, with noise.
y <- 5*x1 + -2 * x2 + rnorm(1000, 0, 2)
#create the dataset
dat <- data.frame(x1 = x1,
x2 = x2,
y = y)
To look for interactions in each data set, we will examine whether the relationship between x1 and y changes for different subsets of the data. Since x2 is currently a continuous variable, we make it categorical using the split_data function, which breaks x2 into 6 parts (because we specified n_breaks = 6) and stores the result in the data frame slopes_plot with a new column break_num. A break_num = 1 represents the lowest values of x2, while break_num = 6 represents the highest values of x2.
slopes_plot <- split_data(data = dat, n_breaks = 6, variable = "x2")
Let’s plot the data. We see that the relationship between x1 and y is independent of x2. We know this because the slope of the line does not change whether looking exclusively at the low values of x2 or the high values of x2. No interactions here.
ggplot(data = slopes_plot) +
geom_point(aes(x1, y, fill = factor(break_num)), pch = 21, color = "black") +
geom_smooth(aes(x1, y, group = factor(break_num)), color = "black", se = F, method = "lm", size = 1.5) +
geom_smooth(aes(x1, y, color = factor(break_num)), se = F, method = "lm", size = 1) +
scale_fill_manual(values = wes_palette("Zissou1", length(unique(slopes_plot$break_num)), type = "continuous")) +
scale_color_manual(values = wes_palette("Zissou1", length(unique(slopes_plot$break_num)), type = "continuous")) +
labs(fill = "x2 quantile", color = "x2 quantile") +
theme_classic()
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
Let’s view this relationship another way by plotting the slope estimates for each x2 quantile. Note here we have n_breaks=20 rather than n_breaks =6 from above. The latter would only leave us with 6 points on the graph.
# `coef_plot_data` is a data frame of slope estimates (aka. effect sizes or coefficients) for the effect of X1 on Y for each group (here, quantile of X2).
coef_plot_data <- split_data(data = dat, n_breaks = 20, variable = "x2")
l <- list()
for (i in 1:20) {
temp_dat <- subset(coef_plot_data, break_num == i)
co <- coef(glm(y ~ x1, data = temp_dat))[[2]]
l[[i]] <- data.frame(x1 = co, x2 = i)
}
coef_plot_data <- rbindlist(l)
ggplot(data = coef_plot_data) +
geom_point(aes(x1, x2), pch = 21, fill = "gray65", color = "black") +
labs(x = "Effect size of x1 on y", y = "Quantile of x2") +
theme_classic()
Here we see how the slope estimate of the line varies by subset of the data. Again, there is no pattern here. In other words, there is no interaction between x1 and x2: the effect of x1 on y is not modified by x2. As we simulated, these predictor variables are independent of each other.
While the plot above shows compelling visual evidence for no relationship, let’s verify using the effect size from a linear model of the x2 quantile on the x1 slope. The effect size is essentially zero relative to its uncertainty (x1 = 0.14 +/- 4.10).
summary(glm(x2 ~ x1, data = coef_plot_data))
##
## Call:
## glm(formula = x2 ~ x1, data = coef_plot_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -9.5378 -4.7934 0.0071 4.6828 9.5506
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.7993 20.6436 0.475 0.641
## x1 0.1394 4.0981 0.034 0.973
##
## (Dispersion parameter for gaussian family taken to be 36.94207)
##
## Null deviance: 665.00 on 19 degrees of freedom
## Residual deviance: 664.96 on 18 degrees of freedom
## AIC: 132.84
##
## Number of Fisher Scoring iterations: 2
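A more direct check (a sketch using the dat we simulated above) is to fit the interaction model itself; the x1:x2 term should be near zero:
# Expect x1 near 5, x2 near -2, and x1:x2 near 0, matching the simulation
coef(lm(y ~ x1 * x2, data = dat))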
Model With Interaction Effect
Now we add a multiplicative interaction, with the predictors still uncorrelated. Same steps as above, but note the interaction added to the simulated response in the term 2*x1*x2.
x1 <- rnorm(1000, 0, 1)
x2 <- rnorm(1000, 0, 1)
y <- 5*x1 + 2 * x2 + 2*x1*x2 + rnorm(1000, 0, 2) #here you can see the interaction (2*x1*x2)
dat <- data.frame(x1 = x1,
x2 = x2,
y = y)
slopes_plot <- split_data(data = dat, n_breaks = 6, variable = "x2")
#Now we see an interaction. As the values of x2 increase, so does the strength
#of the relationship between x1 and y.
ggplot(data = slopes_plot) +
geom_point(aes(x1, y, fill = factor(break_num)), pch = 21, color = "black") +
geom_smooth(aes(x1, y, group = factor(break_num)), color = "black", se = F, method = "lm", size = 1.5) +
geom_smooth(aes(x1, y, color = factor(break_num)), se = F, method = "lm", size = 1) +
scale_fill_manual(values = wes_palette("Zissou1", length(unique(slopes_plot$break_num)), type = "continuous")) +
scale_color_manual(values = wes_palette("Zissou1", length(unique(slopes_plot$break_num)), type = "continuous")) +
labs(fill = "x2", color = "x2") +
theme_classic()
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
We can see here that for different levels of
x2, the effect (or slope) of x1 on y differs.
Now, when we plot the slopes for different quantiles of x2, we should expect to see some relationship. A pattern indicates that the effect of one variable on the response will vary depending on the value of the other predictor.
#Now this coefficient plot might be more interesting. Let's examine the relationship
#between the values of x2 and the slope between x1 and y.
coef_plot_data <- split_data(data = dat, n_breaks = 20, variable = "x2")
l <- list()
for (i in 1:20) {
temp_dat <- subset(coef_plot_data, break_num == i)
co <- coef(glm(y ~ x1, data = temp_dat))[[2]]
l[[i]] <- data.frame(x1 = co, x2 = i)
}
coef_plot_data <- rbindlist(l)
#We see a linear relationship, like we simulated!
ggplot(data = coef_plot_data) +
geom_point(aes(x1, x2), pch = 21, fill = "gray65", color = "black") +
labs(x = "Effect size of x1 on y", y = "Quantile of x2") +
theme_classic()
#Here we see a slope of around 3, not quite what we simulated, but close.
summary(glm(x2 ~ x1, data = coef_plot_data))
##
## Call:
## glm(formula = x2 ~ x1, data = coef_plot_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2497 -0.6891 -0.1719 0.5708 2.5268
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.5404 0.7688 -5.906 1.37e-05 ***
## x1 2.9958 0.1433 20.899 4.50e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.462252)
##
## Null deviance: 665.000 on 19 degrees of freedom
## Residual deviance: 26.321 on 18 degrees of freedom
## AIC: 68.25
##
## Number of Fisher Scoring iterations: 2
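As before, a direct check against the simulation (a sketch using the current dat):
# Expect roughly x1 ~ 5, x2 ~ 2, x1:x2 ~ 2, matching y <- 5*x1 + 2*x2 + 2*x1*x2 + noise
coef(lm(y ~ x1 * x2, data = dat))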
Probing interactions
By evaluating the slopes at different levels of the modifier variable we “probed the interaction”. By binning x2, we can see whether high or low values of x2 are more important in modifying the effect of x1 on y. A novice may limit their excitement to a significant interaction, but it is possible to further explore the pattern of that interaction by asking which levels of the modifier matter most. For example, we could say something like “At high values of the moderator x2 the effect of x1 on the response was very important, but at lower levels not so much”. Hypothesis tests can be applied to assess the statistical significance of the coefficient of the modifier. For some reason the term “probing” is more common in the psychology literature, where it is often referred to more generally as “moderator analysis”. To explore the concept of probing, Lee recommends reading Rogosa 1980 & [Bauer and Curran 2005](https://pubmed.ncbi.nlm.nih.gov/26794689/). For a video explanation using point-and-click software, click here.
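As a minimal sketch of probing in R (mod_int is a name introduced here; dat is the interaction data simulated above), we can ask emmeans for the x1 slope at low, middle, and high values of the moderator:
# Fit the interaction model, then evaluate the x1 slope at the 10th, 50th, and 90th
# percentiles of x2
mod_int <- lm(y ~ x1 * x2, data = dat)
emtrends(mod_int, ~ x2, var = "x1",
         at = list(x2 = unname(quantile(dat$x2, c(0.1, 0.5, 0.9)))))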
What if the interaction is not multiplicative?
Now you will notice that when we simulate the response variable y, the interaction effect is exponential: exp(x1+x2).
x1 <- rnorm(1000, 0, 1)
x2 <- rnorm(1000, 0, 1)
#Here I simulate an exponential interaction
y <- 5*x1 + 2 * x2 + exp(x1+x2) + rnorm(1000, 0, 2)
dat <- data.frame(x1 = x1,
x2 = x2,
y = y)
slope_dat <- split_data(data = dat, n_breaks = 6, variable = "x2")
ggplot(data = slope_dat) +
geom_point(aes(x1, y, fill = factor(break_num)), pch = 21, color = "black") +
geom_smooth(aes(x1, y, group = factor(break_num)), color = "black", se = F, method = "lm", size = 1.5) +
geom_smooth(aes(x1, y, color = factor(break_num)), se = F, method = "lm", size = 1) +
scale_fill_manual(values = wes_palette("Zissou1", length(unique(slope_dat$break_num)), type = "continuous")) +
scale_color_manual(values = wes_palette("Zissou1", length(unique(slope_dat$break_num)), type = "continuous")) +
labs(fill = "x2 quantile", color = "x2 quantile") +
theme_classic()
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
Interesting: for the highest level of x2 (red points) we see what looks like a non-linear increase at higher values of x1. Let’s look at the plot of coefficients.
coef_plot_data <- split_data(data = dat, n_breaks = 20, variable = "x2")
l <- list()
for (i in 1:20) {
temp_dat <- subset(coef_plot_data, break_num == i)
co <- coef(glm(y ~ x1, data = temp_dat))[[2]]
l[[i]] <- data.frame(x1 = co, x2 = i)
}
coef_plot_data <- rbindlist(l)
#Here we see that the effect of x1 on y explodes at high values of x2.
ggplot(data = coef_plot_data) +
geom_point(aes(x1, x2), pch = 21, fill = "gray65", color = "black") +
labs(x = "Effect size of x1 on y", y = "Quantile of x2") +
theme_classic()
coef(glm(x2 ~ x1, data = coef_plot_data))
## (Intercept) x1
## 1.037603 1.213769
Very cool! Much clearer. For an exponential interaction, the effect of x1 on y is modified by x2, and this effect is very strong at higher values of x2.
Does the exponential model fit y better? Let’s compare some models using AIC. Model 1 uses typical interaction coding; notice that the slope is close to the slope we got a couple of lines up. Basically this model draws a straight line through that exponential relationship.
mod1 <- glm(y ~ x1 + x2 + x1*x2, data = dat)
summary(mod1)
##
## Call:
## glm(formula = y ~ x1 + x2 + x1 * x2, data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -9.806 -2.403 -0.548 1.594 66.070
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.8241 0.1561 18.09 <2e-16 ***
## x1 7.8708 0.1573 50.03 <2e-16 ***
## x2 4.7530 0.1589 29.92 <2e-16 ***
## x1:x2 3.1270 0.1587 19.70 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 24.3738)
##
## Null deviance: 120948 on 999 degrees of freedom
## Residual deviance: 24276 on 996 degrees of freedom
## AIC: 6037.4
##
## Number of Fisher Scoring iterations: 2
#model 2 adds an exponential interaction term, exp(x1*x2) (note: we simulated exp(x1+x2), so this is a similar but not identical form)
mod2 <- glm(y ~ x1 + x2 + exp(x1*x2), data = dat)
summary(mod2)
##
## Call:
## glm(formula = y ~ x1 + x2 + exp(x1 * x2), data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -27.3037 -1.8098 -0.3427 1.6274 29.8417
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.64514 0.11323 14.53 <2e-16 ***
## x1 7.44491 0.11121 66.95 <2e-16 ***
## x2 4.54927 0.11177 40.70 <2e-16 ***
## exp(x1 * x2) 0.56825 0.01336 42.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 12.02502)
##
## Null deviance: 120948 on 999 degrees of freedom
## Residual deviance: 11977 on 996 degrees of freedom
## AIC: 5330.9
##
## Number of Fisher Scoring iterations: 2
#I know I have said that an AIC difference of 2 is pretty arbitrary, but a difference of
#roughly 700 is pretty compelling.
AIC(mod1, mod2)
## df AIC
## mod1 5 6037.378
## mod2 5 5330.858
What if the interactive variables are correlated?
Remember: interaction terms are not incorporated into models because you suspect that the two predictors are correlated. They are incorporated because you suspect that one variable modifies the other variable’s effect on y. Nevertheless, to satisfy curiosity, let’s generate a response variable that is predicted by correlated variables x1 and x2, and see how models with and without the interaction compare.
Generate correlated variable.
#correlated predictors
x1 <- rnorm(1000, 0, 1)
x2 <- x1 + rnorm(1000, 0, 0.2) # here we correlate x2 to x1
cor(x1, x2) #Cool! We successfully correlated x2 to x1
## [1] 0.9804142
Generate a response variable that is predicted by x1 and x2, no interaction.
y <- 5*x1 + -2 * x2 + rnorm(1000, 0, 2)
We can use the VIF (variance inflation factor) to detect multicollinearity, which we expect in this case because we correlated the simulated variables. The value tells us by how much the variance of a coefficient estimate is inflated because of multicollinearity. 1 is the lowest value and means the variance is the same whether or not the other predictors are in the model (absence of multicollinearity). Higher values indicate variables are correlated. What counts as high? Some say values > 5.
mod1 <- glm(y ~ x1 + x2)
vif(mod1)
## x1 x2
## 25.78121 25.78121
summary(mod1)
##
## Call:
## glm(formula = y ~ x1 + x2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.3850 -1.3645 0.0228 1.2694 6.7666
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.006565 0.063611 -0.103 0.918
## x1 5.351415 0.320067 16.720 < 2e-16 ***
## x2 -2.273416 0.314393 -7.231 9.53e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 4.044044)
##
## Null deviance: 13912.6 on 999 degrees of freedom
## Residual deviance: 4031.9 on 997 degrees of freedom
## AIC: 4240.1
##
## Number of Fisher Scoring iterations: 2
Model y with an interaction
y <- 5*x1 + -2 * x2 + 2*x1*x2 + rnorm(1000, 0, 2)
mod2 <- glm(y ~ x1 + x2 + x1*x2)
vif(mod2)
## x1 x2 x1:x2
## 25.784239 25.787843 1.000873
summary(mod2)
##
## Call:
## glm(formula = y ~ x1 + x2 + x1 * x2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.1678 -1.3044 -0.0509 1.2498 5.7364
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.007115 0.074566 -0.095 0.924
## x1 4.819437 0.308132 15.641 < 2e-16 ***
## x2 -1.788074 0.302691 -5.907 4.77e-09 ***
## x1:x2 1.994971 0.041897 47.616 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 3.747628)
##
## Null deviance: 22149.6 on 999 degrees of freedom
## Residual deviance: 3732.6 on 996 degrees of freedom
## AIC: 4165
##
## Number of Fisher Scoring iterations: 2
Whether the model has an interaction term or not, multicollinearity is detected between the two variables. The interaction term itself, however, is not flagged (its VIF is close to 1).
First we will simulate data. Your job later will be to play around and modify the values of the coefficients, the equations for simulated responses and predictors to see how they change the interpretation.
Let’s first set the coefficients. We are establishing the raw effect sizes here.
alpha = 1
beta1 = 2
beta2 = -1
beta3 = -2
beta4 = 1
Second, let’s generate covariates and some error terms that will be used to define (or “cause”) the various response and predictor variables in the next code chunk (e.g. abu1, temp, drought). You will notice that covariates A (two levels) and B (two levels) will serve to generate a dataset with a traditional ANOVA design.
#Factorial variables
A = c(rep(c(0), 500), rep(c(1), 500)) #### '0' 500 times, '1' 500 times
B = rep(c(rep(c(0), 250), rep(c(1), 250)), 2) #### '0'x250, '1'x250, '0'x250, '1'x250
#Error variables
e1= rnorm(1000, 10, sd=2)
e2=rnorm(1000, 5, sd=1)
e3=rnorm(1000, 1, sd=0.5)
#Interaction variables
INTerr = A*B + e1
INT=A*B #multiplicative interaction
INTadd=(A+1)+(B*3) #additive interaction
e = rnorm(1000, 10, sd=2)
Generate response and predictor variables. The code chunk below generates all sorts of variables so that later you can modify them to test all sorts of questions.
# Simulated Continuous PREDICTOR Variables
temp = A + e1 #simulating a temperature variable
precip = B + e2 #simulating a precipitation variable
interact = temp*precip ##simulating an interaction variable
drought = temp + precip + e3 #Note this is also an additive interaction if added to one of the response variables
#Simulated RESPONSE Variables
abu1 = alpha + beta1*A + beta2*B + beta3*A*B + e #`abu1` models abundance on the two categorical predictors and their interaction
abu2 = alpha + beta1*temp + beta2*precip + beta4*temp*precip + e #`abu2` models abundance on the two observed continuous predictors `precip` and `temp` and their interaction
z_abu1= (abu1 - mean(abu1)) / sd(abu1)##Generate a standardized abundance variable from `abu1` above
z_abu2= (abu2 - mean(abu2)) / sd(abu2)#standardized version of `abu2` above
mortal = beta2*abu2 + e # mortality is modeled on `abu2`
z_mortal = (mortal - mean(mortal)) / sd(mortal)#standardized response variable of `mortal`
F <- ifelse(A==0, abu1+2 , abu1-2) #this is intended to represent a correlated predictor variable. It's correlated with `abu1`
Put all the variables we created into a single data frame dat, which holds both the raw and the standardized values.
#data
dat<- data.frame(cbind(A, B, temp, precip, drought, INTerr, INT, INTadd, abu1, abu2, z_abu1, z_abu2, mortal, z_mortal, interact))
Exploring Two Categorical Predictors
Let’s now apply our simulated data. We will start by modeling the standardized response variable z_abu1. Can we estimate our original linear model parameters? With the raw response abu1 we should recover them almost exactly, no matter how much you mess with the beta coefficients we established; with the standardized response the coefficients are all rescaled by the same factor (1/sd(abu1)), so their signs and relative sizes should still match what we set. Recall the coefficient values above.
mod<-lm(z_abu1 ~ A + B + A*B , data = dat)
summary(mod)
##
## Call:
## lm(formula = z_abu1 ~ A + B + A * B, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9689 -0.5854 -0.0120 0.5551 3.1638
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.002668 0.054882 -0.049 0.961
## A 0.817682 0.077614 10.535 < 2e-16 ***
## B -0.373197 0.077614 -4.808 1.76e-06 ***
## A:B -0.878299 0.109763 -8.002 3.39e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8678 on 996 degrees of freedom
## Multiple R-squared: 0.2493, Adjusted R-squared: 0.247
## F-statistic: 110.2 on 3 and 996 DF, p-value: < 2.2e-16
ggplot(dat, aes(x=as.character(A), y=z_abu1, fill=as.character(B))) + geom_boxplot()
ggPredict(mod,se=TRUE,interactive=TRUE) ##if "ggPredict" is not loading into library on Mac OS verify homebrew is downloaded, type 'brew install cairo' into terminal, Then, if error with Rccp, in R studio type'install.packages("Rcpp", repos="RcppCore.github.io/drat")'
Yep, as expected:
Explore Two Continuous Predictors
mod<- lm(abu2 ~ precip + temp + precip*temp, data = dat)
ggPredict(mod,se=TRUE,interactive=TRUE)
Check for Understanding
Try modifying the coefficients and rerunning the code. Do you recover the values you set? Now try switching between the raw response abu1 and the standardized response z_abu1. Do the coefficients stay the same?
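One way to check this yourself (a sketch; it assumes the dat data frame built above):
# Raw response: the slope terms recover the betas we set (the intercept also
# absorbs the mean of the error term)
coef(lm(abu1 ~ A * B, data = dat))
# Standardized response: same pattern, rescaled
coef(lm(z_abu1 ~ A * B, data = dat))
# The slope terms of the z_abu1 model equal the abu1 slopes divided by sd(abu1)
coef(lm(abu1 ~ A * B, data = dat))[-1] / sd(dat$abu1)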
For insight on how we detect (or fail to detect) interactions you will now modify abu2, develop models and compare outputs for model fits and the coefficients. Substitute the abu2 variable in the code above with these options (non-multiplicative interactions and no main effects) and rerun code:
abu2 = alpha + beta1*temp + beta2*precip + beta3*log(temp+1)*precip + e
abu2 = alpha + beta1*temp + beta2*temp + beta3*(log(temp+1)+precip) + e
abu2 = alpha + beta1*(temp+precip) + e
abu2 = alpha + beta1*(temp*precip) + e
For example, let’s say I reran abun = alpha + beta1*temp + beta2*precip + beta3*log(temp+1)*precip + e. I would then compare summary(mod1) and summary(mod2) for models like:
mod1<-lm(abun ~ temp + precip + log(temp+1)*precip , data = dat)
mod2<-lm(abun ~ temp + precip + temp*precip , data = dat)
mod1<-aov(z_abu1 ~ A + B + INT, data=dat) # Fit an ANOVA with interaction (INT)
summary(mod1)
## Df Sum Sq Mean Sq F value Pr(>F)
## A 1 35.8 35.82 47.57 9.41e-12 ***
## B 1 165.0 164.98 219.09 < 2e-16 ***
## INT 1 48.2 48.21 64.03 3.39e-15 ***
## Residuals 996 750.0 0.75
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mod2 <-aov(z_abu1 ~ A*B, data = dat) #Fit an ANOVA with interaction (A*B)
summary(mod2)
## Df Sum Sq Mean Sq F value Pr(>F)
## A 1 35.8 35.82 47.57 9.41e-12 ***
## B 1 165.0 164.98 219.09 < 2e-16 ***
## A:B 1 48.2 48.21 64.03 3.39e-15 ***
## Residuals 996 750.0 0.75
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mod2? d) Does including an interaction with a correlated variable reduce its effect on the response?
mod1 <- aov(z_abu1 ~ A + B + F, data=dat)#Fit an ANOVA with 3 variables, no interaction
summary(mod1)
## Df Sum Sq Mean Sq F value Pr(>F)
## A 1 35.8 35.8 3.457e+29 <2e-16 ***
## B 1 165.0 165.0 1.592e+30 <2e-16 ***
## F 1 798.2 798.2 7.702e+30 <2e-16 ***
## Residuals 996 0.0 0.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mod2<-aov(z_abu1 ~ A + B + F + A*B + A*F, data=dat)#Fit an ANOVA with 3 variables, and some interactions
summary(mod2)
## Df Sum Sq Mean Sq F value Pr(>F)
## A 1 35.8 35.8 3.458e+29 <2e-16 ***
## B 1 165.0 165.0 1.593e+30 <2e-16 ***
## F 1 798.2 798.2 7.706e+30 <2e-16 ***
## A:B 1 0.0 0.0 5.630e-01 0.453
## A:F 1 0.0 0.0 1.973e+00 0.160
## Residuals 994 0.0 0.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mod3<-aov(z_abu1 ~ A + B + F + A*F, data=dat)#Fit an ANOVA with 3 variables, and some interactions
summary(mod3)
## Df Sum Sq Mean Sq F value Pr(>F)
## A 1 35.8 35.8 3.456e+29 <2e-16 ***
## B 1 165.0 165.0 1.592e+30 <2e-16 ***
## F 1 798.2 798.2 7.702e+30 <2e-16 ***
## A:F 1 0.0 0.0 9.590e-01 0.328
## Residuals 995 0.0 0.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mod1 <- lm(z_abu2 ~ temp + precip+ INT, data=dat)
summary(mod1)
##
## Call:
## lm(formula = z_abu2 ~ temp + precip + INT, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72822 -0.09407 0.00329 0.09538 0.90771
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.903379 0.037596 -183.622 < 2e-16 ***
## temp 0.391672 0.002512 155.935 < 2e-16 ***
## precip 0.499034 0.004620 108.019 < 2e-16 ***
## INT 0.033387 0.012471 2.677 0.00754 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1615 on 996 degrees of freedom
## Multiple R-squared: 0.974, Adjusted R-squared: 0.9739
## F-statistic: 1.243e+04 on 3 and 996 DF, p-value: < 2.2e-16
mod2 <- lm(abu2 ~ temp + precip + INT, data=dat)
summary(mod2)
##
## Call:
## lm(formula = abu2 ~ temp + precip + INT, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.7743 -1.7793 0.0621 1.8041 17.1693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -45.61741 0.71112 -64.149 < 2e-16 ***
## temp 7.40846 0.04751 155.935 < 2e-16 ***
## precip 9.43921 0.08738 108.019 < 2e-16 ***
## INT 0.63152 0.23588 2.677 0.00754 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.055 on 996 degrees of freedom
## Multiple R-squared: 0.974, Adjusted R-squared: 0.9739
## F-statistic: 1.243e+04 on 3 and 996 DF, p-value: < 2.2e-16
mod1<-lm(mpg~disp*as.factor(cyl), data=autompg)
coef(mod1)
## (Intercept) disp as.factor(cyl)6
## 43.59051971 -0.13069347 -13.20026081
## as.factor(cyl)8 disp:as.factor(cyl)6 disp:as.factor(cyl)8
## -20.85706424 0.08298924 0.10817135
cs<-coef(mod1)
ggplot(autompg, aes(x=disp, y=mpg, colour=as.factor(cyl))) +
scale_x_continuous(limits = c(0, 500)) +
scale_y_continuous(limits = c(0, 50)) +
geom_point() +
geom_abline(intercept = cs[1], slope = cs[2], colour="#F8766D") +
geom_abline(intercept = cs[1] + cs[3], slope = cs[2] + cs[5],colour="#00BA38")+
geom_abline(intercept = cs[1] + cs[4], slope = cs[2] + cs[6],colour="#619CFF")
Solve using formula:
Effect of displacement on mpg in 4-cylinder cars. Interpretation: a 4-cylinder car with zero displacement has an average of 43.59 mpg. In a 4-cylinder car, the change in mpg for a 1-unit increase in engine displacement is -0.13.
Effect of displacement on mpg in 6-cylinder cars. Interpretation: a 6-cylinder car with zero displacement has an average of 30.39 mpg. In a 6-cylinder car, the change in mpg for a 1-unit increase in engine displacement is -0.05.
Effect of displacement on mpg in 8-cylinder cars. Interpretation: an 8-cylinder car with zero displacement has an average of 22.74 mpg. In an 8-cylinder car, the change in mpg for a 1-unit increase in engine displacement is -0.02.
2-6 I will upload answers to these later today