Regression models with interactions between categorical and continuous variables
Using the Framingham Heart Study data set, estimate a regression model using Cholesterol as the response variable with explanatory variables of sex/gender and age. Answer the following questions in your analysis.
Did you transform any of your variables? What did you do and why?
I created an indicator variable, female, coded 0 for males and 1 for females. This makes it easier to compare those in the reference group (males) with those not in the reference group (females).
I also mean-centered the three continuous independent variables (age, sbp, and dbp). This allows us to have an interpretable y-intercept.
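A minimal sketch of these two transformations, using a small made-up data frame in place of the Framingham data. The column names sex and age and the labels "male"/"female" are assumptions for illustration; the actual data set may code them differently.

```r
# Toy stand-in for the Framingham data (values are invented for illustration)
toy <- data.frame(
  sex = c("male", "female", "female", "male"),
  age = c(40, 50, 60, 50)
)

# Indicator variable: 0 = male (reference group), 1 = female
toy$female <- ifelse(toy$sex == "female", 1, 0)

# Mean-center age so that 0 corresponds to the sample mean age
toy$age_centered <- toy$age - mean(toy$age)

toy
```

After centering, the mean of age_centered is 0, so the intercept of a model that uses it refers to a person of average age rather than a person aged 0.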
Fit the model with Cholesterol as the response variable with sex/gender and age as the explanatory variables. Describe the fit of the model and interpret the coefficients.
Before we build our model, let’s visualize the data first. Based on the plot below, it looks like women have higher cholesterol, on average, holding age constant.
library(dplyr)    # %>%
library(broom)    # augment_columns()
library(ggplot2)

# fit model 1 and plot its fitted lines over the raw data
m1 <- lm(cholesterol ~ female + age_centered, data = framingham)

augment_columns(m1, framingham) %>%
  ggplot(aes(age_centered, cholesterol, color = sex)) +
  geom_point() +
  geom_line(aes(y = .fitted), size = 1) +
  # stat_summary(fun = mean, geom = "crossbar") +
  labs(
    x = "Age (centered)",
    y = "Cholesterol",
    title = "What is the relationship between cholesterol and age/gender?"
  )
Now, let’s model our data to see if these differences are statistically significant.
# get model coefficients
m1 %>% summary()
Call:
lm(formula = cholesterol ~ female + age_centered, data = framingham)
Residuals:
Min 1Q Median 3Q Max
-133.434 -30.319 -3.914 27.980 186.028
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 225.8398 1.7552 128.673 < 2e-16 ***
female 16.9053 2.4243 6.973 4.75e-12 ***
age_centered 0.7891 0.2531 3.117 0.00186 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 45.4 on 1403 degrees of freedom
Multiple R-squared: 0.03965, Adjusted R-squared: 0.03828
F-statistic: 28.96 on 2 and 1403 DF, p-value: 4.736e-13
# extract model coefficients and use the values to write in-line code
m1_coefficients <- m1$coefficients %>% round(2)
# m1_coefficients
Model 1 has an adjusted R-squared of 0.04, meaning that the model accounts for about 4% of the variation in our dependent variable, cholesterol. The model is a statistically significant improvement over an intercept-only model that predicts the mean for everyone, F(2, 1403) = 28.96, p < 0.001.
The y-intercept for model 1 is 225.84. Because we mean-centered age, this is the expected cholesterol for a male of the mean age in the data set.
The coefficient for female is 16.91. Women are expected to have cholesterol 16.91 units higher than men of the same age.
The coefficient for age is 0.79. This is the expected increase in cholesterol for every 1 year increase in age for people of the same sex.
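To make these interpretations concrete, here is a small sketch that plugs the reported model 1 estimates into the regression equation by hand. The helper predict_m1 is hypothetical and not part of the original analysis; with the fitted object available, predict(m1, newdata = ...) would serve the same purpose.

```r
# Coefficients as reported in the model 1 summary above
b0 <- 225.8398       # intercept: male at the mean age
b_female <- 16.9053  # shift for females
b_age <- 0.7891      # slope per year of (centered) age

# Hypothetical helper: expected cholesterol under model 1
predict_m1 <- function(female, age_centered) {
  b0 + b_female * female + b_age * age_centered
}

predict_m1(female = 0, age_centered = 0)   # male at the mean age
predict_m1(female = 1, age_centered = 0)   # female at the mean age
predict_m1(female = 1, age_centered = 10)  # female 10 years above the mean age
```

Because model 1 has no interaction, the gap between the male and female predictions is the same 16.91 units at every age.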
Add an interaction between sex/gender and age to the model. Describe the fit of this model and interpret the coefficients. Hint: you should write down the separate regression equations for males and females.
Before we build model 2, let’s visualize the data just like we did for model 1.
library(ggpubr)  # stat_regline_equation()

# separate least-squares lines (with equations) for each sex
ggplot(framingham, aes(age_centered, cholesterol, color = sex)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  stat_regline_equation() +
  labs(
    x = "Age (centered)",
    y = "Cholesterol",
    title = "Does age affect cholesterol differently for men vs women?"
  )
# fit model 2 with the sex-by-age interaction and get model coefficients
m2 <- lm(cholesterol ~ female + age_centered + female:age_centered, data = framingham)
m2 %>% summary()
Call:
lm(formula = cholesterol ~ female + age_centered + female:age_centered,
data = framingham)
Residuals:
Min 1Q Median 3Q Max
-128.645 -30.394 -3.645 28.012 184.408
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 225.8925 1.7453 129.431 < 2e-16 ***
female 16.8982 2.4106 7.010 3.69e-12 ***
age_centered -0.2738 0.3603 -0.760 0.447
female:age_centered 2.0757 0.5035 4.122 3.97e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 45.14 on 1402 degrees of freedom
Multiple R-squared: 0.05115, Adjusted R-squared: 0.04912
F-statistic: 25.19 on 3 and 1402 DF, p-value: 7.116e-16
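Model 2 produced an adjusted R-squared of 0.05, which means this model accounts for about 5% of the variation in the response variable. The overall model is statistically significant, F(3, 1402) = 25.19, p < 0.001.
The y-intercept for model 2 is 225.89. This is the cholesterol we would expect for a male of the mean age in the data set.
The coefficient for female is 16.90. Because the model includes an interaction, this is the expected difference in cholesterol between women and men at the mean age (age_centered = 0), not at every age.
The coefficient for age is -0.27. This is the expected change in cholesterol for every 1 year increase in age for men (female = 0). It is not statistically significant (p = 0.447), so we have no evidence that cholesterol changes with age among men.
Lastly, 2.08 is the coefficient for the interaction. This is the difference between the age slopes for women and men: for every 1 year increase in age, women's expected cholesterol rises by 2.08 units more than men's. The age slope for women is therefore -0.2738 + 2.0757, or about 1.80 units per year.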
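The separate regression equations that the hint asks for follow from setting female to 0 and then to 1 in the fitted model. This is a worked sketch using the estimates from the summary output above, with age_c standing in for age_centered:

```latex
\begin{align*}
\widehat{\text{cholesterol}} &= 225.8925 + 16.8982\,\text{female} - 0.2738\,\text{age}_c + 2.0757\,(\text{female} \times \text{age}_c) \\[4pt]
\text{Males (female} = 0\text{):} \quad \widehat{\text{cholesterol}} &= 225.8925 - 0.2738\,\text{age}_c \\[4pt]
\text{Females (female} = 1\text{):} \quad \widehat{\text{cholesterol}} &= (225.8925 + 16.8982) + (-0.2738 + 2.0757)\,\text{age}_c \\
&= 242.7907 + 1.8019\,\text{age}_c
\end{align*}
```

The two lines differ in both intercept and slope: the female coefficient shifts the intercept, and the interaction coefficient shifts the slope.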
Compare the results from 2 and 3. What do you conclude?
anova(m1, m2)
Analysis of Variance Table
Model 1: cholesterol ~ female + age_centered
Model 2: cholesterol ~ female + age_centered + female:age_centered
Res.Df RSS Df Sum of Sq F Pr(>F)
1 1403 2891283
2 1402 2856661 1 34622 16.992 3.974e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
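The F statistic in this table can be reproduced by hand from the two residual sums of squares. The sketch below uses only the values printed in the table; the variable names are my own.

```r
# Values copied from the anova() table above
rss_m1 <- 2891283  # residual sum of squares, model 1
rss_m2 <- 2856661  # residual sum of squares, model 2
df_m1 <- 1403      # residual degrees of freedom, model 1
df_m2 <- 1402      # residual degrees of freedom, model 2

# F = (drop in RSS per added parameter) / (residual variance of the larger model)
f_stat <- ((rss_m1 - rss_m2) / (df_m1 - df_m2)) / (rss_m2 / df_m2)
p_value <- pf(f_stat, df_m1 - df_m2, df_m2, lower.tail = FALSE)

round(f_stat, 3)
p_value
```

Because model 2 adds a single term, this F statistic equals the square of the t value for the interaction coefficient (4.122 squared is about 16.99), so the two tests agree.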
modelsummary(list(m1, m2), stars = T)
                         Model 1       Model 2
(Intercept)              225.840***    225.892***
                         (1.755)       (1.745)
female                   16.905***     16.898***
                         (2.424)       (2.411)
age_centered             0.789**       −0.274
                         (0.253)       (0.360)
female × age_centered                  2.076***
                                       (0.504)
Num.Obs.                 1406          1406
R2                       0.040         0.051
R2 Adj.                  0.038         0.049
AIC                      14724.0       14709.1
BIC                      14745.0       14735.3
Log.Lik.                 −7358.008     −7349.540
RMSE                     45.35         45.08
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Model 2 fits the data significantly better than model 1, F(1, 1402) = 16.99, p < 0.001, and its adjusted R-squared is higher (0.049 vs. 0.038). AIC and BIC are also lower for model 2. This means that model 2 with the interaction is a better model than model 1. The y-intercepts, about 226, are basically the same for both models. This is the expected value of cholesterol for an average-aged male.
The coefficient for female in model 2 is still statistically significant even with the interaction added. This means that, at the mean age, we can expect females to have cholesterol that is, on average, 17 units higher than males.
The coefficient for age is no longer statistically significant in model 2. In model 1, the age coefficient was a single slope shared by men and women; in model 2, it is the slope for men only. This suggests that the positive age effect we picked up in the first model was driven by women, which the interaction term now captures.
Thus, there is a statistically significant interaction between age and sex such that the effect of age on cholesterol is higher for women than for men.
Computing odds and log-odds ratios
Below is a table showing the frequencies for a binary variable, Likely to go to college. The variable takes the value 0 when the respondent indicates a less than 50% chance of going to college, and a value of 1 when the respondent indicates a high likelihood of going to college.
Using the Valid Percent (as outlined by the blue box), compute the odds of going to college in this sample.
Using the Valid Percent (as outlined by the blue box), compute the log(odds) of going to college. Note that we use the natural log. You can use a calculator or Excel’s function ln.
# Note to self: if success and failure are equally likely (50/50), the odds are 1.
# Also, the log odds of 1 is 0 [log(1) = 0].
# Therefore, if your log odds is positive, the chances of success are greater than the chances of failure.
# And if your log odds is negative, the chances of failure are greater than the chances of success.
log(2.6)
[1] 0.9555114
# Note: when the log(odds) is 0, the odds are 1, so success and failure are equally likely.
Using the Valid Percent (as outlined by the blue box), compute the odds of not going to college in this sample.
Using the Valid Percent (as outlined by the blue box), compute the log(odds) of not going to college. Note that we use the natural log. You can use a calculator or Excel’s function ln.
log(.39)
[1] -0.9416085
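The two log-odds above are nearly mirror images of each other, which is expected: the odds of not going are the reciprocal of the odds of going, so their logs differ only in sign. A short sketch, starting from the odds of 2.6 used above (the variable names are my own):

```r
odds_college <- 2.6  # odds of going to college, as used above

# Odds of the complementary outcome are the reciprocal
odds_not_college <- 1 / odds_college

# Implied probability of going to college: odds / (1 + odds)
p_college <- odds_college / (1 + odds_college)

log(odds_college)      # log-odds of going
log(odds_not_college)  # log-odds of not going: same magnitude, opposite sign
```

The small gap between 0.96 and 0.94 in the results above likely reflects rounding the odds to 2.6 and .39 before taking logs.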
Interpreting the odds and log-odds
Suppose a study reports an odds of 1.2 for a survey that asked respondents if they would vote by mail. (Voting by mail is the “success” in this scenario). What is more likely in this survey – voting by mail or not voting by mail?
Voting by mail is more likely because the odds are 1.2. Odds of 1 would mean there’s an equal chance of success and failure. Since 1.2 > 1, the chances of success are greater than the chances of failure.
Suppose a study reports odds of 0.3 for whether college freshmen plan to return to campus. (Returning to campus is the “success” in this study). What result is more likely in this survey – college freshmen returning to campus or not?
Not returning to campus is more likely. We know this because the odds of returning to campus are 0.3, which is less than 1. Odds of less than 1 mean failure is more likely than success.
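One way to check answers like these is to convert the odds into a probability of success with p = odds / (1 + odds). A minimal sketch using the two odds from the questions above (the vector names are my own):

```r
odds <- c(vote_by_mail = 1.2, return_to_campus = 0.3)

# Probability of "success" implied by each odds value
prob_success <- odds / (1 + odds)

round(prob_success, 3)
prob_success > 0.5  # TRUE when success is the more likely outcome
```

Odds above 1 always give a probability above 0.5, and odds below 1 always give a probability below 0.5.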
Suppose a study reports a log-odds of -0.7 for whether respondents plan to travel for Thanksgiving this year. What result is more likely in this survey – that respondents plan to travel or not?
Because the log odds are negative (-0.7), this means the odds must be less than 1. When the odds are less than 1, that means failure is more likely than success. Therefore, not traveling is more likely.
Suppose a study reports a log-odds of 0.8 for whether respondents plan to get the flu vaccine this year. What result is more likely in this study – to get vaccinated for the flu or not to get vaccinated?
Because the log odds are positive (0.8), this means the odds must be greater than 1. When the odds are greater than 1, this means success is more likely than failure. Therefore, getting vaccinated is more likely.
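The same reasoning can be checked numerically by converting each log-odds back to odds with exp() and then to a probability. This sketch uses the two log-odds from the questions above; the vector names are my own.

```r
log_odds <- c(thanksgiving_travel = -0.7, flu_vaccine = 0.8)

odds <- exp(log_odds)      # odds = e^(log-odds)
prob <- odds / (1 + odds)  # probability = odds / (1 + odds)

# plogis() computes the same probability directly from the log-odds
stopifnot(isTRUE(all.equal(prob, plogis(log_odds))))

round(data.frame(log_odds, odds, prob), 3)
```

A negative log-odds maps to odds below 1 and a probability below 0.5, while a positive log-odds maps to odds above 1 and a probability above 0.5.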