Example: Last week, we examined whether people’s conscientiousness predicted their self-reported health. However, there are likely many variables beyond conscientiousness that contribute to predicting people’s overall health! Multiple regression allows us to examine how multiple predictors each contribute to predicting an outcome variable when the other predictors are present in the model.
The data set we’re using in today’s lab contains the following variables:

- row: the row number
- pid: the participant’s ID number
- consc: the participant’s conscientiousness score
- exercise: the number of days the participant spent exercising
- soc_supp: the participant’s perceived social support score
- sr_health: the participant’s self-reported health score
A researcher is interested in whether conscientiousness, exercise, and social support each individually significantly predict self-reported health scores when the other predictors are present in the model.
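Before importing the data, note that this lab relies on several packages. Here’s a minimal setup sketch, assuming the functions used throughout the lab come from their usual packages (import() from rio, the pipe and mutate() from the tidyverse, pairs.panels() from psych, ols_vif_tol() from olsrr, augment() from broom, Anova() from car, and etaSquared() from lsr):

library(rio)       # import()
library(tidyverse) # %>%, mutate(), ggplot2
library(psych)     # pairs.panels()
library(olsrr)     # ols_vif_tol()
library(broom)     # augment()
library(car)       # Anova()
library(lsr)       # etaSquared()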
data <- import("health.csv")
First, let’s check the measure type that each variable was imported as.
str(data)
## 'data.frame': 60 obs. of 6 variables:
## $ row : int 1 2 3 4 5 6 7 8 9 10 ...
## $ pid : int 1 2 3 4 5 6 7 8 9 10 ...
## $ consc : num 2.66 5.02 3.85 1.73 3.45 ...
## $ exercise : int 0 6 3 1 3 4 2 6 3 1 ...
## $ soc_supp : int 4 4 3 2 2 5 4 2 3 1 ...
## $ sr_health: num 1.91 4.49 3.4 2.05 4.21 ...
Convert the variables’ measure types as needed below.
data <- data %>%
mutate(row = as.factor(row),
pid = as.factor(pid))
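We can quickly confirm the conversion worked (a small sketch):

# confirm that row and pid are now factors
sapply(data[, c("row", "pid")], class)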
Let’s construct a descriptive statistics table that provides the mean and standard deviation for the key variables of interest in this study.
descriptives_table <- data %>%
summarise(M_Consc = mean(consc),
SD_Consc = sd(consc),
M_Exercise = mean(exercise),
SD_Exercise = sd(exercise),
M_SocialSupport = mean(soc_supp),
SD_SocialSupport = sd(soc_supp),
M_Health = mean(sr_health),
SD_Health = sd(sr_health))
# raw table with no formatting
descriptives_table
## M_Consc SD_Consc M_Exercise SD_Exercise M_SocialSupport SD_SocialSupport
## 1 2.847374 0.9702901 2.4 1.509293 2.65 1.161895
## M_Health SD_Health
## 1 3.053342 0.9985299
# adding table formatting
descriptives_table %>%
  knitr::kable(digits = 2,
               col.names = c("Mean Conscientiousness", "SD Conscientiousness",
                             "Mean Exercise", "SD Exercise",
                             "Mean Social Support", "SD Social Support",
                             "Mean Health", "SD Health"),
               caption = "Descriptive Statistics for Key Study Variables",
               format.args = list(nsmall = 2))
| Mean Conscientiousness | SD Conscientiousness | Mean Exercise | SD Exercise | Mean Social Support | SD Social Support | Mean Health | SD Health |
|---|---|---|---|---|---|---|---|
| 2.85 | 0.97 | 2.40 | 1.51 | 2.65 | 1.16 | 3.05 | 1.00 |
We can create a scatterplot matrix using the
pairs.panels() function to visualize the relationship
between each pair of variables in the study.
data %>%
dplyr::select(consc, exercise, soc_supp, sr_health) %>%
pairs.panels(lm = TRUE)
Let’s center the continuous predictors prior to fitting the model. Remember that centering the continuous predictors serves to: 1) make the y-intercept more meaningful (it becomes the predicted outcome for someone at the mean of every predictor), and 2) reduce nonessential multicollinearity between predictors and any interaction terms built from them.
Although we won’t be including a continuous X continuous interaction term in this week’s lab (we will in next week’s lab!), we’re going to go ahead and center the continuous predictors to produce a more meaningful y-intercept value.
To center the continuous predictors, we’ll create new
variables in our data set after centering each of the
predictors around their means using the scale()
function.
# note: scale() returns a one-column matrix, which is why columns like
# consc_c[,1] appear in output later; wrap in as.numeric() if you prefer a plain vector
data$consc_c <- scale(data$consc, center = TRUE, scale = FALSE)
data$exercise_c <- scale(data$exercise, center = TRUE, scale = FALSE)
data$soc_supp_c <- scale(data$soc_supp, center = TRUE, scale = FALSE)
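A quick sanity check: each centered predictor should now have a mean of (numerically) zero.

# means of the centered predictors should be essentially zero
round(c(mean(data$consc_c), mean(data$exercise_c), mean(data$soc_supp_c)), 10)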
Fit a linear model predicting sr_health from
consc_c, exercise_c, and
soc_supp_c. Call the object model.
model <- lm(sr_health ~ consc_c + exercise_c + soc_supp_c, data = data)
Before we interpret the model output, let’s check to make sure the model assumptions were met.
Multicollinearity occurs when two or more predictors in our model are highly correlated with one another.
First, this poses an issue because if two or more predictors are highly redundant, it becomes difficult to tell what unique relationship each one has with the outcome variable.
Second, high multicollinearity inflates the standard errors of the predictors’ parameter estimates, which widens their confidence intervals and makes it less likely for a predictor to be significant.
We can assess the multicollinearity among predictors in a model by calculating each predictor’s tolerance:
\[Tolerance = 1 - R_{p}^2\]
where \(R_{p}^2\) is the \(R^2\) obtained from regressing predictor \(p\) on all of the other predictors in the model. Thus, tolerance is a measure of how much of a predictor’s variance is unique from the other predictors included in the model.
Another measure of multicollinearity that is often calculated is called VIF and is simply the inverse of tolerance:
\[VIF = \frac{1}{Tolerance}\]
We can obtain both of these measures of multicollinearity for the predictors in our model by using the ols_vif_tol() function from the olsrr package.
ols_vif_tol(model)
## Variables Tolerance VIF
## 1 consc_c 0.7216885 1.385639
## 2 exercise_c 0.7121114 1.404275
## 3 soc_supp_c 0.6752128 1.481015
Either a low tolerance (below 0.20 is one rule of thumb) or a high VIF (above 5 or 10) is an indication of a problem with multicollinearity. All three tolerances here are well above 0.20 and all VIFs are below 1.5, so multicollinearity does not appear to pose a problem for the current set of predictors!
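As a check on the formula, here’s a sketch computing the tolerance and VIF for consc_c by hand, by regressing it on the other two predictors (the results should match the ~0.72 and ~1.39 shown above):

# R^2 from regressing consc_c on the other two predictors
r2_p <- summary(lm(as.numeric(consc_c) ~ exercise_c + soc_supp_c, data = data))$r.squared
1 - r2_p       # tolerance = 1 - R^2_p
1 / (1 - r2_p) # VIF = 1 / tolerance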
Next, let’s check whether the assumptions of 1) normally distributed errors, 2) independence of errors, and 3) homogeneity of variances were met by the current model.
# storing table containing residuals
augment_model <- augment(model)
augment_model
## # A tibble: 60 × 10
## sr_health consc_c[,1] exercise_c[,1] soc_supp_c[,1] .fitted .resid .hat
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.91 -0.188 -2.4 1.35 2.35 -0.443 0.144
## 2 4.49 2.17 3.6 1.35 5.08 -0.586 0.148
## 3 3.40 1.00 0.6 0.35 3.46 -0.0527 0.0355
## 4 2.05 -1.11 -1.4 -0.65 2.22 -0.174 0.0434
## 5 4.21 0.606 0.6 -0.65 3.18 1.03 0.0440
## 6 5.11 1.16 1.6 2.35 4.42 0.693 0.0872
## 7 3.09 -0.0301 -0.4 1.35 3.23 -0.140 0.0581
## 8 4.58 1.06 3.6 -0.65 4.50 0.0734 0.189
## 9 3.90 0.255 0.6 0.35 3.42 0.485 0.0197
## 10 2.64 0.192 -1.4 -1.65 2.03 0.611 0.0726
## # ℹ 50 more rows
## # ℹ 3 more variables: .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
# plotting the distribution of the residuals
ggplot(data = augment_model, aes(x = .resid)) +
  geom_density(fill = "purple") + # density of the residuals
  stat_function(linetype = 2, # a normal distribution overlaid on top for reference
                fun = dnorm,
                args = list(mean = mean(augment_model$.resid),
                            sd = sd(augment_model$.resid)))
# Q-Q plot of the standardized residuals against a normal distribution
ggplot(model) +
  geom_abline() + # reference line the points should follow if the residuals are normal
  stat_qq(aes(sample = .stdresid))
The residuals appear to be approximately normally distributed.
# add the ID column to the table containing the residuals
augment_model$pid <- data$pid
# plot the ID numbers against the residuals
ggplot(data = augment_model, aes(x = pid, y = .resid)) +
geom_point() +
geom_smooth(se = F) +
geom_hline(yintercept = 0)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
There does not appear to be a systematic pattern between the model’s residuals and participant ID - good!
# residuals vs. fitted values plot
plot(model, 1)
The residuals look to have approximately the same amount of variability across the range of the model’s fitted values - good!
Recall that we can examine the model output using several different functions: summary(), confint(), and Anova(). Let’s start with summary() and confint().
summary(model)
##
## Call:
## lm(formula = sr_health ~ consc_c + exercise_c + soc_supp_c, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.16780 -0.36793 -0.08951 0.36588 1.05905
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.05334 0.06467 47.216 < 2e-16 ***
## consc_c 0.05170 0.07911 0.653 0.516158
## exercise_c 0.43423 0.05120 8.481 1.26e-11 ***
## soc_supp_c 0.26001 0.06830 3.807 0.000351 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5009 on 56 degrees of freedom
## Multiple R-squared: 0.7611, Adjusted R-squared: 0.7484
## F-statistic: 59.48 on 3 and 56 DF, p-value: < 2.2e-16
confint(model)
## 2.5 % 97.5 %
## (Intercept) 2.9237989 3.1828860
## consc_c -0.1067902 0.2101804
## exercise_c 0.3316653 0.5368041
## soc_supp_c 0.1231781 0.3968362
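As an aside, if you’d like the estimates, t-tests, and confidence intervals combined in a single table, broom’s tidy() function can do that (a sketch):

# one tidy table of estimates, t-tests, and 95% CIs
broom::tidy(model, conf.int = TRUE)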
Question: Interpret the meaning of each of the model’s parameter estimates.
b0 = 3.05, the predicted self-reported health score for someone at the average level of conscientiousness, the average number of days spent exercising, and the average level of perceived social support
b1 = 0.05, when controlling for exercise and social support, the model predicts that for every 1-unit increase in conscientiousness, self-reported health scores will increase by 0.05 units
b2 = 0.43, when controlling for conscientiousness and social support, the model predicts that for every 1-unit increase in exercise, self-reported health scores will increase by 0.43 units
b3 = 0.26, when controlling for conscientiousness and exercise, the model predicts that for every 1-unit increase in social support, self-reported health scores will increase by 0.26 units
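To see b0 in action, we can ask the model for its prediction for a person at the mean of every predictor (i.e., all centered predictors set to 0), which should reproduce the intercept of 3.05:

# predicted score at the mean of all three predictors (equals the intercept)
# note: because scale() stored the centered predictors as one-column matrices,
# R may warn about the variable type here; the prediction itself is unaffected
predict(model, newdata = data.frame(consc_c = 0, exercise_c = 0, soc_supp_c = 0))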
Question: As a set, did conscientiousness, exercise, and social support altogether significantly predict self-reported health scores?
Yes, \(R^2\) = 0.76, F(3, 56) = 59.48, p < .001.
Question: Individually, which of the predictors significantly predicted self-reported health scores when controlling for the other predictors in the model?
Controlling for exercise and social support, conscientiousness did not significantly predict self-reported health scores, b1 = 0.05, 95%CI[-0.11, 0.21], t(56) = 0.65, p = 0.516.
Controlling for conscientiousness and social support, number of days spent exercising significantly, positively predicted self-reported health scores, b2 = 0.43, 95%CI[0.33, 0.54], t(56) = 8.48, p < .001.
Controlling for conscientiousness and exercise, perceived levels of social support significantly, positively predicted self-reported health scores, b3 = 0.26, 95%CI[0.12, 0.40], t(56) = 3.81, p < .001.
Next, let’s examine the Anova() output. We have to use the Anova() function from the car package in order to choose a type of sums of squares consistent with the calculation used by the lm() function. We want Type 3 sums of squares, as shown below.
Anova(model, type = 3)
## Anova Table (Type III tests)
##
## Response: sr_health
## Sum Sq Df F value Pr(>F)
## (Intercept) 559.37 1 2229.388 < 2.2e-16 ***
## consc_c 0.11 1 0.427 0.5161583
## exercise_c 18.05 1 71.925 1.258e-11 ***
## soc_supp_c 3.64 1 14.490 0.0003513 ***
## Residuals 14.05 56
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We could also report on the significance of each of the predictors in
the model based on the F-statistics shown in the Anova()
output, but it would be redundant with the results we reported from the
summary() output. It’s an alternative option for reporting
on the results, though (and useful for checking by-hand
calculations)!
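One such check: because each predictor has 1 degree of freedom, its F value in the Anova() table is simply the square of its t value from summary() (e.g., 8.481^2 ≈ 71.93 for exercise):

# squared t values from summary() match the F values in the Anova table
summary(model)$coefficients[, "t value"]^2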
We can get two measures of effect size, \(\eta^2\) and \(\eta^2_{Partial}\), by passing our model to the etaSquared() function from the lsr package.
Recall from lecture the difference between how these two measures of effect size are calculated:
\[\eta^2 = SSR/SS_{Total}\]
\[\eta^2_{Partial} = SSR/SSE(C)\]
When calculating \(\eta^2\), the denominator will stay the same across all three predictors.
However, when calculating \(\eta^2_{Partial}\) for each predictor, the denominator will correspond to the SSE remaining in a model that predicts self-reported health scores from the other two predictors.
Again, let’s set type = 3 so the type of sums of squares
is consistent with how the lm() function calculates SS.
etaSquared(model, type = 3)
## eta.sq eta.sq.part
## consc_c 0.001821084 0.007566601
## exercise_c 0.306776488 0.562243320
## soc_supp_c 0.061804894 0.205565569
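As a check on the two formulas, here’s a sketch computing both effect sizes for exercise by hand from the Type III sums of squares in the Anova() table above:

# Type III SS in order: intercept, consc_c, exercise_c, soc_supp_c, residuals
ss <- Anova(model, type = 3)[["Sum Sq"]]
ss_total <- sum((data$sr_health - mean(data$sr_health))^2) # total SS in sr_health
ss[3] / ss_total        # eta^2 for exercise, ~0.31
ss[3] / (ss[3] + ss[5]) # partial eta^2 for exercise (SSR / SSE(C)), ~0.56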
Question: How would you interpret the size of each predictor’s relationship to self-reported health scores?
Based on the eta-squared values, conscientiousness has an effect size near zero, exercise has a large effect size, and social support has a small effect size.
Using a multiple linear regression analysis, we found that, as a set, conscientiousness, exercise, and social support accounted for a significant amount of the variability in people’s self-reported health scores, \(R^2\) = 0.76, F(3, 56) = 59.48, p < .001.
Specifically, controlling for conscientiousness and social support, number of days spent exercising significantly, positively predicted self-reported health scores, b2 = 0.43, 95%CI[0.33, 0.54], t(56) = 8.48, p < .001. Exercise had a large effect size, \(\eta^2\) = 0.31, \(\eta^2_{Partial}\) = 0.56.
Additionally, controlling for conscientiousness and exercise, perceived levels of social support also significantly, positively predicted self-reported health scores, b3 = 0.26, 95%CI[0.12, 0.40], t(56) = 3.81, p < .001. Perceived social support had a small effect size, \(\eta^2\) = 0.06, \(\eta^2_{Partial}\) = 0.21.
Finally, controlling for exercise and social support, conscientiousness did not significantly predict self-reported health scores, b1 = 0.05, 95%CI[-0.11, 0.21], t(56) = 0.65, p = 0.516. Additionally, the effect size corresponding to conscientiousness was near zero, \(\eta^2\) = 0.00, \(\eta^2_{Partial}\) = 0.01.