Research Scenario

Example: Last week, we examined whether people’s conscientiousness predicted their self-reported health. However, there are likely many variables in addition to conscientiousness that contribute to predicting people’s overall health! Multiple regression allows us to examine how each of multiple predictors contributes to predicting an outcome variable when the other predictors are present in the model.

The data set we’re using in today’s lab contains the following variables:

  • row = Row number
  • pid = Participant ID number
  • consc = Participant’s conscientiousness
  • exercise = The number of days a week the participant reported that they exercise
  • soc_supp = The participant’s perception of the amount of social support they have from the people around them from 1 (none at all) to 5 (a great deal)
  • sr_health = Participant’s self-reported health

A researcher is interested in whether conscientiousness, exercise, and social support each individually significantly predict self-reported health scores when the other predictors are present in the model.

Import Data

data <- import("health.csv")
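
Note: the functions used throughout this lab come from several packages. A minimal sketch of the library() calls this lab assumes is shown below (package names are inferred from the functions used; adjust to match your own setup):

library(rio)        # import()
library(tidyverse)  # %>%, mutate(), summarise(), ggplot()
library(psych)      # pairs.panels()
library(broom)      # augment()
library(olsrr)      # ols_vif_tol()
library(car)        # Anova()
library(lsr)        # etaSquared()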

Data Cleaning & Wrangling

First, let’s check the measurement type that each variable was imported as.

str(data)
## 'data.frame':    60 obs. of  6 variables:
##  $ row      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ pid      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ consc    : num  2.66 5.02 3.85 1.73 3.45 ...
##  $ exercise : int  0 6 3 1 3 4 2 6 3 1 ...
##  $ soc_supp : int  4 4 3 2 2 5 4 2 3 1 ...
##  $ sr_health: num  1.91 4.49 3.4 2.05 4.21 ...

Convert the variables’ measurement types as needed below.

data <- data %>%
  mutate(row = as.factor(row),
         pid = as.factor(pid))

Descriptive Statistics

Let’s construct a descriptive statistics table that provides the mean and standard deviation for the key variables of interest in this study.

descriptives_table <- data %>%
  summarise(M_Consc = mean(consc),
            SD_Consc = sd(consc),
            M_Exercise = mean(exercise),
            SD_Exercise = sd(exercise),
            M_SocialSupport = mean(soc_supp),
            SD_SocialSupport = sd(soc_supp),
            M_Health = mean(sr_health),
            SD_Health = sd(sr_health))

# raw table with no formatting
descriptives_table
##    M_Consc  SD_Consc M_Exercise SD_Exercise M_SocialSupport SD_SocialSupport
## 1 2.847374 0.9702901        2.4    1.509293            2.65         1.161895
##   M_Health SD_Health
## 1 3.053342 0.9985299
# adding table formatting
descriptives_table %>%
  knitr::kable(digits = 2,
               col.names = c("Mean Conscientiousness", "SD Conscientiousness",
                             "Mean Exercise", "SD Exercise",
                             "Mean Social Support", "SD Social Support",
                             "Mean Health", "SD Health"),
               caption = "Descriptive Statistics for Key Study Variables",
               format.args = list(nsmall = 2))
Descriptive Statistics for Key Study Variables

Mean Conscientiousness | SD Conscientiousness | Mean Exercise | SD Exercise | Mean Social Support | SD Social Support | Mean Health | SD Health
2.85 | 0.97 | 2.40 | 1.51 | 2.65 | 1.16 | 3.05 | 1.00
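
As an aside, the same summary table could be built more compactly with dplyr’s across() helper. A minimal sketch, assuming dplyr version 1.0 or later (note that the column names it produces, e.g., consc_M and consc_SD, differ slightly from those above):

# equivalent, more compact descriptives using across()
data %>%
  summarise(across(c(consc, exercise, soc_supp, sr_health),
                   list(M = mean, SD = sd)))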

Data Visualization

We can create a scatterplot matrix using the pairs.panels() function to visualize the relationship between each pair of variables in the study.

data %>%
  dplyr::select(consc, exercise, soc_supp, sr_health) %>%
  pairs.panels(lm = TRUE)

Centering the Continuous Predictors

Let’s center the continuous predictors prior to fitting the model. Remember that centering the continuous predictors serves to:

  • Create a more meaningful value for b0 (the y-intercept will be the predicted outcome when each predictor is at its mean)
  • Reduce the redundancy (multicollinearity) between the continuous predictors and the continuous X continuous interaction term formed from them

Although we won’t be including a continuous X continuous interaction term in this week’s lab (we will in next week’s lab!), we’re going to go ahead and center the continuous predictors to produce a more meaningful y-intercept value.

To center the continuous predictors, we’ll create new variables in our data set after centering each of the predictors around their means using the scale() function.

data$consc_c <- scale(data$consc, center = TRUE, scale = FALSE) 
data$exercise_c <- scale(data$exercise, center = TRUE, scale = FALSE)
data$soc_supp_c <- scale(data$soc_supp, center = TRUE, scale = FALSE)
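
One thing to keep in mind: scale() returns a one-column matrix rather than a plain numeric vector (you’ll see this later in the augment() output, where the centered columns print as consc_c[,1]). If you’d rather keep plain numeric columns, an equivalent alternative is to subtract each mean inside mutate():

# equivalent centering that keeps the columns as plain numeric vectors
data <- data %>%
  mutate(consc_c    = consc - mean(consc),
         exercise_c = exercise - mean(exercise),
         soc_supp_c = soc_supp - mean(soc_supp))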

Fit the Model

Fit a linear model predicting sr_health from consc_c, exercise_c, and soc_supp_c. Call the object model.

model <- lm(sr_health ~ consc_c + exercise_c + soc_supp_c, data = data)
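
In equation form, the model we’re fitting is shown below, where the \(c\) subscripts indicate the centered predictors:

\[\hat{Health} = b_0 + b_1(Conscientiousness_c) + b_2(Exercise_c) + b_3(SocialSupport_c)\]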

Before we interpret the model output, let’s check to make sure the model assumptions were met.

Assessing Multicollinearity

Multicollinearity occurs when two or more predictors in our model are highly correlated with one another.

First, this poses an issue because if two or more predictors are highly redundant, it becomes difficult to tell what unique relationship each one has with the outcome variable.

Second, high multicollinearity widens the confidence intervals around the predictors’ parameter estimates (i.e., it inflates their standard errors), which makes it less likely that any individual predictor will be significant.

We can assess the multicollinearity among predictors in a model by calculating the tolerance for each predictor, where tolerance is:

\[Tolerance = 1 - R_{p}^2\]

  • where \(R_{p}^2\) is the R-squared value resulting from a model in which a particular predictor, p, is predicted by all the other p-1 predictors in the model.

Thus, tolerance is a measure of how much of a predictor’s variance is unique to it (i.e., not shared with the other predictors included in the model).

Another measure of multicollinearity that is often calculated is called VIF and is simply the inverse of tolerance:

\[VIF = \frac{1}{Tolerance}\]

We can obtain both of these measures of multicollinearity for the predictors in our model by using the ols_vif_tol() function from the olsrr package.

ols_vif_tol(model)
##    Variables Tolerance      VIF
## 1    consc_c 0.7216885 1.385639
## 2 exercise_c 0.7121114 1.404275
## 3 soc_supp_c 0.6752128 1.481015
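
As a by-hand check, the tolerance (and VIF) for conscientiousness can be reproduced by regressing it on the other two predictors; a quick sketch (the raw, uncentered variables are used here because centering does not change R-squared):

# by-hand tolerance for conscientiousness: 1 - R-squared from regressing it on the other predictors
r2_consc <- summary(lm(consc ~ exercise + soc_supp, data = data))$r.squared
1 - r2_consc        # tolerance, should be approximately 0.72
1 / (1 - r2_consc)  # VIF, should be approximately 1.39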

Either a low tolerance (below 0.20 is one rule of thumb) or a high VIF (above 5 or 10) indicates a problem with multicollinearity. All three predictors have tolerances well above 0.20 and VIFs well below 5, so multicollinearity does not appear to pose a problem for the current set of predictors!

Check whether model assumptions were met

Next, let’s check whether the assumptions of 1) normally distributed errors, 2) independence of errors, and 3) homogeneity of variances were met by the current model.

Checking whether errors are normally distributed

Distribution of the Residuals

# storing table containing residuals
augment_model <- augment(model)
augment_model
## # A tibble: 60 × 10
##    sr_health consc_c[,1] exercise_c[,1] soc_supp_c[,1] .fitted  .resid   .hat
##        <dbl>       <dbl>          <dbl>          <dbl>   <dbl>   <dbl>  <dbl>
##  1      1.91     -0.188            -2.4           1.35    2.35 -0.443  0.144 
##  2      4.49      2.17              3.6           1.35    5.08 -0.586  0.148 
##  3      3.40      1.00              0.6           0.35    3.46 -0.0527 0.0355
##  4      2.05     -1.11             -1.4          -0.65    2.22 -0.174  0.0434
##  5      4.21      0.606             0.6          -0.65    3.18  1.03   0.0440
##  6      5.11      1.16              1.6           2.35    4.42  0.693  0.0872
##  7      3.09     -0.0301           -0.4           1.35    3.23 -0.140  0.0581
##  8      4.58      1.06              3.6          -0.65    4.50  0.0734 0.189 
##  9      3.90      0.255             0.6           0.35    3.42  0.485  0.0197
## 10      2.64      0.192            -1.4          -1.65    2.03  0.611  0.0726
## # ℹ 50 more rows
## # ℹ 3 more variables: .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
# plotting the distribution of the residuals
ggplot(data = augment_model, aes(x = .resid)) + 
  geom_density(fill = "purple") + # density plot of the residuals
  stat_function(linetype = 2,     # a normal distribution overlaid on top for reference
                fun = dnorm, 
                args = list(mean = mean(augment_model$.resid), 
                            sd   = sd(augment_model$.resid))) 

Q-Q Plot

ggplot(model) +
  geom_abline() + 
  stat_qq(aes(sample = .stdresid))
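
Equivalently, if you prefer to build the Q-Q plot from the augmented data frame, ggplot2 (version 3.0.0 or later) provides stat_qq_line() for the reference line:

ggplot(data = augment_model, aes(sample = .std.resid)) +
  stat_qq() +      # sample quantiles vs. theoretical normal quantiles
  stat_qq_line()   # reference line through the first and third quartiles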

The residuals appear to be approximately normally distributed.

Checking independence of errors

# add the ID column to the table containing the residuals
augment_model$pid <- data$pid

# plot the ID numbers against the residuals
ggplot(data = augment_model, aes(x = pid, y = .resid)) + 
  geom_point() +  
  geom_smooth(se = F) +
  geom_hline(yintercept = 0)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

There does not appear to be a systematic pattern between the model’s residuals and participant ID - good!

Checking homogeneity of variance

plot(model, 1)

The residuals look to have approximately the same amount of variability across the range of the model’s fitted values - good!

Interpreting the Model Output

Recall that we can examine the model output using different functions:

  • summary() & confint()
  • anova()

Let’s start with summary() and confint().

summary(model)
## 
## Call:
## lm(formula = sr_health ~ consc_c + exercise_c + soc_supp_c, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.16780 -0.36793 -0.08951  0.36588  1.05905 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.05334    0.06467  47.216  < 2e-16 ***
## consc_c      0.05170    0.07911   0.653 0.516158    
## exercise_c   0.43423    0.05120   8.481 1.26e-11 ***
## soc_supp_c   0.26001    0.06830   3.807 0.000351 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5009 on 56 degrees of freedom
## Multiple R-squared:  0.7611, Adjusted R-squared:  0.7484 
## F-statistic: 59.48 on 3 and 56 DF,  p-value: < 2.2e-16
confint(model)
##                  2.5 %    97.5 %
## (Intercept)  2.9237989 3.1828860
## consc_c     -0.1067902 0.2101804
## exercise_c   0.3316653 0.5368041
## soc_supp_c   0.1231781 0.3968362
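
As a quick sanity check, each 95% confidence interval is just the parameter estimate plus or minus the critical t value (on 56 df) times its standard error; for example, for exercise_c (values taken from the summary() output above):

# by-hand 95% CI for exercise_c: estimate +/- critical t * SE
0.43423 + c(-1, 1) * qt(0.975, df = 56) * 0.05120  # approximately [0.33, 0.54]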

Question: Interpret the meaning of each of the model’s parameter estimates.

b0 = 3.05, the predicted self-reported health score for someone at the mean on conscientiousness, number of days spent exercising, and perceived social support

b1 = 0.05, when controlling for exercise and social support, the model predicts that for every 1-unit increase in conscientiousness, self-reported health scores will increase by 0.05 units

b2 = 0.43, when controlling for conscientiousness and social support, the model predicts that for every 1-unit increase in exercise, self-reported health scores will increase by 0.43 units

b3 = 0.26, when controlling for conscientiousness and exercise, the model predicts that for every 1-unit increase in social support, self-reported health scores will increase by 0.26 units
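
To see the parameter estimates in action, we could generate a prediction for a hypothetical participant, say, someone at the mean on conscientiousness and social support who exercises one more day per week than average (new_person is just an illustrative name):

# hypothetical participant: at the mean on consc_c and soc_supp_c, 1 unit above the mean on exercise_c
new_person <- data.frame(consc_c = 0, exercise_c = 1, soc_supp_c = 0)
predict(model, newdata = new_person)  # approximately b0 + b2 = 3.05 + 0.43 = 3.49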

Question: As a set, did conscientiousness, exercise, and social support altogether significantly predict self-reported health scores?

Yes, \(R^2\) = 0.76, F(3, 56) = 59.48, p < .001.

Question: Individually, which of the predictors significantly predicted self-reported health scores when controlling for the other predictors in the model?

Controlling for exercise and social support, conscientiousness did not significantly predict self-reported health scores, b1 = 0.05, 95%CI[-0.11, 0.21], t(56) = 0.65, p = 0.516.

Controlling for conscientiousness and social support, number of days spent exercising significantly, positively predicted self-reported health scores, b2 = 0.43, 95%CI[0.33, 0.54], t(56) = 8.48, p < .001.

Controlling for conscientiousness and exercise, perceived levels of social support significantly, positively predicted self-reported health scores, b3 = 0.26, 95%CI[0.12, 0.40], t(56) = 3.81, p < .001.

Next, let’s examine the Anova() output. We have to use the Anova() function from the car package in order to choose the correct type of sums of squares to be consistent with the calculation used by the lm() function. We want to choose Type 3 sums of squares, as shown below.

Anova(model, type = 3)
## Anova Table (Type III tests)
## 
## Response: sr_health
##             Sum Sq Df  F value    Pr(>F)    
## (Intercept) 559.37  1 2229.388 < 2.2e-16 ***
## consc_c       0.11  1    0.427 0.5161583    
## exercise_c   18.05  1   71.925 1.258e-11 ***
## soc_supp_c    3.64  1   14.490 0.0003513 ***
## Residuals    14.05 56                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We could also report on the significance of each of the predictors in the model based on the F-statistics shown in the Anova() output, but it would be redundant with the results we reported from the summary() output. It’s an alternative option for reporting on the results, though (and useful for checking by-hand calculations)!
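
For a predictor with a single degree of freedom, the F value in the Anova() table is simply the square of the corresponding t value from summary(); for example, for exercise_c:

# for a 1-df predictor, F = t^2
8.481^2  # approximately 71.93, the F value for exercise_c in the Anova() table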

Effect Size

We can get two measures of effect size, \(\eta^2\) and \(\eta^2_{Partial}\), by passing our model to the etaSquared() function.

Recall from lecture the difference between how these two measures of effect size are calculated:

\[\eta^2 = SSR/SS_{Total}\]

\[\eta^2_{Partial} = SSR/SSE(C)\]

When calculating \(\eta^2\), the denominator will stay the same across all three predictors.

However, when calculating \(\eta^2_{Partial}\) for each predictor, the denominator will correspond to the SSE remaining in a model that predicts self-reported health scores from the other two predictors.
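
To make the two denominators concrete, here is a by-hand sketch reproducing the effect sizes for exercise_c from the Type 3 sums of squares in the Anova() table above:

# by-hand effect sizes for exercise_c using the Type 3 sums of squares
ss          <- Anova(model, type = 3)
ss_exercise <- ss["exercise_c", "Sum Sq"]
ss_error    <- ss["Residuals", "Sum Sq"]
ss_total    <- sum((data$sr_health - mean(data$sr_health))^2)  # total SS of the outcome

ss_exercise / ss_total                  # eta-squared, approximately 0.31
ss_exercise / (ss_exercise + ss_error)  # partial eta-squared, approximately 0.56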

Again, let’s set type = 3 so the type of sums of squares is consistent with how the lm() function calculates SS.

etaSquared(model, type = 3)
##                 eta.sq eta.sq.part
## consc_c    0.001821084 0.007566601
## exercise_c 0.306776488 0.562243320
## soc_supp_c 0.061804894 0.205565569

Question: How would you interpret the size of each predictor’s relationship to self-reported health scores?

Based on the eta-squared values, conscientiousness has an effect size near zero, exercise has a large effect size, and social support has a small effect size.

APA-Style Summary

Using a multiple linear regression analysis, we found that, as a set, conscientiousness, exercise, and social support accounted for a significant amount of the variability in people’s self-reported health scores, \(R^2\) = 0.76, F(3, 56) = 59.48, p < .001.

Specifically, controlling for conscientiousness and social support, number of days spent exercising significantly, positively predicted self-reported health scores, b2 = 0.43, 95%CI[0.33, 0.54], t(56) = 8.48, p < .001. Exercise had a large effect size, \(\eta^2\) = 0.31, \(\eta^2_{Partial}\) = 0.56.

Additionally, controlling for conscientiousness and exercise, perceived levels of social support also significantly, positively predicted self-reported health scores, b3 = 0.26, 95%CI[0.12, 0.40], t(56) = 3.81, p < .001. Perceived social support had a small effect size, \(\eta^2\) = 0.06, \(\eta^2_{Partial}\) = 0.21.

Finally, controlling for exercise and social support, conscientiousness did not significantly predict self-reported health scores, b1 = 0.05, 95%CI[-0.11, 0.21], t(56) = 0.65, p = 0.516. Additionally, the effect size corresponding to conscientiousness was near zero, \(\eta^2\) = 0.00, \(\eta^2_{Partial}\) = 0.01.