Multiple Linear Regression Lab

Author

Evie Bottoms

Loading Libraries

library(psych) # for the describe() command
library(car) # for the vif() command
library(sjPlot) # to visualize our results

Importing Data

d <- read.csv(file="Data/mydata.csv", header=T)

# since we're focusing on our continuous variables, we're going to drop our categorical variables. this keeps the correlation matrix and regression diagnostics below simpler.
d <- subset(d, select=-c(gender, phys_sym))

Examine Your Variables

# use the scale() command to standardize all of your variables (IVs and DVs)
d$moa_independence <- scale(d$moa_independence, center=T, scale=T)
d$swb <- scale(d$swb, center=T, scale=T)
d$belong <- scale(d$belong, center=T, scale=T)
d$stress <- scale(d$stress, center=T, scale=T)

# also use histograms to examine your continuous variables
hist(d$moa_independence)

hist(d$swb)

hist(d$belong)

hist(d$stress)

# use scatterplots to examine your continuous variables in pairs
plot(d$moa_independence, d$swb)

plot(d$moa_independence, d$belong)

plot(d$moa_independence, d$stress)

plot(d$swb, d$belong)

plot(d$swb, d$stress)

plot(d$belong, d$stress)
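
# if you'd rather see every pairwise relationship at once, base R's pairs()
# command draws the full scatterplot grid in one call (an optional shortcut;
# the individual plot() calls above work fine too)
pairs(d)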

# create a correlation matrix to examine the relationships between your variables
corr_output_m <- corr.test(d)
corr_output_m
Call:corr.test(x = d)
Correlation matrix 
                 moa_independence   swb belong stress
moa_independence             1.00  0.10   0.02  -0.03
swb                          0.10  1.00  -0.15  -0.51
belong                       0.02 -0.15   1.00   0.30
stress                      -0.03 -0.51   0.30   1.00
Sample Size 
[1] 3090
Probability values (Entries above the diagonal are adjusted for multiple tests.) 
                 moa_independence swb belong stress
moa_independence             0.00   0   0.32   0.32
swb                          0.00   0   0.00   0.00
belong                       0.30   0   0.00   0.00
stress                       0.16   0   0.00   0.00

 To see confidence intervals of the correlations, print with the short=FALSE option

State Your Hypothesis - PART OF YOUR WRITEUP

H1: We predict that markers of adulthood will be positively related to perceived stress score.

H2: We predict that satisfaction with life will be negatively related to perceived stress score.

H3: We predict that the need to belong will be positively related to perceived stress score.

H4: When we examine all three independent variables in conjunction with each other, satisfaction with life will be the strongest predictor of perceived stress score. The effect of markers of adulthood will disappear when accounting for the effects of the other variables.

Higher satisfaction with life will be associated with lower levels of perceived stress, while a stronger need to belong and a greater number of markers of adulthood will be associated with higher levels of perceived stress. Specifically: Satisfaction with Life will be negatively related to Perceived Stress, such that higher scores on Satisfaction with Life will be predictive of lower scores on Perceived Stress. Need to Belong will be positively related to Perceived Stress, such that higher scores on Need to Belong will be predictive of higher scores on Perceived Stress. Markers of Adulthood will be positively related to Perceived Stress, such that higher scores on Markers of Adulthood will be predictive of higher scores on Perceived Stress.

Check Your Assumptions

Multiple Linear Regression Assumptions

  • Observations should be independent (confirmed by data report)
  • Number of cases should be adequate (N ≥ 80 + 8m, where m is the number of IVs). If you don’t have enough cases, the estimates won’t be trustworthy. (will check this below)
  • Independent variables should not be too correlated (aka multicollinearity). (will check this below)
  • Relationship between the variables should be linear. (will check this below)
  • Outliers should be identified and removed. (will check this below)
  • Residuals should be normally distributed and have constant variance. (will check this below)

Check Number of Cases

needed <- 80 + 8*3
nrow(d) >= needed
[1] TRUE

Run a Multiple Linear Regression

To check the following assumptions, we run our regression and then check some output and diagnostic plots BEFORE looking at our results.

# # use this commented-out section only if you need to remove outliers
# # to drop a single outlier, remove the # at the beginning of the line and use this code:
# # d <- subset(d, row_id != 1108)
# 
# # to drop multiple outliers, remove the # at the beginning of the line and use this code:
# # d <- subset(d, !(row_id %in% c(1108, 602)))
# 
# use the lm() command to run the regression
# dependent/outcome variable on the left, independent/predictor variables on the right
reg_model <- lm(stress ~ moa_independence + swb + belong, data = d)

Check multicollinearity

  • Higher values indicate more multicollinearity. This usually requires dropping a variable. For your homework, you will need to discuss multicollinearity and any high values, but you don’t have to drop any variables.
  • Cutoff is usually 5
vif(reg_model)
moa_independence              swb           belong 
        1.012145         1.033861         1.023002 
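
If it helps to see where these numbers come from: the VIF for a predictor is 1/(1 − R²) from an auxiliary regression of that predictor on the other predictors. Here is a minimal sketch for swb (optional; the vif() command above already does this for you):

# regress swb on the other two predictors, then compute 1 / (1 - R^2);
# the result should match vif(reg_model)["swb"]
aux_model <- lm(swb ~ moa_independence + belong, data = d)
1 / (1 - summary(aux_model)$r.squared)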

Check linearity with Residuals vs Fitted plot

READ THIS TEXT

This plot (below) shows the residuals for each case and the fitted line. The red line is the average residual for the specified point of the dependent variable. If the assumption of linearity is met, the red line should be horizontal. This indicates that the residuals average to around zero. You can see that for this lab, the plot shows some non-linearity because there are more datapoints below the regression line than there are above it. Thus, there are some negative residuals that don’t have positive residuals to cancel them out. However, a bit of deviation is okay – just like with skewness and kurtosis, there’s a range that we can work in before non-normality or non-linearity becomes a critical issue. For some examples of good Residuals vs Fitted plots and ones that show serious errors, check out this page.

I created the residuals versus fitted plot for my regression model and compared it to the examples of good and bad diagnostic plots provided. My plot shows a fairly random scatter of residuals centered around zero, with no clear curvature or funnel shape. The spread of the residuals appears relatively consistent across the fitted values, which supports the assumption of homoscedasticity. While there are a few outliers, they do not indicate a serious violation of model assumptions. Based on this, I believe my plot more closely resembles the “good” examples and indicates that the model assumptions are reasonably met.

plot(reg_model, 1)

Check for outliers using Cook’s distance and a Residuals vs Leverage plot

READ THIS TEXT

The plots below both address leverage, or how much each data point is able to influence the regression line. Outliers are points that have undue influence on the regression line, the way that Bill Gates entering the room has an undue influence on the mean income.

The first plot, Cook’s distance, is a visualization of a score called (you guessed it) Cook’s distance, calculated for each case (aka row or participant) in the dataframe. Cook’s distance tells us how much the regression would change if the point was removed. The second plot also includes the residuals in the examination of leverage. The standardized residuals are on the y-axis and leverage is on the x-axis; this shows us which points have high residuals (are far from the regression line) and high leverage. Points that have large residuals and high leverage are especially worrisome, because they are far from the regression line but are also exerting a large influence on it.

After generating the Cook’s distance and the residuals versus leverage plots, I identified a few potentially influential observations. Points 458, 362, and 2429 stood out with relatively high Cook’s distances, suggesting they have a strong influence on the regression model. In the residuals versus leverage plot, these same points also showed high residuals and/or high leverage. These characteristics suggest that the points are both far from the regression line and are pulling on the model fit, which makes them outliers worth further investigation. While not extreme enough to immediately remove, I would consider running the model again without them to assess their impact.

# Cook's distance
plot(reg_model, 4)

# Residuals vs Leverage
plot(reg_model, 5)
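
If you want the influential cases as numbers rather than reading them off the plot, the cooks.distance() command returns the Cook’s distance for every row. A minimal sketch (the 4/n cutoff is one common rule of thumb, not a hard rule):

# pull Cook's distance for each case and list the most influential ones
cooks_d <- cooks.distance(reg_model)
head(sort(cooks_d, decreasing = TRUE), 5)

# flag cases above a common rule-of-thumb cutoff of 4/n
which(cooks_d > 4 / nrow(d))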

Check homogeneity of variance in a Scale-Location plot

READ THIS TEXT

This plot is similar to the ones we’ve seen, but it shows us the standardized residuals across the range of the regression line. Because the residuals are standardized, large residuals (whether positive or negative) are at the top of the plot, while small residuals (whether positive or negative) are at the bottom of the plot. If the assumption of homogeneity of variance (also called homoscedasticity) is met, the red line should be mostly horizontal. If it deviates from the mean line, that means that the variance is smaller or larger at that point of the regression line. Once again, you can check out this page for some other examples of this type of plot. (Notice that the Scale-Location plot is the third in the grids.)

The scale-location plot helps assess whether the residuals have constant variance across the fitted values. In my plot, the red line is mostly flat, indicating that the assumption of homoscedasticity is largely met. While there is a slight increase in variance at the higher end of the fitted values, it is not severe. Additionally, a couple of observations stand out again as potential outliers. Overall, this plot suggests that the residual variance is fairly consistent, and the model’s assumptions are reasonably upheld.

plot(reg_model, 3)
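
For a numeric companion to the plot, the car package (already loaded for vif()) includes ncvTest(), a test of non-constant error variance. Treat this as an optional check: with roughly 3,000 cases, even trivial deviations can come out statistically significant, so the plot remains the main evidence.

# test for non-constant variance; a small p-value suggests heteroscedasticity
ncvTest(reg_model)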

Check normality of residuals with a Q-Q plot

READ THIS TEXT

This plot is a bit new. It’s called a Q-Q plot and shows the standardized residuals plotted against a normal distribution. If our variables are perfectly normal, the points will fit on the dashed line perfectly. This page shows how different types of non-normality appear on a Q-Q plot. It’s normal for Q-Q plots to show a bit of deviation at the ends.

This page also shows some examples that help us put our Q-Q plot into context. Although it isn’t perfect, we don’t have any serious issues and are okay to proceed.

The Q-Q plot shows that most of the standardized residuals follow a straight diagonal line, indicating that the assumption of normality is reasonably met. There is some slight deviation at the tails, which is common in large datasets, but not severe enough to suggest a serious issue. Overall, the distribution of residuals appears to be approximately normal, and the model assumptions remain intact.

plot(reg_model, 2)
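
A histogram of the residuals is another quick, optional way to eyeball the same assumption the Q-Q plot checks:

# plot the distribution of the residuals; it should look roughly normal
hist(residuals(reg_model))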

Issues with My Data - PART OF YOUR WRITEUP

All necessary assumptions for multiple linear regression were examined. Independence of observations was confirmed based on the design of the dataset. The sample size was sufficient, with over 3,000 observations, far exceeding the required minimum (N ≥ 104 for 3 predictors). Multicollinearity was not a concern, as all variance inflation factor (VIF) values were well below the common cutoff of 5 (VIFs ranged from 1.01 to 1.03). The Residuals vs. Fitted plot suggested no major deviations from linearity, with residuals randomly scattered around zero. The Scale-Location plot indicated homoscedasticity was reasonably met, though minor variance increases were observed at higher fitted values. Cook’s distance and residuals vs. leverage plots flagged a few influential points (e.g., observations 458, 362, and 2429), but none appeared to have undue influence warranting immediate removal. Finally, the Q-Q plot showed that residuals were approximately normally distributed, with only minor deviation at the tails. Overall, the assumptions were sufficiently met for reliable interpretation of the regression model.

View Test Output

Effect size cutoffs from Cohen (1988):

  • Trivial: < .1
  • Small: between .1 and .3
  • Medium: between .3 and .5
  • Large: > .5

summary(reg_model)

Call:
lm(formula = stress ~ moa_independence + swb + belong, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9083 -0.5360 -0.0004  0.5523  3.3709 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       9.815e-17  1.495e-02   0.000    1.000    
moa_independence  1.958e-02  1.504e-02   1.302    0.193    
swb              -4.738e-01  1.520e-02 -31.173   <2e-16 ***
belong            2.343e-01  1.512e-02  15.494   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8308 on 3086 degrees of freedom
Multiple R-squared:  0.3104,    Adjusted R-squared:  0.3098 
F-statistic: 463.1 on 3 and 3086 DF,  p-value: < 2.2e-16
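
The write-up and Table 1 below report 95% confidence intervals for the coefficients, which summary() doesn’t print; confint() will (a one-line addition):

# 95% confidence intervals for each regression coefficient
confint(reg_model)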

Write Up Results

This study examined whether markers of adulthood, satisfaction with life, and the need to belong predicted perceived stress. It was hypothesized that satisfaction with life would be negatively associated with stress, while both the need to belong and markers of adulthood would be positively associated with stress. Additionally, it was expected that satisfaction with life would be the strongest predictor, and that the effect of markers of adulthood would diminish when all predictors were considered together.

The overall multiple linear regression model was significant, F(3, 3086) = 463.1, p < .001, explaining approximately 31% of the variance in perceived stress (R² = .310, Adj. R² = .310). As shown in Table 1, satisfaction with life was a significant negative predictor of perceived stress, b = -0.47, SE = 0.02, 95% CI [-0.50, -0.44], p < .001. This indicates that individuals with higher satisfaction with life reported significantly lower levels of perceived stress, a medium effect by the Cohen (1988) cutoffs above that approaches the large threshold. The need to belong also significantly predicted perceived stress in a positive direction, b = 0.23, SE = 0.02, 95% CI [0.20, 0.26], p < .001, indicating that a stronger need to belong is associated with higher perceived stress; this reflects a small effect size. Markers of adulthood, however, were not a significant predictor of perceived stress when the other variables were included in the model, b = 0.02, SE = 0.02, 95% CI [-0.01, 0.05], p = .193. This supports the hypothesis that the effect of markers of adulthood would diminish when accounting for satisfaction with life and the need to belong. These results suggest that subjective well-being and social needs play a stronger role in perceived stress than traditional adult role markers.
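Table 1 below is formatted like the output of tab_model() from sjPlot (the package loaded at the top “to visualize our results”), so a sketch along these lines would reproduce it:

# produce a formatted regression table with estimates, CIs, and p-values
tab_model(reg_model)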

Table 1: Regression output for the relationships between markers of adulthood, satisfaction with life, need to belong, and perceived stress

                                 Perceived Stress Score (Stress)
Predictors                     Estimate    SE    95% CI           p
Intercept                          0.00  0.01    -0.03 – 0.03     1.000
Markers of Adulthood               0.02  0.02    -0.01 – 0.05     0.193
Satisfaction with Life (SWB)      -0.47  0.02    -0.50 – -0.44    <0.001
Need to Belong                     0.23  0.02     0.20 – 0.26     <0.001
Observations                       3090
R² / R² adjusted                   0.310 / 0.310


References

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. New York, NY: Routledge Academic.