#install.packages("sjPlot")
library(psych) # for the describe() command
library(car) # for the vif() command
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
library(sjPlot) # to visualize our results
d <- read.csv(file="Data/projectdata.csv", header=T)
We hypothesize that anxiety symptoms (GAD-7) and resilience (BRS) will significantly predict depressive symptom severity (PHQ-9), and that the relationships will go in opposing directions. Specifically, those with higher anxiety symptoms will report greater depressive symptom severity, while those with higher resilience will report less depressive symptom severity.
# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again and be sure that everything is correct
str(d)
## 'data.frame': 297 obs. of 7 variables:
## $ X : int 7888 7365 8747 7357 8760 8654 8272 8738 7911 8463 ...
## $ gender : chr "male" "female" "female" "female" ...
## $ mhealth: chr "none or NA" "none or NA" "none or NA" "none or NA" ...
## $ phq : num 2 1.78 1 1.11 1.11 ...
## $ gad : num 4 1.43 1.14 1 1 ...
## $ brs : num 2 3.83 3.83 4 4.67 ...
## $ pss : num 3.75 2.25 1.5 1.75 2.5 1.75 3 2.5 2.5 2.25 ...
# Place only continuous variables of interest in new dataframe, and name it "cont"
cont <- na.omit(subset(d, select=c(phq, gad, brs)))
cont$row_id <- 1:nrow(cont)
# Standardize all IVs
cont$gad <- scale(cont$gad, center=T, scale=T)
cont$brs <- scale(cont$brs, center=T, scale=T)
# you can use the describe() command on an entire dataframe (d) or just on a single variable
describe(cont)
## vars n mean sd median trimmed mad min max range skew
## phq 1 297 2.63 0.85 2.67 2.64 0.99 1.00 4.00 3.00 -0.04
## gad 2 297 0.00 1.00 0.12 0.02 1.16 -1.76 1.53 3.29 -0.12
## brs 3 297 0.00 1.00 -0.17 -0.01 0.87 -1.92 2.75 4.67 0.15
## row_id 4 297 149.00 85.88 149.00 149.00 109.71 1.00 297.00 296.00 0.00
## kurtosis se
## phq -1.00 0.05
## gad -1.14 0.06
## brs -0.58 0.06
## row_id -1.21 4.98
# also use histograms to examine your continuous variables (all IVs and DV)
hist(cont$gad)
hist(cont$brs)
hist(cont$phq)
# last, use scatterplots to examine each pairing of your continuous variables together
plot(cont$gad, cont$phq) # PUT YOUR DV 2ND (Y-AXIS)
plot(cont$brs, cont$phq) # PUT YOUR DV 2ND (Y-AXIS)
plot(cont$gad, cont$brs) # Check relationship between IVs, order does not matter
corr_output_m <- corr.test(cont)
corr_output_m
## Call:corr.test(x = cont)
## Correlation matrix
## phq gad brs row_id
## phq 1.00 0.78 -0.58 0.56
## gad 0.78 1.00 -0.57 0.51
## brs -0.58 -0.57 1.00 -0.44
## row_id 0.56 0.51 -0.44 1.00
## Sample Size
## [1] 297
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## phq gad brs row_id
## phq 0 0 0 0
## gad 0 0 0 0
## brs 0 0 0 0
## row_id 0 0 0 0
##
## To see confidence intervals of the correlations, print with the short=FALSE option
# CHECK FOR ANY CORRELATIONS AMONG YOUR IVs ABOVE .70 --> BAD (aka multicollinearity)
# ONLY use the commented out section below IF you need to remove outliers AFTER examining the Cook's distance and Residuals vs Leverage plots
#cont <- subset(cont, row_id!=c())
# use the lm() command to run the regression. Put DV on the left, IVs on the right separated by "+"
reg_model <- lm(phq ~ gad + brs, data = cont)
Assumptions we’ve discussed previously:
New assumptions:
needed <- 80 + 8*2
nrow(cont) >= needed
## [1] TRUE
# Variance Inflation Factor = VIF
vif(reg_model)
## gad brs
## 1.483446 1.483446
The plot below shows the residuals for each case and the fitted line. The red line is the average residual for the specified point of the dependent variable. If the assumption of linearity is met, the red line should be horizontal. This indicates that the residuals average to around zero. However, a bit of deviation is okay – just like with skewness and kurtosis, there’s a range that we can work in before non-normality becomes a critical issue. For some examples of good Residuals vs Fitted plot and ones that show serious errors, check out this page.
plot(reg_model, 1)
The plots below both address leverage, or how much each data point is able to influence the regression line. Outliers are points that have undue influence on the regression line, the way that Bill Gates entering the room has an undue influence on the mean income.
The first plot, Cook’s distance, is a visualization of a score called (you guessed it) Cook’s distance, calculated for each case (aka row or participant) in the dataframe. Cook’s distance tells us how much the regression would change if the point was removed. The second plot also includes the residuals in the examination of leverage. The standardized residuals are on the y-axis and leverage is on the x-axis; this shows us which points have high residuals (are far from the regression line) and high leverage. Points that have large residuals and high leverage are especially worrisome, because they are far from the regression line but are also exerting a large influence on it.
# Cook's distance
plot(reg_model, 4)
# Residuals vs Leverage
plot(reg_model, 5)
This plot is a bit new. It’s called a Q-Q plot and shows the standardized residuals plotted against a normal distribution. If our variables are perfectly normal, the points will fit on the dashed line perfectly. This page shows how different types of non-normality appear on a Q-Q plot.
It’s normal for Q-Q plots show a bit of deviation at the ends. This page shows some examples that help us put our Q-Q plot into context.
plot(reg_model, 2)
Before interpreting our results, we assessed our variables to see if they met the assumptions for a multiple linear regression. We detected no serious issues with linearity in a Residuals vs Fitted plot. We did not detect any outliers (by visually analyzing Cook’s Distance and Residuals vs Leverage plots) or any serious issues with the normality of our residuals (by visually analyzing a Q-Q plot), nor were there any issues of multicollinearity among our two independent variables.
summary(reg_model)
##
## Call:
## lm(formula = phq ~ gad + brs, data = cont)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.68926 -0.32916 0.00452 0.35122 1.60655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.63000 0.02985 88.103 < 2e-16 ***
## gad 0.56352 0.03642 15.473 < 2e-16 ***
## brs -0.16599 0.03642 -4.558 7.59e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5145 on 294 degrees of freedom
## Multiple R-squared: 0.6322, Adjusted R-squared: 0.6297
## F-statistic: 252.7 on 2 and 294 DF, p-value: < 2.2e-16
# Note for section below: to type lowercase Beta below (ß) you need to hold down Alt key and type 225 on numeric keypad. If that doesn't work you should be able to copy/paste it from somewhere else
Effect size, based on Regression ß (Beta Estimate) value in our output
To test our hypothesis that anxiety symptoms and resilience would significantly predict depressive symptom severity, we used a multiple regression to model the associations between these variables. We confirmed that our data met the assumptions of a linear regression, with no violations detected.
Our hypothesis was supported. The model was statistically significant, Adj. R2 = .63, F(2, 294) = 252.70, p < .001. Our results indicate that anxiety symptoms positively predicted depressive symptom severity and had a large effect size (ß > 0.50; per Cohen, 1988), while resilience negatively predicted depressive symptom severity and had a small effect size (ß < 0.29). Full output from the regression model is reported in Table 1. This means that people’s depressive symptom severity increases by 0.56 units for every one unit increase in their anxiety symptoms, while it decreases by 0.17 units for every one unit increase in their resilience.
| Depressive Symptom Severity (PHQ-9) | ||||
|---|---|---|---|---|
| Predictors | Estimates | SE | CI | p |
| Intercept | 2.63 | 0.03 | 2.57 – 2.69 | <0.001 |
| Anxiety Symptoms (GAD-7) | 0.56 | 0.04 | 0.49 – 0.64 | <0.001 |
| Resilience (BRS) | -0.17 | 0.04 | -0.24 – -0.09 | <0.001 |
| Observations | 297 | |||
| R2 / R2 adjusted | 0.632 / 0.630 | |||
References
Cohen J. (1988). Statistical Power Analysis for the Behavioral Sciences. New York, NY: Routledge Academic.