1 Loading Libraries

library(psych) # for the describe() command and the corr.test() command
library(apaTables) # to create our correlation table
library(kableExtra) # to create our correlation table

2 Importing Data

# import the dataset you cleaned previously
# this will be the dataset you'll use throughout the rest of the semester
# use the ARC data you downloaded previously for lab
d <- read.csv(file="Data/arc_clean.csv", header=T)

3 State Your Hypothesis

We predict that stress (measured by the PSS-4), depression symptoms (measured by the PHQ-9), and self-esteem (measured by the RSE-10) will all be correlated with each other. Furthermore, we predict that self-esteem will be lower in participants who are higher in stress or who report more symptoms of depression.

4 Check Your Variables

# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again and be sure that everything is correct
str(d)
## 'data.frame':    1250 obs. of  6 variables:
##  $ X           : int  1 20 30 31 33 57 68 81 86 104 ...
##  $ gender_rc   : chr  "f" "m" "f" "f" ...
##  $ ethnicity_rc: chr  "white" "white" "white" "white" ...
##  $ pss         : num  3.25 3.75 1 3.25 2 4 3.75 1.25 2.5 2.5 ...
##  $ phq         : num  1.33 3.33 1 2.33 1.11 ...
##  $ rse         : num  2.3 1.6 3.9 1.7 3.9 1.8 1.3 3.5 2.6 3 ...
# since we're focusing on our continuous variables, we're going to subset them into their own dataframe. this will make some of the later steps easier.
cont <- subset(d, select=c(pss, phq, rse))

# you can use the describe() command on an entire dataframe (d) or just on a single variable (d$pss)
describe(cont)
##     vars    n mean   sd median trimmed  mad min max range  skew kurtosis   se
## pss    1 1250 2.93 0.95   3.00    2.93 1.11   1   5     4  0.09    -0.74 0.03
## phq    2 1250 2.07 0.86   1.89    1.99 0.99   1   4     3  0.64    -0.69 0.02
## rse    3 1250 2.63 0.71   2.70    2.65 0.74   1   4     3 -0.22    -0.70 0.02
# in my example write-up below I discuss a fake variable (FKE-3) with high kurtosis, which I'll ignore. you don't need to discuss univariate normality in the results write-ups for the labs/homework, but you will need to discuss it in your final manuscript

# also use histograms to examine your continuous variables
hist(d$pss)

hist(d$phq)

hist(d$rse)
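
# histograms are easier to read with a title and axis label. a minimal sketch —
# the title and label text here are just examples, not required for the lab:
hist(d$pss, main = "Perceived Stress (PSS-4)", xlab = "Mean PSS score")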

# last, use scatterplots to examine your continuous variables together
plot(d$pss, d$phq)

plot(d$pss, d$rse)

plot(d$phq, d$rse)
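
# optional: overlaying a least-squares line can make linearity easier to judge
# by eye. a minimal sketch, not required for the lab:
plot(d$pss, d$phq)
abline(lm(phq ~ pss, data = d), col = "red")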

5 Check Your Assumptions

5.1 Pearson’s Correlation Coefficient Assumptions

  • Should have two measurements for each participant
  • Variables should be continuous and normally distributed
  • Outliers should be identified and removed
  • Relationship between the variables should be linear

5.1.1 Checking for Outliers

Note: You are not required to screen out outliers or take any action based on what you see here. This is something you will check and then discuss in your write-up.

d$pss_std <- scale(d$pss, center=T, scale=T)
hist(d$pss_std)

sum(d$pss_std < -3 | d$pss_std > 3)
## [1] 0
d$phq_std <- scale(d$phq, center=T, scale=T)
hist(d$phq_std)

sum(d$phq_std < -3 | d$phq_std > 3)
## [1] 0
d$rse_std <- scale(d$rse, center=T, scale=T)
hist(d$rse_std)

sum(d$rse_std < -3 | d$rse_std > 3)
## [1] 0
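
# since the outlier check is the same for every variable, you could also loop
# over the columns. a compact sketch of the same logic as above:
for (v in c("pss", "phq", "rse")) {
  z <- scale(d[[v]])
  cat(v, "outliers:", sum(z < -3 | z > 3), "\n")
}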

5.2 Issues with My Data

All but one of my variables met the assumptions of Pearson’s correlation coefficient. One variable, a fake measure of something fake (FKE-3), had high kurtosis (3.75) and 31 outliers. Outliers can distort the relationship between two variables and pull the correlation in their direction. This variable also appears to have non-linear relationships with the other variables; Pearson’s r can underestimate the strength of a non-linear relationship and misrepresent its direction. Any correlations involving my fake measure should therefore be interpreted with caution.
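
If a variable looks non-normal or its relationships look non-linear, one common fallback (not required here) is a rank-based correlation, which corr.test() supports through its method argument:

# Spearman's rho is less sensitive to outliers and to monotonic non-linearity
corr.test(cont, method = "spearman")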

6 Create a Correlation Matrix

corr_output_m <- corr.test(cont)
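
# the apaTables package we loaded earlier can build an APA-style correlation
# table like Table 1 below. a minimal sketch — the filename is just an example:
apa.cor.table(cont, filename = "Table1_APA.doc", table.number = 1)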

7 View Test Output

corr_output_m
## Call:corr.test(x = cont)
## Correlation matrix 
##       pss   phq   rse
## pss  1.00  0.75 -0.74
## phq  0.75  1.00 -0.75
## rse -0.74 -0.75  1.00
## Sample Size 
## [1] 1250
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##     pss phq rse
## pss   0   0   0
## phq   0   0   0
## rse   0   0   0
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option
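
As the last line of the output suggests, printing with short=FALSE adds the confidence intervals for each correlation:

print(corr_output_m, short = FALSE)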

8 Write Up Results

To test my hypothesis that stress (measured by the PSS-4), depression symptoms (measured by the PHQ-9), and self-esteem (measured by the RSE-10) would be correlated with one another, I calculated a series of Pearson’s correlation coefficients. My data met the assumptions of the test: all three variables were normally distributed and contained no outliers.

As predicted, I found that all three variables were correlated. The effect sizes of all correlations were large (rs > .5; Cohen, 1988). This test also supported my second hypothesis, that self-esteem would be lower in participants who are higher in stress or who report more symptoms of depression, as can be seen from the correlation coefficients reported in Table 1.

Table 1: Means, standard deviations, and correlations with confidence intervals

Variable                         M     SD    1             2
1. Perceived stress (PSS-4)      2.93  0.95
2. Depression symptoms (PHQ-9)   2.07  0.86  .75**
                                             [.72, .77]
3. Self-esteem (RSE-10)          2.63  0.71  -.74**        -.75**
                                             [-.76, -.71]  [-.77, -.72]

Note: M and SD are used to represent mean and standard deviation, respectively. Values in square brackets indicate the 95% confidence interval for each correlation. The confidence interval is a plausible range of population correlations that could have caused the sample correlation. * indicates p < .05. ** indicates p < .01.

9 Loading Libraries

library(psych) # for the describe() command
library(broom) # for the augment() command
library(ggplot2) # to visualize our results
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

10 Importing Data

# import the dataset you cleaned previously
# this will be the dataset you'll use throughout the rest of the semester
# use the same ARC dataset you cleaned previously
d <- read.csv(file="Data/arc_clean.csv", header=T)

11 State Your Hypothesis

I hypothesize that PSS will significantly predict PHQ, and that the relationship will be positive.

12 Check Your Variables

# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again and be sure that everything is correct
str(d)
## 'data.frame':    1250 obs. of  6 variables:
##  $ X           : int  1 20 30 31 33 57 68 81 86 104 ...
##  $ gender_rc   : chr  "f" "m" "f" "f" ...
##  $ ethnicity_rc: chr  "white" "white" "white" "white" ...
##  $ pss         : num  3.25 3.75 1 3.25 2 4 3.75 1.25 2.5 2.5 ...
##  $ phq         : num  1.33 3.33 1 2.33 1.11 ...
##  $ rse         : num  2.3 1.6 3.9 1.7 3.9 1.8 1.3 3.5 2.6 3 ...
# you can use the describe() command on an entire dataframe (d) or just on a single variable
describe(d)
##               vars    n    mean      sd  median trimmed     mad min  max range
## X                1 1250 4718.87 2589.27 4872.00 4775.33 3342.52   1 8867  8866
## gender_rc*       2 1250    1.21    0.47    1.00    1.10    0.00   1    3     2
## ethnicity_rc*    3 1250    6.00    2.05    7.00    6.50    0.00   1    7     6
## pss              4 1250    2.93    0.95    3.00    2.93    1.11   1    5     4
## phq              5 1250    2.07    0.86    1.89    1.99    0.99   1    4     3
## rse              6 1250    2.63    0.71    2.70    2.65    0.74   1    4     3
##                skew kurtosis    se
## X             -0.14    -1.22 73.24
## gender_rc*     2.16     3.95  0.01
## ethnicity_rc* -1.76     1.38  0.06
## pss            0.09    -0.74  0.03
## phq            0.64    -0.69  0.02
## rse           -0.22    -0.70  0.02
# variables marked with * in the describe() output are character variables that were numerically coded on the fly, so ignore their means and SDs
# also use histograms to examine your continuous variables
hist(d$pss)

hist(d$phq)

# last, use scatterplots to examine your continuous variables together
plot(d$pss, d$phq)

13 Run a Simple Regression

# to calculate standardized coefficients, we have to standardize our IV
d$pss_std <- scale(d$pss, center=T, scale=T)
hist(d$pss_std)

# use the lm() command to run the regression
# dependent/outcome variable on the left, independent/predictor variable on the right
reg_model <- lm(phq ~ pss_std, data = d)
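
# note: because only the predictor is standardized, the slope will be in raw
# PHQ units per standard deviation of PSS. if you wanted a fully standardized
# coefficient (beta), one option is to standardize the outcome as well:
reg_model_z <- lm(scale(phq) ~ scale(pss), data = d)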

14 Check Your Assumptions

14.1 Simple Regression Assumptions

  • Should have two measurements for each participant
  • Variables should be continuous and normally distributed
  • Outliers should be identified and removed
  • Relationship between the variables should be linear
  • Residuals should be normal and have constant variance

Note: we will not be evaluating whether our data meet these assumptions in this lab/homework – we’ll come back to them next week when we talk about multiple linear regression.

14.2 Create plots and view residuals

model.diag.metrics <- augment(reg_model)
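
# augment() returns the model's data plus diagnostic columns such as .fitted
# (predicted values) and .resid (residuals), which the plot below uses
head(model.diag.metrics)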

ggplot(model.diag.metrics, aes(x = pss_std, y = phq)) +
  geom_point() +
  stat_smooth(method = lm, se = FALSE) +
  geom_segment(aes(xend = pss_std, yend = .fitted), color = "red", linewidth = 0.3)
## `geom_smooth()` using formula = 'y ~ x'

# Residuals vs Fitted
plot(reg_model, 1)

# Cook's distance
plot(reg_model, 4)

# Residuals vs Leverage
plot(reg_model, 5)
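
# although we won't formally evaluate the residual assumptions this week, you
# can also inspect the residuals directly with a quick histogram:
hist(resid(reg_model))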

15 View Test Output

summary(reg_model)
## 
## Call:
## lm(formula = phq ~ pss_std, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.53002 -0.39186 -0.04838  0.34727  2.56273 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.07298    0.01614  128.47   <2e-16 ***
## pss_std      0.64600    0.01614   40.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5705 on 1248 degrees of freedom
## Multiple R-squared:  0.562,  Adjusted R-squared:  0.5617 
## F-statistic:  1602 on 1 and 1248 DF,  p-value: < 2.2e-16
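
# if you want 95% confidence intervals for the coefficients, base R provides them:
confint(reg_model)
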
# note for section below: the symbol for a standardized coefficient is the Greek lowercase beta (β). Alt+225 types ß (the German sharp s), which looks similar but is a different character, so the safest option is to copy/paste β from a character map or another document

16 Write Up Results

To test our hypothesis that PSS would significantly predict PHQ, and that the relationship would be positive, we used a simple linear regression to model the relationship between the variables. We confirmed that our data met the assumptions of a linear regression, checking the linearity of the relationship using a Residuals vs Fitted plot and checking for outliers using Cook’s distance and a Residuals vs Leverage plot. As predicted, stress significantly and positively predicted depression symptoms: each one-standard-deviation increase in PSS was associated with a 0.65-point increase in PHQ, t(1248) = 40.02, p < .001, and stress explained about 56% of the variance in depression symptoms (R² = .56).

References

Cohen, J. (1988). Statistical power analysis for the behavioral sciences. New York, NY: Routledge Academic.
