Loading Libraries

library(psych) # for the describe() command and the corr.test() command
library(apaTables) # to create our correlation table
library(kableExtra) # to create our correlation table

Importing Data

# import the dataset you cleaned previously
# this will be the dataset you'll use throughout the rest of the semester
# use the ARC data downloaded previously for lab
d <- read.csv(file="Data/arc_test.csv", header=T)

State Your Hypothesis

We predict that neurotic personality symptoms (measured by the Big 5), depression symptoms (measured by the PHQ-9), worrisome COVID thoughts (measured by pas_covid), and parental support (measured by support_parents) will all be correlated with each other. Furthermore, we predict that depressive symptoms will be lower in participants who are higher in support_parents or who report more symptoms of neurotic personality.

Check Your Variables

# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again and be sure that everything is correct
str(d)
## 'data.frame':    2073 obs. of  7 variables:
##  $ X              : int  1 20 30 31 32 33 48 49 57 58 ...
##  $ group          : chr  "parent" "young person" "young person" "parent" ...
##  $ gender         : chr  "female" "male" "female" "female" ...
##  $ phq            : num  1.33 3.33 1 2.33 NA ...
##  $ support_parents: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ pas_covid      : num  3.22 4.56 3.33 4.22 NA ...
##  $ big_5          : num  4.67 5.33 5.33 5.33 NA ...
# since we're focusing on our continuous variables, we're going to subset them into their own dataframe. this will make some of our later steps easier.
cont <- subset(d, select=c(big_5, pas_covid, phq, support_parents))

# you can use the describe() command on an entire dataframe (d) or just on a single variable (d$phq)
describe(cont)
##                 vars    n mean   sd median trimmed  mad min max range  skew
## big_5              1 1749 4.59 1.04   4.67    4.64 0.99   1   7     6 -0.45
## pas_covid          2 1348 3.23 0.69   3.22    3.25 0.66   1   5     4 -0.18
## phq                3 1312 2.07 0.86   1.89    1.99 0.99   1   4     3  0.63
## support_parents    4  661 1.89 0.61   2.00    1.86 0.74   1   3     2  0.27
##                 kurtosis   se
## big_5              -0.08 0.02
## pas_covid           0.00 0.02
## phq                -0.69 0.02
## support_parents    -0.93 0.02
# none of our variables shows severely high kurtosis here (support_parents is the most extreme at -0.93), so I'll move on. you don't need to discuss univariate normality in the results write-ups for the labs/homework, but you will need to discuss it in your final manuscript
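If you want to scan just the normality statistics, the describe() output can be indexed like a dataframe; a minimal sketch:

# a sketch: pull only the skew and kurtosis columns from describe()
describe(cont)[, c("skew", "kurtosis")]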

# also use histograms to examine your continuous variables
hist(d$big_5)

hist(d$pas_covid)

hist(d$phq)

hist(d$support_parents)

# last, use scatterplots to examine your continuous variables together
plot(d$big_5, d$pas_covid)

plot(d$big_5, d$phq)

plot(d$big_5, d$support_parents)

plot(d$pas_covid, d$phq)

plot(d$pas_covid, d$support_parents)

plot(d$phq, d$support_parents)
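Base R's pairs() command draws all of these pairwise scatterplots in one grid, which can be quicker than calling plot() for each pair; a minimal sketch using the cont dataframe created above:

# a sketch: scatterplot matrix of all four continuous variables at once
pairs(cont)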

Check Your Assumptions

Pearson’s Correlation Coefficient Assumptions

  • Should have two measurements for each participant
  • Variables should be continuous and normally distributed
  • Outliers should be identified and removed
  • Relationship between the variables should be linear

Checking for Outliers

Note: You are not required to screen out outliers or take any action based on what you see here. This is something you will check and then discuss in your write-up.

d$big_5_std <- scale(d$big_5, center=T, scale=T)
str(d)
## 'data.frame':    2073 obs. of  8 variables:
##  $ X              : int  1 20 30 31 32 33 48 49 57 58 ...
##  $ group          : chr  "parent" "young person" "young person" "parent" ...
##  $ gender         : chr  "female" "male" "female" "female" ...
##  $ phq            : num  1.33 3.33 1 2.33 NA ...
##  $ support_parents: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ pas_covid      : num  3.22 4.56 3.33 4.22 NA ...
##  $ big_5          : num  4.67 5.33 5.33 5.33 NA ...
##  $ big_5_std      : num [1:2073, 1] 0.0704 0.7101 0.7101 0.7101 NA ...
##   ..- attr(*, "scaled:center")= num 4.59
##   ..- attr(*, "scaled:scale")= num 1.04
hist(d$big_5_std)

# use na.rm=T so missing values don't force sum() to return NA
sum(d$big_5_std < -3 | d$big_5_std > 3, na.rm=T)
d$pas_covid_std <- scale(d$pas_covid, center=T, scale=T)
hist(d$pas_covid_std)

sum(d$pas_covid_std < -3 | d$pas_covid_std > 3, na.rm=T)
d$phq_std <- scale(d$phq, center=T, scale=T)
hist(d$phq_std)

sum(d$phq_std < -3 | d$phq_std > 3, na.rm=T)
d$support_parents_std <- scale(d$support_parents, center=T, scale=T)
hist(d$support_parents_std)

sum(d$support_parents_std < -3 | d$support_parents_std > 3, na.rm=T)

Issues with My Data

All but one of my variables meet the assumptions of Pearson's correlation coefficient. Any correlations involving a variable with elevated kurtosis or outliers should be interpreted with caution.

Create a Correlation Matrix

corr_output_m <- corr.test(cont)

View Test Output

corr_output_m
## Call:corr.test(x = cont)
## Correlation matrix 
##                 big_5 pas_covid   phq support_parents
## big_5            1.00      0.30  0.42           -0.04
## pas_covid        0.30      1.00  0.31           -0.10
## phq              0.42      0.31  1.00           -0.47
## support_parents -0.04     -0.10 -0.47            1.00
## Sample Size 
##                 big_5 pas_covid  phq support_parents
## big_5            1749      1334 1297             615
## pas_covid        1334      1348 1273             451
## phq              1297      1273 1312             424
## support_parents   615       451  424             661
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##                 big_5 pas_covid phq support_parents
## big_5             0.0      0.00   0            0.30
## pas_covid         0.0      0.00   0            0.05
## phq               0.0      0.00   0            0.00
## support_parents   0.3      0.03   0            0.00
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option
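As the output notes, printing with short=FALSE displays the confidence interval around each correlation; a minimal sketch:

# print the full corr.test() output, including 95% confidence intervals
print(corr_output_m, short=FALSE)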

Write Up Results

To test our hypothesis that neurotic personality (measured by the Big 5), depression symptoms (measured by the PHQ-9), worry surrounding COVID (measured by pas_covid), and support from parents (measured by support_parents) would be correlated with one another, we calculated a series of Pearson's correlation coefficients. The data generally met the assumptions of the test (see the issues noted above). Contrary to our prediction, not all variables were significantly correlated: depression symptoms were positively correlated with neurotic personality (r = .42, p < .001) and with worry surrounding COVID (r = .31, p < .001), and negatively correlated with parental support (r = -.47, p < .001); worry surrounding COVID was also weakly negatively correlated with parental support (r = -.10, p = .03), but neurotic personality and parental support were not significantly correlated (r = -.04, p = .30). The significant effect sizes ranged from small to medium-to-large (Cohen, 1988). Our second hypothesis was partially supported: depressive symptoms were lower in participants who reported more parental support, but, contrary to our prediction, were higher in participants who reported more neurotic personality symptoms.

Table 1: Means, standard deviations, and correlations with confidence intervals

Variable                                 M     SD    1             2             3
1. Neurotic Personality (big_5)          4.59  1.04
2. Worry surrounding COVID (pas_covid)   3.23  0.69  .30**
                                                     [.25, .35]
3. Depression symptoms (PHQ-9)           2.07  0.86  .42**         .31**
                                                     [.37, .46]    [.25, .35]
4. Parental Support (support_parents)    1.89  0.61  -.04          -.10*         -.47**
                                                     [-.12, .04]   [-.20, -.01]  [-.54, -.39]

Note:
M and SD are used to represent mean and standard deviation, respectively. Values in square brackets indicate the 95% confidence interval. The confidence interval is a plausible range of population correlations that could have caused the sample correlation.
* indicates p < .05.
** indicates p < .01.
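A table like Table 1 can also be generated automatically with the apaTables package we loaded at the top; a minimal sketch (the filename is just an example):

# a sketch: write an APA-style correlation table, with CIs, to a Word document
apa.cor.table(cont, filename = "Table1_correlations.doc")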

References

Cohen, J. (1988). Statistical power analysis for the behavioral sciences. New York, NY: Routledge Academic.

Loading Libraries

library(psych) # for the describe() command
library(broom) # for the augment() command
library(ggplot2) # to visualize our results

Importing Data

# import the dataset you cleaned previously
# this will be the dataset you'll use throughout the rest of the semester
# use the ARC data downloaded previously for lab
d <- read.csv(file="Data/arc_test.csv", header=T)

State Your Hypothesis

We hypothesize that parental support (measured by support_parents) will significantly predict depressive symptoms (measured by the PHQ-9), and that the relationship will be negative.

Check Your Variables

# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again and be sure that everything is correct
str(d)
## 'data.frame':    2073 obs. of  7 variables:
##  $ X              : int  1 20 30 31 32 33 48 49 57 58 ...
##  $ group          : chr  "parent" "young person" "young person" "parent" ...
##  $ gender         : chr  "female" "male" "female" "female" ...
##  $ phq            : num  1.33 3.33 1 2.33 NA ...
##  $ support_parents: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ pas_covid      : num  3.22 4.56 3.33 4.22 NA ...
##  $ big_5          : num  4.67 5.33 5.33 5.33 NA ...
# you can use the describe() command on an entire dataframe (d) or just on a single variable
describe(d)
##                 vars    n    mean      sd  median trimmed     mad min  max
## X                  1 2073 4982.03 2586.74 5393.00 5104.65 3106.05   1 8867
## group*             2 2073    1.71    0.45    2.00    1.76    0.00   1    2
## gender*            3 1885    1.44    0.87    1.00    1.27    0.00   1    4
## phq                4 1312    2.07    0.86    1.89    1.99    0.99   1    4
## support_parents    5  661    1.89    0.61    2.00    1.86    0.74   1    3
## pas_covid          6 1348    3.23    0.69    3.22    3.25    0.66   1    5
## big_5              7 1749    4.59    1.04    4.67    4.64    0.99   1    7
##                 range  skew kurtosis    se
## X                8866 -0.33    -1.15 56.81
## group*              1 -0.91    -1.17  0.01
## gender*             3  1.61     0.98  0.02
## phq                 3  0.63    -0.69  0.02
## support_parents     2  0.27    -0.93  0.02
## pas_covid           4 -0.18     0.00  0.02
## big_5               6 -0.45    -0.08  0.02
# also use histograms to examine your continuous variables
hist(d$support_parents)

hist(d$phq)

# last, use scatterplots to examine your continuous variables together
plot(d$support_parents, d$phq)

Run a Simple Regression

# to calculate standardized coefficients, we have to standardize our IV
d$support_parents <- scale(d$support_parents, center=T, scale=T)
hist(d$support_parents)

# use the lm() command to run the regression
# dependent/outcome variable on the left, independent/predictor variable on the right
reg_model <- lm(phq ~ support_parents, data = d)
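Note that because only the predictor was standardized, the slope is in raw phq units per standard deviation of support_parents. One way to get a fully standardized coefficient is to scale both variables inside lm(); a minimal sketch (reg_model_std is just an illustrative name):

# a sketch: scale both outcome and predictor so the slope is a standardized beta
reg_model_std <- lm(scale(phq) ~ scale(support_parents), data = d)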

Check Your Assumptions

Simple Regression Assumptions

  • Should have two measurements for each participant
  • Variables should be continuous and normally distributed
  • Outliers should be identified and removed
  • Relationship between the variables should be linear
  • Residuals should be normal and have constant variance

Note: we will not be evaluating whether our data meet these assumptions in this lab/homework – we'll come back to them next week when we talk about multiple linear regression (see the sketch below for a preview).
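A minimal preview sketch, using base R's built-in diagnostic plots for lm objects:

# a sketch of next week's residual checks
plot(reg_model, 2)  # Normal Q-Q: residuals should fall along the line
plot(reg_model, 3)  # Scale-Location: spread should be roughly constant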

Create plots and view residuals

model.diag.metrics <- augment(reg_model)

ggplot(model.diag.metrics, aes(x = support_parents, y = phq)) +
  geom_point() +
  stat_smooth(method = lm, se = FALSE) +
  geom_segment(aes(xend = support_parents, yend = .fitted), color = "red", linewidth = 0.3)
## `geom_smooth()` using formula = 'y ~ x'

Check linearity with Residuals vs Fitted plot

Our plot suggests there is some non-linearity, resembling the problematic examples shown in the link above. While the fitted line is straight and shows an inverse (negative) relationship, the data points do not all lie close to the line.

plot(reg_model, 1)

Check for outliers

Our data seem to have a few potential outliers. The Cook's distance plot (plot 4) looks acceptable, but the Residuals vs Leverage plot (plot 5) begins to reveal the outlier issue: at least one case has a large residual and above-average leverage, approaching the red dashed boundary regions. These cases should be examined carefully before we rely on the model.

# Cook's distance
plot(reg_model, 4)

# Residuals vs Leverage
plot(reg_model, 5)
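To see which specific rows are driving these plots, the influence values can be listed directly; a minimal sketch:

# a sketch: the six largest Cook's distance values, labeled by row
head(sort(cooks.distance(reg_model), decreasing = TRUE))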

Issues with My Data

Before interpreting our results, we assessed our variables to see if they met the assumptions for a simple linear regression. Analysis of a Residuals vs Fitted plot suggested some non-linearity, which may be enough to violate the assumption of linearity. We also checked Cook's distance and a Residuals vs Leverage plot to detect outliers. One case had a large residual and above-average leverage but was below the recommended cutoff for Cook's distance.

View Test Output

summary(reg_model)
## 
## Call:
## lm(formula = phq ~ support_parents, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2035 -0.5617  0.0063  0.5743  2.0800 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.63405    0.03703   71.14   <2e-16 ***
## support_parents -0.39366    0.03624  -10.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7613 on 422 degrees of freedom
##   (1649 observations deleted due to missingness)
## Multiple R-squared:  0.2186, Adjusted R-squared:  0.2167 
## F-statistic:   118 on 1 and 422 DF,  p-value: < 2.2e-16
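Confidence intervals for the regression coefficients aren't shown by summary(), but base R can report them directly; a minimal sketch:

# a sketch: 95% confidence intervals for the intercept and slope
confint(reg_model)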
# note for section below: the standardized coefficient is reported with a lowercase beta (β). if you can't type it directly, copy/paste it from a character map or another document

Write Up Results

As predicted, we found that parental support significantly predicted depressive symptoms, Adj. R2 = .22, F(1, 422) = 118.0, p < .001. The relationship between parental support and depressive symptoms was negative, β = -.39, t(422) = -10.86, p < .001 (refer to Figure 1). According to Cohen (1988), this constitutes a medium-to-large effect size.

References

Cohen, J. (1988). Statistical power analysis for the behavioral sciences. New York, NY: Routledge Academic.