1 Correlation: Loading Libraries

library(psych) # for the describe() command and the corr.test() command
library(apaTables) # to create our correlation table
library(kableExtra) # to create our correlation table
library(broom) # for the augment() command
library(ggplot2) # to visualize our results

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

library(expss)   # for the cross_cases() command

## Loading required package: maditr

## 
## To aggregate data: take(mtcars, mean_mpg = mean(mpg), by = am)

## 
## Attaching package: 'expss'

## The following object is masked from 'package:ggplot2':
## 
##     vars

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:expss':
## 
##     compute, contains, na_if, recode, vars, where

## The following objects are masked from 'package:maditr':
## 
##     between, coalesce, first, last

## The following object is masked from 'package:kableExtra':
## 
##     group_rows

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(effsize) # for the cohen.d() command

## 
## Attaching package: 'effsize'

## The following object is masked from 'package:psych':
## 
##     cohen.d

2 Importing Data

# import the dataset you cleaned previously
# this will be the dataset you'll use throughout the rest of the semester
# use the ARC data downloaded previously for lab
d <- read.csv(file="EAMMi2clean.csv", header=TRUE)
str(d)

## 'data.frame':    3182 obs. of  7 variables:
##  $ ResponseId: chr  "R_BJN3bQqi1zUMid3" "R_2TGbiBXmAtxywsD" "R_12G7bIqN2wB2N65" "R_39pldNoon8CePfP" ...
##  $ politics  : int  2 1 2 8 1 8 4 2 8 4 ...
##  $ sex       : int  2 1 1 2 1 2 2 2 2 2 ...
##  $ moa       : num  3.2 3.1 3.05 2.3 3.1 3.35 3.65 3.7 3.55 2.95 ...
##  $ swb       : num  4.33 4.17 1.83 5.17 3.67 ...
##  $ stress    : num  3.3 3.6 3.3 3.2 3.5 2.9 3.2 3 2.9 3.2 ...
##  $ mindful   : num  2.4 1.8 2.2 2.2 3.2 ...

# Check the imported data (the dataframe is named d; `data` is a built-in R function)
head(d)

3 Correlation: State Hypothesis

I predict that markers of adulthood (measured by the MOA Achievement subscales), perceived stress (measured by the Perceived Stress questionnaire), mindfulness (measured with a 15-item questionnaire), and subjective well-being (measured by the SWLS) will exhibit significant correlations with each other. Further, I hypothesize that participants who score higher in subjective well-being will also display higher levels of markers of adulthood.

4 Correlation: Check Variables

# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again and be sure that everything is correct
str(d)

## 'data.frame':    3182 obs. of  7 variables:
##  $ ResponseId: chr  "R_BJN3bQqi1zUMid3" "R_2TGbiBXmAtxywsD" "R_12G7bIqN2wB2N65" "R_39pldNoon8CePfP" ...
##  $ politics  : int  2 1 2 8 1 8 4 2 8 4 ...
##  $ sex       : int  2 1 1 2 1 2 2 2 2 2 ...
##  $ moa       : num  3.2 3.1 3.05 2.3 3.1 3.35 3.65 3.7 3.55 2.95 ...
##  $ swb       : num  4.33 4.17 1.83 5.17 3.67 ...
##  $ stress    : num  3.3 3.6 3.3 3.2 3.5 2.9 3.2 3 2.9 3.2 ...
##  $ mindful   : num  2.4 1.8 2.2 2.2 3.2 ...

# since we're focusing on our continuous variables, we're going to subset them into their own dataframe. this will make some stuff we're doing later easier.
cont <- subset(d, select=c(moa, stress, mindful, swb))

# you can use the describe() command on an entire dataframe (d) or just on a single variable (d$stress)
describe(cont)

##         vars    n mean   sd median trimmed  mad  min max range  skew kurtosis
## moa        1 3006 3.28 0.45   3.35    3.31 0.44 1.30   4  2.70 -0.66     0.07
## stress     2 3175 3.27 0.41   3.30    3.26 0.44 1.00   5  4.00 -0.16     2.67
## mindful    3 3173 3.71 0.84   3.73    3.71 0.79 1.13   6  4.87 -0.06    -0.13
## swb        4 3178 4.47 1.32   4.67    4.53 1.48 1.00   7  6.00 -0.36    -0.46
##           se
## moa     0.01
## stress  0.01
## mindful 0.01
## swb     0.02

# the stress variable has high kurtosis (2.67), which I'll ignore for now. you don't need to discuss univariate normality in the results write-ups for the labs/homework, but you will need to discuss it in your final manuscript

# also use histograms to examine your continuous variables
hist(d$moa)

hist(d$stress)

hist(d$mindful)

hist(d$swb)

# last, use scatterplots to examine your continuous variables together
plot(d$moa, d$stress)

plot(d$moa, d$mindful)

plot(d$moa, d$swb)

plot(d$stress, d$mindful)

plot(d$stress, d$swb)

plot(d$mindful, d$swb)

5 Correlation: Check Assumptions

5.1 Pearson’s Correlation Coefficient Assumptions

Should have two measurements for each participant
Variables should be continuous and normally distributed
Outliers should be identified and removed
Relationship between the variables should be linear

5.1.1 Checking for Outliers

d <- na.omit(d)

d$moa_std <- scale(d$moa, center=T, scale=T)
hist(d$moa_std)

sum(d$moa_std < -5 | d$moa_std > 3)

## [1] 0

d$stress_std <- scale(d$stress, center=T, scale=T)
hist(d$stress_std)

sum(d$stress_std < -4 | d$stress_std > 4)

## [1] 13

d$mindful_std <- scale(d$mindful, center=T, scale=T)
hist(d$mindful_std)

sum(d$mindful_std < -4 | d$mindful_std > 3)

## [1] 0

d$swb_std <- scale(d$swb, center=T, scale=T)
hist(d$swb_std)

sum(d$swb_std < -3 | d$swb_std > 2)

## [1] 0
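The cutoffs used above differ from variable to variable (for example, -5 and 3 for moa but -4 and 4 for stress), which makes the outlier counts hard to compare. A minimal sketch of applying one symmetric cutoff to every variable follows, assuming the conventional |z| > 3 rule. The helper name `count_outliers` and the simulated data frame `demo` are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: count cases whose z-score falls outside +/- cutoff
count_outliers <- function(x, cutoff = 3) {
  z <- as.numeric(scale(x))            # scale() returns a one-column matrix; flatten it
  sum(abs(z) > cutoff, na.rm = TRUE)   # missing values are not counted as outliers
}

# Simulated data, used only so the sketch runs on its own
set.seed(1)
demo <- data.frame(
  a = c(rnorm(200), 10),   # one planted extreme value
  b = rnorm(201)
)

# Apply the same rule to every column
sapply(demo, count_outliers)
```

With the real data, the same call could be pointed at the `cont` dataframe, for example `sapply(cont, count_outliers)`.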

describe(cont)

##         vars    n mean   sd median trimmed  mad  min max range  skew kurtosis
## moa        1 3006 3.28 0.45   3.35    3.31 0.44 1.30   4  2.70 -0.66     0.07
## stress     2 3175 3.27 0.41   3.30    3.26 0.44 1.00   5  4.00 -0.16     2.67
## mindful    3 3173 3.71 0.84   3.73    3.71 0.79 1.13   6  4.87 -0.06    -0.13
## swb        4 3178 4.47 1.32   4.67    4.53 1.48 1.00   7  6.00 -0.36    -0.46
##           se
## moa     0.01
## stress  0.01
## mindful 0.01
## swb     0.02

5.2 Correlation: Issues with My Data

With the exception of perceived stress, the continuous variables met the assumptions required for calculating Pearson’s correlation coefficient. Outliers were identified for perceived stress, but they were not removed.

The variable for perceived stress (stress) exhibited 13 outliers. These outliers have the potential to distort the relationship between perceived stress and the other variables, thereby biasing the correlation coefficients. Furthermore, the scatterplots suggested non-linear associations between perceived stress and the other variables.

Pearson’s correlation coefficient r describes only the linear part of a relationship, so it may underestimate the strength of a non-linear relationship or misrepresent its direction. Consequently, any correlations involving perceived stress will be interpreted with caution.
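To illustrate how Pearson’s r can understate a relationship that is consistent but not linear, the sketch below compares it with Spearman’s rank correlation on simulated data. Spearman’s coefficient is swapped in here purely as a point of comparison; it was not used in the analysis above, and the variables `x` and `y` are made up.

```r
# Simulated, strictly increasing but curved relationship (not the EAMMi2 data)
x <- 1:50
y <- exp(x / 5)

pearson  <- cor(x, y, method = "pearson")   # measures the linear association only
spearman <- cor(x, y, method = "spearman")  # rank-based; sensitive to any monotonic trend

round(c(pearson = pearson, spearman = spearman), 2)
```

A large gap between the two coefficients is one informal sign that the linearity assumption deserves a closer look.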

6 Create a Correlation Matrix

corr_output_m <- corr.test(cont)

7 View Test Output

corr_output_m

## Call:corr.test(x = cont)
## Correlation matrix 
##          moa stress mindful   swb
## moa     1.00   0.06    0.08  0.18
## stress  0.06   1.00   -0.25 -0.12
## mindful 0.08  -0.25    1.00  0.29
## swb     0.18  -0.12    0.29  1.00
## Sample Size 
##          moa stress mindful  swb
## moa     3006   3002    3001 3004
## stress  3002   3175    3168 3174
## mindful 3001   3168    3173 3172
## swb     3004   3174    3172 3178
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##         moa stress mindful swb
## moa       0      0       0   0
## stress    0      0       0   0
## mindful   0      0       0   0
## swb       0      0       0   0
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option
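The probability matrix above prints 0 because the values are rounded, and an APA-style write-up also needs degrees of freedom, which for a Pearson correlation are n - 2 for each pair of variables. The sketch below uses base R’s `cor.test()` on simulated data to show where those numbers come from; the variables `x` and `y` are hypothetical.

```r
# Simulated pair of variables, used only so the sketch runs on its own
set.seed(42)
n <- 30
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

res <- cor.test(x, y)   # Pearson test for a single pair

res$estimate    # the correlation coefficient r
res$parameter   # degrees of freedom, equal to n - 2
res$p.value     # unrounded p-value
res$conf.int    # 95% confidence interval for r

# The same rule applied to the pairwise sample sizes reported above,
# e.g. moa and swb share 3004 complete cases:
3004 - 2
```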

8 Correlation: Results

To test my hypothesis that markers of adulthood (measured by the MOA Achievement subscales), perceived stress (measured by the Perceived Stress questionnaire), mindfulness (measured with a 15-item questionnaire), and subjective well-being (measured by the SWLS) would be correlated with one another, I computed a series of Pearson’s correlation coefficients (see Table 1). Most of my data met the assumptions of the test, including normality. The exception was the perceived stress variable (stress), which had 13 outliers and a leptokurtic distribution (kurtosis = 2.67), meaning heavier tails and a sharper peak than a normal distribution. Consequently, any relationships involving perceived stress will be interpreted with extra care.

The correlation coefficients in Table 1 partially support my hypothesis. Specifically, there is a small positive relationship (r(3002) = .18, p < .001) between markers of adulthood (MOA) and subjective well-being (SWB), indicating that higher scores in markers of adulthood are associated with higher levels of subjective well-being.

Additionally, there is a small-to-moderate positive relationship (r(3170) = .29, p < .001) between mindfulness and subjective well-being, indicating that higher levels of mindfulness are associated with higher levels of subjective well-being.

The correlations between markers of adulthood and perceived stress, and between perceived stress and subjective well-being, are statistically significant but weak (r(3000) = .06, p < .01, and r(3172) = -.12, p < .001, respectively), so they provide little evidence of a practically meaningful association.

Furthermore, a small-to-moderate negative relationship (r(3166) = -.25, p < .001) is observed between perceived stress and mindfulness, suggesting that higher levels of perceived stress are associated with lower levels of mindfulness.

As predicted, all of the correlations among my continuous variables were statistically significant (all ps < .01). The effect sizes were weak to moderate at most (all |r|s < .30; Cohen, 1988). This test also supported my second hypothesis, that markers of adulthood would be higher in participants who score higher in subjective well-being, as can be seen by the correlation coefficients reported in Table 1.
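Because the interpretation leans on Cohen’s (1988) benchmarks for r (.10 small, .30 medium, .50 large), a small helper can apply them consistently to every coefficient. This is only a sketch: the function name `label_r` and the label wording are hypothetical, and the coefficients are copied from the correlation matrix above.

```r
# Hypothetical helper: label the size of a correlation using Cohen's (1988) benchmarks
label_r <- function(r) {
  cut(abs(r),
      breaks = c(0, 0.10, 0.30, 0.50, Inf),
      labels = c("below small", "small", "medium", "large"),
      right  = FALSE)   # intervals are [lower, upper), so .30 itself counts as medium
}

# Coefficients taken from the correlation matrix reported above
r_values <- c(moa_stress = .06, moa_mindful = .08, moa_swb = .18,
              stress_swb = -.12, stress_mindful = -.25, mindful_swb = .29)

data.frame(r = r_values, size = label_r(r_values))
```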

Table 1: Means, standard deviations, and correlations with confidence intervals

Variable                          M     SD    1            2             3
1. Markers of adulthood (MOA)     3.28  0.45
2. Perceived stress (stress)      3.27  0.41  .06**
                                              [.02, .09]
3. Mindfulness (mindful)          3.71  0.84  .08**        -.25**
                                              [.05, .12]   [-.29, -.22]
4. Subjective well-being (SWB)    4.47  1.32  .18**        -.12**        .29**
                                              [.14, .21]   [-.15, -.09]  [.25, .32]

Note:
M and SD are used to represent mean and standard deviation, respectively. Values in square brackets indicate the 95% confidence interval. The confidence interval is a plausible range of population correlations that could have caused the sample correlation.
* indicates p < .05
** indicates p < .01.

9 Simple Regression: Hypothesis

I hypothesize that markers of adulthood (measured by the MOA Achievement subscales) will significantly predict mindfulness (measured with a 15-item scale). Specifically, I anticipate a positive relationship, assuming that greater importance assigned to markers of adulthood will correspond with higher scores of mindfulness.

10 Simple Regression: Check Variables

# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again and be sure that everything is correct
str(d)

## 'data.frame':    2990 obs. of  11 variables:
##  $ ResponseId : chr  "R_BJN3bQqi1zUMid3" "R_2TGbiBXmAtxywsD" "R_12G7bIqN2wB2N65" "R_39pldNoon8CePfP" ...
##  $ politics   : int  2 1 2 8 1 8 4 2 8 4 ...
##  $ sex        : int  2 1 1 2 1 2 2 2 2 2 ...
##  $ moa        : num  3.2 3.1 3.05 2.3 3.1 3.35 3.65 3.7 3.55 2.95 ...
##  $ swb        : num  4.33 4.17 1.83 5.17 3.67 ...
##  $ stress     : num  3.3 3.6 3.3 3.2 3.5 2.9 3.2 3 2.9 3.2 ...
##  $ mindful    : num  2.4 1.8 2.2 2.2 3.2 ...
##  $ moa_std    : num [1:2990, 1] -0.167 -0.39 -0.501 -2.171 -0.39 ...
##   ..- attr(*, "scaled:center")= num 3.28
##   ..- attr(*, "scaled:scale")= num 0.449
##  $ stress_std : num [1:2990, 1] 0.0791 0.8207 0.0791 -0.1681 0.5735 ...
##   ..- attr(*, "scaled:center")= num 3.27
##   ..- attr(*, "scaled:scale")= num 0.405
##  $ mindful_std: num [1:2990, 1] -1.56 -2.277 -1.799 -1.799 -0.605 ...
##   ..- attr(*, "scaled:center")= num 3.71
##   ..- attr(*, "scaled:scale")= num 0.837
##  $ swb_std    : num [1:2990, 1] -0.104 -0.23 -1.989 0.524 -0.607 ...
##   ..- attr(*, "scaled:center")= num 4.47
##   ..- attr(*, "scaled:scale")= num 1.33
##  - attr(*, "na.action")= 'omit' Named int [1:192] 21 38 39 46 48 53 126 168 190 194 ...
##   ..- attr(*, "names")= chr [1:192] "21" "38" "39" "46" ...

# you can use the describe() command on an entire dataframe (d) or just on a single variable
describe(d)

##             vars    n    mean     sd  median trimmed     mad   min     max
## ResponseId*    1 2990 1495.50 863.28 1495.50 1495.50 1108.24  1.00 2990.00
## politics       2 2990    4.19   2.24    4.00    4.08    2.97  1.00    8.00
## sex            3 2990    1.77   0.46    2.00    1.81    0.00  1.00    3.00
## moa            4 2990    3.28   0.45    3.35    3.31    0.44  1.30    4.00
## swb            5 2990    4.47   1.33    4.67    4.53    1.48  1.00    7.00
## stress         6 2990    3.27   0.40    3.30    3.26    0.44  1.00    5.00
## mindful        7 2990    3.71   0.84    3.73    3.71    0.79  1.13    6.00
## moa_std        8 2990    0.00   1.00    0.17    0.07    0.99 -4.40    1.61
## stress_std     9 2990    0.00   1.00    0.08   -0.01    1.10 -5.61    4.28
## mindful_std   10 2990    0.00   1.00    0.03    0.01    0.94 -3.07    2.74
## swb_std       11 2990    0.00   1.00    0.15    0.04    1.12 -2.62    1.91
##               range  skew kurtosis    se
## ResponseId* 2989.00  0.00    -1.20 15.79
## politics       7.00  0.44    -1.01  0.04
## sex            2.00 -0.72    -0.17  0.01
## moa            2.70 -0.66     0.06  0.01
## swb            6.00 -0.37    -0.45  0.02
## stress         4.00 -0.12     2.56  0.01
## mindful        4.87 -0.07    -0.12  0.02
## moa_std        6.01 -0.66     0.06  0.02
## stress_std     9.89 -0.12     2.56  0.02
## mindful_std    5.81 -0.07    -0.12  0.02
## swb_std        4.52 -0.37    -0.45  0.02

# also use histograms to examine your continuous variables
hist(d$moa)

hist(d$mindful)

# last, use scatterplots to examine your continuous variables together
plot(d$moa, d$mindful)

11 Run a Simple Regression

# standardize the IV so the slope is expressed per one-SD change in moa (the DV stays in its original units)
d$moa_std <- scale(d$moa, center=T, scale=T)

hist(d$moa_std)

# use the lm() command to run the regression
# dependent/outcome variable on the left, independent/predictor variable on the right
reg_model <- lm(mindful ~ moa_std, data = d)
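Standardizing only the predictor gives a slope in raw units of the outcome per one standard deviation of the predictor. A fully standardized coefficient requires standardizing the outcome as well, and in a simple regression it then equals Pearson’s r. The sketch below shows the difference on simulated data; the names `demo`, `x_std`, and `y_std` are hypothetical.

```r
# Simulated data, used only so the sketch runs on its own
set.seed(7)
x <- rnorm(100)
y <- 0.3 * x + rnorm(100)
demo <- data.frame(
  y     = y,
  x_std = as.numeric(scale(x)),   # standardized predictor
  y_std = as.numeric(scale(y))    # standardized outcome
)

# Predictor standardized only: slope is in raw units of y per 1 SD of x
b_semi <- coef(lm(y ~ x_std, data = demo))[["x_std"]]

# Both variables standardized: slope is the standardized beta
beta <- coef(lm(y_std ~ x_std, data = demo))[["x_std"]]

c(b_semi = b_semi, beta = beta, r = cor(x, y))
```

The two slopes differ by a factor of the outcome’s standard deviation, so they are close only when that standard deviation is near 1.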

12 Simple Regression: Check Assumptions

12.1 Assumptions

12.2 Simple Regression: Plots and Residuals

model.diag.metrics <- augment(reg_model)

ggplot(model.diag.metrics, aes(x = moa_std, y = mindful)) +
  geom_point() +
  stat_smooth(method = lm, se = FALSE) +
  geom_segment(aes(xend = moa_std, yend = .fitted), color = "red", linewidth = 0.3)

## `geom_smooth()` using formula = 'y ~ x'

12.3 Simple Regression: Linearity with Residuals vs Fitted Plot

This plot (below) shows the residuals for each case plotted against the fitted values. The red line is the average residual at each fitted value of the mindfulness variable. The plot shows some non-linearity: in places there are more data points above the zero line than below it, so some positive residuals have no negative residuals to cancel them out. However, a bit of deviation is okay, and the deviation here is small enough that non-normality or non-linearity is not a critical issue.

The cases labeled on the plot (row numbers 1572, 2001, and 1157) have the largest residuals, which marks them as potential outliers.

To summarize, my plot suggests there is some minor non-linearity, but overall the assumption of linearity is reasonably met.

plot(reg_model, 1)

12.4 Simple Regression: Checking for Outliers

The plots below both address leverage, or how much each data point is able to influence the regression line.

The first plot shows Cook’s distance calculated for each case in the data frame. Cook’s distance tells us how much the regression would change if the point were removed. Ideally, I want all points to have the same influence on the regression line, although I accept that there will be some variability. The cutoff for a high Cook’s distance score is .5. For my data, some points do exert more influence than others, but they’re generally equal.

The second plot also includes the residuals in the examination of leverage. The standardized residuals are on the y-axis and leverage is on the x-axis; this shows me which points have high residuals or leverage. Points that have large residuals and high leverage are especially worrisome, because they are far from the regression line but are also exerting a large influence on it. The red line indicates the average residual across points with the same amount of leverage. I want this line to stay as close to the mean/zero line as possible.

My data doesn’t have any severe outliers.

# Cook's distance
plot(reg_model, 4)

# Residuals vs Leverage
plot(reg_model, 5)
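The .5 cutoff described above can also be checked numerically rather than by eye. The sketch below does this with base R’s `cooks.distance()` on a small made-up dataset with one planted influential point; the objects `x`, `y`, `fit`, and `flagged` are hypothetical and are not the homework data.

```r
# Made-up data: 20 points on a line plus one case far from the rest
x <- c(1:20, 60)
y <- c(1:20, 0)

fit <- lm(y ~ x)

cd <- cooks.distance(fit)    # one Cook's distance per case
flagged <- which(cd > 0.5)   # cases above the .5 cutoff used in this write-up

flagged
round(cd[flagged], 2)
```

For the model in this document, the equivalent check would be `which(cooks.distance(reg_model) > 0.5)`.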

12.5 Simple Regression: Issues with My Data

Before interpreting my results, I assessed my variables to see if they met the assumptions for a simple linear regression. Analysis of a Residuals vs Fitted plot suggested that there is some minor non-linearity, but not enough to violate the assumption of linearity. I also checked Cook’s distance and a Residuals vs Leverage plot to detect outliers. Large residuals and above-average leverage were identified for participants 819, 211, and 2850, but none of these cases was a severe outlier.

13 View Test Output

summary(reg_model)

## 
## Call:
## lm(formula = mindful ~ moa_std, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5688 -0.5336  0.0076  0.5645  2.3530 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.70606    0.01526 242.934  < 2e-16 ***
## moa_std      0.07069    0.01526   4.633 3.75e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8342 on 2988 degrees of freedom
## Multiple R-squared:  0.007133,   Adjusted R-squared:  0.006801 
## F-statistic: 21.47 on 1 and 2988 DF,  p-value: 3.754e-06
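A few quick checks help translate this output into a write-up. In a simple regression the squared t value of the slope equals F, the square root of R² is the magnitude of the standardized slope, and Cohen’s f² = R² / (1 - R²) can be compared against his benchmarks (.02 small, .15 medium, .35 large). The sketch below only redoes that arithmetic with the numbers printed above; the variable names are hypothetical.

```r
# Values copied from the summary() output above
t_value  <- 4.633      # t for the moa_std slope
F_value  <- 21.47      # overall F on 1 and 2988 degrees of freedom
r2       <- 0.007133   # multiple R-squared
df_resid <- 2988       # residual degrees of freedom, used for both t and F

t_value^2              # should be close to F_value
sqrt(r2)               # magnitude of the fully standardized slope

f2 <- r2 / (1 - r2)    # Cohen's f-squared effect size
f2
f2 < 0.02              # TRUE means the effect is below the "small" benchmark
```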

# note for section below: to type lowercase Beta below (ß) you need to hold down Alt key and type 225 on numeric keypad. If that doesn't work you should be able to copy/paste it from somewhere else

14 Simple Regression: Results

To test my hypothesis that markers of adulthood (measured by the MOA Achievement subscales) will significantly predict mindfulness (measured by a 15-item questionnaire), and that the relationship will be positive, I used a simple linear regression to model the relationship between the two continuous variables. I confirmed that my data met the assumptions of a linear regression, checking the linearity of the relationship using a Residuals vs Fitted plot and checking for outliers using Cook’s distance and a Residuals vs Leverage plot. A residual standard error of 0.8342 (df = 2988) was found.

As predicted, the coefficient for markers of adulthood is statistically significant (p < .001), suggesting that there is a significant relationship between markers of adulthood and mindfulness. The adjusted R² value is .007, indicating that approximately 0.7% of the variance in mindfulness can be explained by the linear relationship with markers of adulthood. The overall model was also significant, F(1, 2988) = 21.47, p < .001, indicating that the model explains significantly more variance than would be expected by chance.

The relationship between markers of adulthood and mindfulness was positive, b = 0.07, t(2988) = 4.63, p < .001 (refer to Figure 1): a one-standard-deviation increase in markers of adulthood predicts a 0.07-point increase in mindfulness. Because only the predictor was standardized, the fully standardized coefficient is ß = .08, the square root of R². With less than 1% of the variance explained, this constitutes a very small effect, below the benchmark for a small effect (Cohen, 1988).

While the correlation between markers of adulthood and mindfulness may not be as strong as initially predicted, a statistically significant relationship between the variables is still present.

References

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. New York, NY: Routledge Academic.

Correlation & Regression HW

Luke McKissock

2023-06-06