Correlation HW

Author

Michelle Karpinski

Loading Libraries

library(psych) # for the describe() command and the corr.test() command
library(apaTables) # to create our correlation table
library(kableExtra) # to create our correlation table

Importing Data

d <- read.csv(file="Data/mydata.csv", header=T)
# since we're focusing on our continuous variables, we drop the categorical ones (edu, income). this makes the later steps easier.
d <- subset(d, select=-c(edu, income))

State Your Hypotheses - PART OF YOUR WRITEUP

My Hypothesis: We predict that subjective well-being, efficacy, and exploitativeness will be positively correlated with one another, and that all three will be negatively correlated with stress.

Check Your Assumptions

Pearson’s Correlation Coefficient Assumptions

Should have two measurements for each participant for each variable (confirmed by earlier procedures – we dropped any participants with missing data)
Variables should be continuous and normally distributed, or assessments of the relationship may be inaccurate (will do below)
Outliers should be identified and removed, or results will be inaccurate (will do below)
The relationship between the variables should be linear, or it will not be detected (will do below)
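
The first assumption (no missing measurements) can be confirmed directly rather than taken on trust. The lines below are a minimal sketch on a small made-up data frame; `demo` and its values are hypothetical stand-ins for `d`.

```r
# Made-up example data with two incomplete rows (not the homework data)
demo <- data.frame(swb    = c(4.2, NA, 3.1),
                   stress = c(2.0, 1.5, NA))

# Count participants with at least one missing value
n_incomplete <- sum(!complete.cases(demo))
n_incomplete

# Keep only participants with a score on every variable
clean <- na.omit(demo)
nrow(clean)
```

Running `sum(!complete.cases(d))` on the real data should return 0 if the earlier cleaning steps worked as described.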

Checking for Outliers

Outliers can mask potential effects and cause Type II error (i.e., a false negative: you conclude there is no relationship when there really is one).

Note: You are not required to screen out outliers or take any action based on what you see here. This is something you will check and then discuss in your write-up.

# using the scale() command to standardize our variable, viewing a histogram, and then counting statistical outliers
#first variable
d$swb <- scale(d$swb, center=T, scale=T)
hist(d$swb)

sum(d$swb < -3 | d$swb > 3)

[1] 0

# 0!
#second variable
d$efficacy <- scale(d$efficacy, center=T, scale=T)
hist(d$efficacy)

sum(d$efficacy < -3 | d$efficacy > 3)

[1] 15

# 15 ): 
#third variable
d$exploit <- scale(d$exploit, center=T, scale=T)
hist(d$exploit)

sum(d$exploit < -3 | d$exploit > 3)

[1] 32

# 32 ):
#fourth variable
d$stress <- scale(d$stress, center=T, scale=T)
hist(d$stress)

sum(d$stress < -3 | d$stress > 3)

[1] 0

# 0!
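
The four checks above repeat the same steps, and they overwrite the raw scores with z-scores (which is why every mean is 0 and every SD is 1 later on). A possible alternative is sketched below: a small helper, `count_outliers()`, that counts values beyond a cutoff for every column at once and leaves the raw data untouched. The helper name and the `demo` data are hypothetical and not part of the assignment.

```r
# Hypothetical helper: count |z| > cutoff in each column without
# overwriting the raw scores
count_outliers <- function(df, cutoff = 3) {
  sapply(df, function(x) {
    z <- as.numeric(scale(x))          # z-scores as a plain numeric vector
    sum(abs(z) > cutoff, na.rm = TRUE) # number of statistical outliers
  })
}

# Made-up data: column a has one extreme value, column b has none
demo <- data.frame(a = c(rep(0, 99), 10),
                   b = seq(-1, 1, length.out = 100))

outlier_counts <- count_outliers(demo)
outlier_counts
```

With the real data, `count_outliers(d)` would give one count per variable, assuming `d` holds only the continuous variables.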

Checking for Linear Relationships

Non-linear relationships cannot be detected by Pearson’s correlation (the type of correlation we’re doing here). This means that you may underestimate the relationship between a pair of variables if they have a non-linear relationship, and thus your understanding of what’s happening in your data will be inaccurate.
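
A small illustration with made-up numbers (not the homework data) shows the problem. Below, `y` is completely determined by `x`, yet Pearson's r is zero because the relationship is U-shaped rather than linear.

```r
# Made-up data: y depends perfectly on x, but the relationship is curved
x <- -10:10
y <- x^2

r_curved <- cor(x, y)          # Pearson's r for the U-shaped relationship
r_linear <- cor(x, 2 * x + 1)  # Pearson's r for a perfectly linear relationship

round(c(curved = r_curved, linear = r_linear), 2)
```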

Visually check that relationships are linear and write a brief description of any potential nonlinearity. You will have to use your judgement. There are no penalties for answering ‘wrong’, so try not to stress out about it too much – just do your best.

# use scatterplots to examine your continuous variables together
plot(d$swb, d$efficacy)

#Linear (+)
plot(d$swb, d$exploit)

#Unclear, No Correlation, or Non-Linear
plot(d$swb, d$stress)

#Linear (-)
plot(d$efficacy, d$exploit)

#Linear (+) - slightly unclear
plot(d$efficacy, d$stress)

#Linear (-) - slightly unclear
plot(d$stress, d$exploit)

#Unclear, No Correlation, or Non-Linear
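
Instead of drawing each scatterplot separately, all pairs can be viewed in one grid. The sketch below uses base R's `pairs()` with `panel.smooth`, which adds a smoothed trend line to each panel. The data are simulated stand-ins, so the variable values are assumptions, not the real scores.

```r
# Simulated stand-in for the continuous variables (not the real data)
set.seed(42)
n <- 200
swb <- rnorm(n)
sim <- data.frame(
  swb      = swb,
  efficacy = 0.4 * swb + rnorm(n),
  stress   = -0.5 * swb + rnorm(n)
)

# One scatterplot per pair of variables, each with a smoothed trend line;
# a clearly bent line suggests a non-linear relationship
pairs(sim, panel = panel.smooth)
```

A roughly straight trend line in a panel is consistent with a linear relationship. The judgement is still visual, so the grid supports the written descriptions rather than replacing them.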

Check Your Variables

describe(d)

         vars    n mean sd median trimmed  mad   min  max range  skew kurtosis   se
swb         1 3148    0  1   0.15    0.04 1.12 -2.63 1.91  4.54 -0.36    -0.45 0.02
efficacy    2 3148    0  1  -0.06    0.01 1.00 -4.54 1.96  6.50 -0.24     0.45 0.02
exploit     3 3148    0  1  -0.28   -0.13 1.08 -1.01 3.37  4.38  0.94     0.35 0.02
stress      4 3148    0  1  -0.08   -0.01 0.99 -2.92 2.75  5.67  0.03    -0.17 0.02

# also use histograms to examine your continuous variables
hist(d$swb)

hist(d$efficacy)

hist(d$exploit)

hist(d$stress)

Issues with My Data - PART OF YOUR WRITEUP

Our data violated one of the assumptions of Pearson’s correlation: outliers were present. Using a cutoff of 3 standard deviations from the mean, we found 15 outliers for efficacy and 32 for exploitativeness.

Apart from these outliers, the scatterplots of our variables showed no clear signs of nonlinearity. In addition, skew and kurtosis for all variables were within an acceptable range.
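
The claim about skew and kurtosis can be checked against an explicit cutoff. The sketch below copies the skew and kurtosis values from the describe() output and flags any variable beyond ±2. That cutoff is an assumed rule of thumb, not something stated in this assignment, so substitute the one used in your course.

```r
# Skew and kurtosis copied from the describe() output
shape <- data.frame(
  variable = c("swb", "efficacy", "exploit", "stress"),
  skew     = c(-0.36, -0.24, 0.94, 0.03),
  kurtosis = c(-0.45, 0.45, 0.35, -0.17)
)

cutoff <- 2  # assumed rule of thumb; replace with your course's cutoff

# TRUE marks a variable whose skew or kurtosis falls outside the cutoff
shape$flagged <- abs(shape$skew) > cutoff | abs(shape$kurtosis) > cutoff
shape
```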

Run Pearson’s Correlation

There are two ways to run Pearson’s correlation in R. You can calculate each correlation one-at-a-time using multiple commands, or you can calculate them all at once and report the scores in a matrix. The matrix output can be confusing at first, but it’s more efficient. We’ll do it both ways.

Run a Single Correlation

# a single test, so nothing here controls for the family-wise error rate
corr_output <- corr.test(d$efficacy, d$stress)

View Single Correlation

Strong effect: Between |0.50| and |1|

Moderate effect: Between |0.30| and |0.49|

Weak effect: Between |0.10| and |0.29|

Trivial effect: Less than |0.10|
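
These cutoffs can also be applied in code, which avoids mislabeling a correlation when there are several to interpret. The helper below, `effect_label()`, is a hypothetical sketch: it treats anything below |0.10| as trivial and uses the absolute value so that negative correlations are labeled by size alone.

```r
# Hypothetical helper: label the size of a correlation using the cutoffs above
effect_label <- function(r) {
  cut(abs(r),
      breaks = c(0, 0.10, 0.30, 0.50, 1),
      labels = c("trivial", "weak", "moderate", "strong"),
      right = FALSE,          # intervals are closed on the left, e.g. [0.10, 0.30)
      include.lowest = TRUE)  # lets the last interval include |r| = 1
}

# Example correlations of different sizes and signs
sizes <- effect_label(c(-0.50, 0.40, -0.08, 0.03))
as.character(sizes)
```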

corr_output

Call:corr.test(x = d$efficacy, y = d$stress)
Correlation matrix 
     [,1]
[1,] -0.4
Sample Size 
[1] 3148
These are the unadjusted probability values.
  The probability values  adjusted for multiple tests are in the p.adj object. 
     [,1]
[1,]    0

 To see confidence intervals of the correlations, print with the short=FALSE option

# r = -0.40, n = 3148, p < 0.001 (the p value prints as 0 because of rounding)
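
The output mentions confidence intervals without showing them. As a rough check, a 95% interval can be worked out by hand with the Fisher z transformation. This sketch starts from the rounded values printed above (r = -0.40, n = 3148), so the result may differ slightly in the last digit from an interval computed on the unrounded correlation.

```r
r <- -0.40   # rounded correlation between efficacy and stress from the output
n <- 3148    # sample size from the output

z    <- atanh(r)         # Fisher z transformation of r
se   <- 1 / sqrt(n - 3)  # standard error of z
crit <- qnorm(0.975)     # critical value for a 95% interval

ci <- tanh(z + c(-1, 1) * crit * se)  # back-transform the limits to the r scale
round(ci, 2)
```

The interval excludes zero, which agrees with the significant p value for this correlation.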

Create a Correlation Matrix

corr_output_m <- corr.test(d)

View Test Output

Strong effect: Between |0.50| and |1|

Moderate effect: Between |0.30| and |0.49|

Weak effect: Between |0.10| and |0.29|

Trivial effect: Less than |0.10|

corr_output_m

Call:corr.test(x = d)
Correlation matrix 
           swb efficacy exploit stress
swb       1.00     0.40   -0.08  -0.50
efficacy  0.40     1.00   -0.01  -0.40
exploit  -0.08    -0.01    1.00   0.03
stress   -0.50    -0.40    0.03   1.00
Sample Size 
[1] 3148
Probability values (Entries above the diagonal are adjusted for multiple tests.) 
         swb efficacy exploit stress
swb        0     0.00    0.00   0.00
efficacy   0     0.00    0.67   0.00
exploit    0     0.67    0.00   0.14
stress     0     0.00    0.07   0.00

 To see confidence intervals of the correlations, print with the short=FALSE option

# stress and exploitativeness are not statistically significant (adjusted p = 0.14; unadjusted p = 0.07)***
# efficacy and exploitativeness are not statistically significant (adjusted p = 0.67; unadjusted p = 0.67)***
# swb & efficacy : 0.40
# swb & exploitativeness : -0.08
# swb & stress : -0.50
# efficacy & stress : -0.40
# exploitativeness & stress : 0.03 ***
# efficacy & exploitativeness : -0.01 ***
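
The adjusted p values above the diagonal can be reproduced by hand, which shows what the correction for multiple tests does. The sketch assumes corr.test() was run with its default Holm adjustment. The four p values that print as 0 are replaced with a tiny placeholder (an assumption, since the exact values are not shown); 0.07 and 0.67 are the unadjusted values below the diagonal.

```r
# Unadjusted p values for the six pairwise tests; 1e-10 is a placeholder
# for the values that print as 0 in the matrix
p_raw <- c(swb_efficacy     = 1e-10,
           swb_exploit      = 1e-10,
           swb_stress       = 1e-10,
           efficacy_stress  = 1e-10,
           exploit_stress   = 0.07,
           efficacy_exploit = 0.67)

# Holm adjustment: the smallest p is multiplied by 6, the next by 5, and so on
p_holm <- p.adjust(p_raw, method = "holm")
round(p_holm, 2)
```

The second-largest p value is multiplied by 2 (0.07 becomes 0.14) and the largest by 1 (0.67 stays 0.67), matching the entries above the diagonal.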

Write Up Results

We tested our hypothesis that subjective well-being, efficacy, and exploitativeness would be positively correlated, and that these three variables would be negatively correlated with stress.

One assumption of Pearson’s correlation was violated: outliers were found for efficacy (15 cases) and exploitativeness (32 cases).

The correlations between stress and exploitativeness (p = 0.14) and between efficacy and exploitativeness (p = 0.67) were not statistically significant, while all remaining correlations were significant (p < 0.001) (see Table 1). The negative correlation between stress and subjective well-being had a strong effect size, and the positive correlation between efficacy and subjective well-being and the negative correlation between efficacy and stress both had moderate effect sizes. The correlation between exploitativeness and subjective well-being was significant but negative, and its effect size was trivial (Cohen, 1988). As such, the predicted relationships among subjective well-being, efficacy, and stress were supported, but none of our predictions involving exploitativeness were.

Table 1: Means, standard deviations, and correlations with confidence intervals

Variable                    M       SD     1               2               3
1. Subjective Well-Being    0.00    1.00
2. Efficacy                 -0.00   1.00   .40**
                                           [.37, .43]
3. Exploitativeness         0.00    1.00   -.08**          -.01
                                           [-.11, -.04]    [-.04, .03]
4. Stress                   -0.00   1.00   -.50**          -.40**          .03
                                           [-.53, -.48]    [-.43, -.37]    [-.00, .07]

Note:
M and SD are used to represent mean and standard deviation, respectively. Values in square brackets indicate the 95% confidence interval. The confidence interval is a plausible range of population correlations that could have caused the sample correlation.
* indicates p < .05.
** indicates p < .01.

References

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. New York, NY: Routledge Academic.