Correlation Lab

Author

Cassidy Cerny

Loading Libraries

library(psych) # for the describe() command and the corr.test() command
library(apaTables) # to create our correlation table
library(kableExtra) # to create our correlation table

Importing Data

d <- read.csv(file="Data/mydata.csv", header=T)

# since we're focusing on our continuous variables, we're going to drop our categorical variables. this will make some stuff we're doing later easier.
d <- subset(d, select=-c(age, gender))

State Your Hypotheses - PART OF YOUR WRITEUP

We predict that social media use, narcissistic traits, and perceived stress will be positively correlated, and all three variables will be positively correlated with interpersonal exploitative behaviors.

Check Your Assumptions

Pearson’s Correlation Coefficient Assumptions

Should have two measurements for each participant for each variable (confirmed by earlier procedures – we dropped any participants with missing data)
Variables should be continuous and normally distributed, or assessments of the relationship may be inaccurate (will do below)
Outliers should be identified and removed, or results will be inaccurate (will do below)
The relationship between the variables should be linear, or it will not be detected (will do below)

Checking for Outliers

Outliers can mask potential effects and cause Type II error (you assume there is no relationship when there really is one, i.e., a false negative).

Note: You are not required to screen out outliers or take any action based on what you see here. This is something you will check and then discuss in your write-up.

# using the scale() command to standardize our variable, viewing a histogram, and then counting statistical outliers
d$npi <- scale(d$npi, center=T, scale=T)
hist(d$npi)

sum(d$npi < -3 | d$npi > 3)

[1] 0

d$socmeduse <- scale(d$socmeduse, center=T, scale=T)
hist(d$socmeduse)

sum(d$socmeduse < -3 | d$socmeduse > 3)

[1] 0

d$exploit <- scale(d$exploit, center=T, scale=T)
hist(d$exploit)

sum(d$exploit < -3 | d$exploit > 3)

[1] 24

d$stress <- scale(d$stress, center=T, scale=T)
hist(d$stress)

sum(d$stress < -3 | d$stress > 3)

[1] 0
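Note that the code above overwrites each raw variable with its standardized version, which is why every mean is 0 and every SD is 1 in the later output. The following is a minimal sketch of one way to count outliers without touching the raw scores. The data frame `toy` and the helper `count_outliers()` are hypothetical names used for illustration only, and the data are simulated rather than taken from the lab's file.

```r
# Simulated stand-in for the lab data (hypothetical values, not real scores)
set.seed(1)
toy <- data.frame(
  npi     = rnorm(200),
  exploit = c(rnorm(195), rep(8, 5))  # five planted extreme scores
)

# Count values more than `cutoff` SDs from the mean.
# The z-scores live only inside the function, so the raw column is unchanged.
count_outliers <- function(x, cutoff = 3) {
  z <- as.numeric(scale(x))  # scale() returns a one-column matrix
  sum(abs(z) > cutoff, na.rm = TRUE)
}

# Apply the same check to every column at once
outlier_counts <- sapply(toy, count_outliers)
outlier_counts
```

Keeping the raw scores intact means that means and standard deviations reported later stay in the original units of each scale.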

Checking for Linear Relationships

Non-linear relationships cannot be detected by Pearson’s correlation (the type of correlation we’re doing here). This means that you may underestimate the relationship between a pair of variables if they have a non-linear relationship, and thus your understanding of what’s happening in your data will be inaccurate.
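As a small illustration of this point, the sketch below builds a perfectly predictable but U-shaped relationship and compares it with a straight-line one. The variables are made up for the example and are not part of the lab data.

```r
# x is symmetric around zero; y depends on x exactly, but not linearly
x <- seq(-3, 3, by = 0.1)
y_curved <- x^2
y_linear <- 2 * x + 1

r_curved <- cor(x, y_curved)  # Pearson's r for the U-shaped relationship
r_linear <- cor(x, y_linear)  # Pearson's r for the straight-line relationship

round(c(curved = r_curved, linear = r_linear), 3)
```

Even though `y_curved` is completely determined by `x`, Pearson's r is essentially zero because the upward and downward halves of the curve cancel out.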

Visually check that relationships are linear and write a brief description of any potential nonlinearity. You will have to use your judgement. There are no penalties for answering ‘wrong’, so try not to stress out about it too much – just do your best.

# use scatterplots to examine your continuous variables together
plot(d$npi, d$socmeduse)

plot(d$npi, d$exploit)

plot(d$npi, d$stress)

plot(d$socmeduse, d$exploit)

plot(d$socmeduse, d$stress)

plot(d$exploit, d$stress)
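The six scatterplots above can also be drawn in a single grid. This is a sketch using base R's `pairs()` with the built-in `panel.smooth` panel, which adds a smoothed trend line to each plot. The data frame `toy` is simulated and its column names are placeholders.

```r
# Simulated data: b is linearly related to a, c is related to a by a curve
set.seed(2)
toy <- data.frame(a = rnorm(150))
toy$b <- 0.5 * toy$a + rnorm(150)
toy$c <- toy$a^2 + rnorm(150, sd = 0.5)

# One scatterplot per pair of variables, each with a smoothed trend line
pairs(toy, panel = panel.smooth)
```

A trend line that bends noticeably, as it should for `a` and `c` here, is a visual hint that the relationship may not be linear.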

Check Your Variables

describe(d)

          vars    n mean sd median trimmed  mad   min  max range  skew kurtosis
npi          1 2150    0  1  -0.39   -0.13 0.75 -0.90 2.41  3.30  1.00    -0.56
socmeduse    2 2150    0  1   0.08    0.03 0.86 -2.71 2.41  5.12 -0.30     0.21
exploit      3 2150    0  1  -0.27   -0.14 1.07 -0.99 3.35  4.34  0.96     0.38
stress       4 2150    0  1   0.06    0.00 0.99 -2.96 2.57  5.53 -0.02    -0.15
            se
npi       0.02
socmeduse 0.02
exploit   0.02
stress    0.02
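To make the skew and kurtosis columns less of a black box, here is a sketch of the plain moment-based formulas in base R. The function names are hypothetical, and `describe()` may apply a small-sample adjustment, so its values can differ slightly from these.

```r
# Moment-based skewness: third central moment over the SD cubed
moment_skew <- function(x) {
  x <- x[!is.na(x)]
  dev <- x - mean(x)
  mean(dev^3) / mean(dev^2)^1.5
}

# Moment-based excess kurtosis: 0 for a normal distribution,
# negative for flatter shapes, positive for heavier tails
moment_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  dev <- x - mean(x)
  mean(dev^4) / mean(dev^2)^2 - 3
}

set.seed(3)
right_skewed <- rexp(1000)       # long right tail, so skew should be positive
symmetric    <- c(-2, -1, 0, 1, 2)

moment_skew(right_skewed)
moment_skew(symmetric)
moment_kurtosis(symmetric)
```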

# also use histograms to examine your continuous variables
hist(d$npi)

hist(d$socmeduse)

hist(d$exploit)

hist(d$stress)

Issues with My Data - PART OF YOUR WRITEUP

For outliers (standardized scores beyond ±3), all variables showed 0 outliers, except for the exploit variable, which had 24. Outliers can significantly affect statistical analyses by distorting measures like the mean and standard deviation, and potentially influencing correlations and regression results. It’s important to be aware of outliers as they may represent data entry errors, natural extreme values, or other anomalies that can impact the validity of the findings.

Most of the scatterplots show linear relationships; however, the scatterplots involving the NPI (Narcissistic Personality Inventory) exhibit nonlinear relationships. This can be an issue because nonlinearity violates the assumption of linearity in statistical models. If this issue is not addressed, it could invalidate the results of correlation and regression analyses, as traditional methods like Pearson’s correlation are designed to detect linear relationships. Inaccurate results could lead to incorrect conclusions about the strength and nature of the relationships between variables.

I did not encounter serious issues with skewness or kurtosis. The NPI variable shows positive skew (1.00) and slightly negative kurtosis (-0.56), and the exploit variable shows a similar positive skew (0.96). The positive skew indicates a longer right tail, while the negative kurtosis suggests a flatter distribution with fewer extreme values. These issues can distort statistical analyses, as they may make the mean less representative and affect the reliability of correlation results.

Run Pearson’s Correlation

There are two ways to run Pearson’s correlation in R. You can calculate each correlation one-at-a-time using multiple commands, or you can calculate them all at once and report the scores in a matrix. The matrix output can be confusing at first, but it’s more efficient. We’ll do it both ways.

Run a Single Correlation

corr_output <- corr.test(d$npi, d$socmeduse)

View Single Correlation

Strong effect: Between |0.50| and |1|
Moderate effect: Between |0.30| and |0.49|
Weak effect: Between |0.10| and |0.29|
Trivial effect: Less than |0.10|
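The cutoffs above can be applied mechanically. The sketch below is one possible helper, with `effect_label()` being a hypothetical name, that treats each band as starting at its lower bound (0.10, 0.30, 0.50) and uses the absolute value of r.

```r
# Label the size of a correlation using the cutoffs listed above
effect_label <- function(r) {
  as.character(cut(
    abs(r),
    breaks = c(0, 0.10, 0.30, 0.50, 1),
    labels = c("trivial", "weak", "moderate", "strong"),
    right = FALSE,          # each band includes its lower bound
    include.lowest = TRUE   # so that |r| = 1 still counts as "strong"
  ))
}

# Correlations taken from this lab's output
effect_label(c(0.35, 0.08, -0.05, 0.13))
```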

corr_output

Call:corr.test(x = d$npi, y = d$socmeduse)
Correlation matrix 
     [,1]
[1,] 0.08
Sample Size 
[1] 2150
These are the unadjusted probability values.
  The probability values  adjusted for multiple tests are in the p.adj object. 
     [,1]
[1,]    0

 To see confidence intervals of the correlations, print with the short=FALSE option
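The printed p-value of 0 is a rounded value, not a true zero. As a sketch of how to see the unrounded p-value and the confidence interval for a single correlation, the example below uses base R's `cor.test()` on simulated data. It is an alternative to `corr.test()`, not the command used in this lab, and `x` and `y` are placeholder variables.

```r
# Simulated pair of variables with a modest positive relationship
set.seed(4)
x <- rnorm(500)
y <- 0.2 * x + rnorm(500)

res <- cor.test(x, y)  # Pearson's correlation by default

res$estimate   # the correlation coefficient r
res$p.value    # the unrounded p-value
res$conf.int   # 95% confidence interval for r
```

When a p-value rounds to 0.00, it is conventionally reported as p < .01 or p < .001 rather than p = 0.00.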

Create a Correlation Matrix

corr_output_m <- corr.test(d)

View Test Output

Strong effect: Between |0.50| and |1|
Moderate effect: Between |0.30| and |0.49|
Weak effect: Between |0.10| and |0.29|
Trivial effect: Less than |0.10|

corr_output_m

Call:corr.test(x = d)
Correlation matrix 
            npi socmeduse exploit stress
npi        1.00      0.08    0.35  -0.05
socmeduse  0.08      1.00    0.13   0.11
exploit    0.35      0.13    1.00   0.04
stress    -0.05      0.11    0.04   1.00
Sample Size 
[1] 2150
Probability values (Entries above the diagonal are adjusted for multiple tests.) 
           npi socmeduse exploit stress
npi       0.00         0    0.00   0.07
socmeduse 0.00         0    0.00   0.00
exploit   0.00         0    0.00   0.08
stress    0.04         0    0.08   0.00

 To see confidence intervals of the correlations, print with the short=FALSE option
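In the probability table above, values below the diagonal are unadjusted and values above it are adjusted for multiple tests, which is why the npi and stress pair appears as both 0.04 and 0.07. The sketch below shows the general idea using base R, assuming a Holm adjustment (which, as far as I know, is the default in `corr.test()`). The data are simulated and the column names are placeholders.

```r
# Simulated data frame with four continuous variables
set.seed(5)
toy <- data.frame(a = rnorm(300))
toy$b <- 0.3 * toy$a + rnorm(300)
toy$c <- rnorm(300)
toy$d <- 0.1 * toy$b + rnorm(300)

# Every unique pair of variables (6 pairs for 4 variables)
pairs_idx <- combn(names(toy), 2)

# Unadjusted p-value for each pair
raw_p <- apply(pairs_idx, 2, function(v) {
  cor.test(toy[[v[1]]], toy[[v[2]]])$p.value
})
names(raw_p) <- apply(pairs_idx, 2, paste, collapse = "-")

# Holm adjustment: never smaller than the unadjusted value
adj_p <- p.adjust(raw_p, method = "holm")

round(cbind(raw_p, adj_p), 3)
```

The adjustment matters most for borderline results, since a p-value just under .05 can move above it once the number of tests is taken into account.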

Write Up Results

The hypothesis for this study was that social media use, narcissistic personality traits, and perceived stress would be positively correlated with one another, and that all three variables would be positively correlated with interpersonal exploitative behaviors. No serious issues with skewness or kurtosis were identified. The NPI variable showed positive skew (1.00) and slightly negative kurtosis (-0.56): the positive skew indicates a longer right tail, while the negative kurtosis indicates a flatter distribution with fewer extreme values. The exploit variable showed a similar positive skew (0.96), but its kurtosis (0.38) was slightly positive rather than low. Skew of this kind can make the mean less representative of the data and may affect the reliability of the correlations. Additionally, 24 outliers (standardized scores beyond ±3) were identified in the exploit variable; no other variable had outliers. Outliers can distort measures like the mean and standard deviation and can influence correlation results. It is important to be aware of them, as they may represent data entry errors, natural extreme values, or other issues that can impact the validity of the findings.

Scatterplots were used to visually inspect the relationships between variables. While most scatterplots showed linear relationships, the scatterplots involving the NPI revealed nonlinear relationships. This is an important issue, as nonlinearity violates the assumption of linearity in statistical models. If not addressed, it could invalidate the results of correlation and regression analyses, as traditional methods like Pearson’s correlation are designed to detect linear relationships. Nonlinearities may lead to inaccurate results and incorrect conclusions regarding the strength and nature of the relationships between variables.

NPI was moderately and positively correlated with exploitative behaviors (r = 0.35, p < 0.01), meaning people with higher narcissistic traits tend to report more exploitative behaviors. NPI had a statistically significant but trivial positive correlation with social media use (r = 0.08, p < 0.01). With a sample this large (n = 2150), even very small correlations reach significance, so this relationship is not practically meaningful. NPI was slightly negatively correlated with stress (r = -0.05); this was significant before adjusting for multiple tests (p = 0.04) but not after (p = 0.07), and it is trivial in size either way. Social media use had weak but significant positive correlations with exploitative behaviors (r = 0.13, p < 0.01) and stress (r = 0.11, p < 0.01). Exploitative behaviors and stress were not significantly correlated (r = 0.04, p = 0.08). Overall, the hypothesis was partially supported: the predicted positive correlations were found for NPI with exploitative behaviors and for social media use with both exploitative behaviors and stress, although only the first was larger than a weak effect. The predicted positive relationships of stress with NPI and with exploitative behaviors were not found.


Table 1: Means, standard deviations, and correlations with confidence intervals
Variable	M	SD	1	2	3
Narcissistic Personality (npi)	0.00	1.00

Social Media Use (socmeduse)	-0.00	1.00	.08**
			[.04, .12]

Exploitativeness (exploit)	-0.00	1.00	.35**	.13**
			[.32, .39]	[.09, .17]

Perceived Stress Questionnaire (stress)	0.00	1.00	-.05*	.11**	.04
			[-.09, -.00]	[.07, .15]	[-.00, .08]

Note:
M and SD are used to represent mean and standard deviation, respectively. Values in square brackets indicate the 95% confidence interval. The confidence interval is a plausible range of population correlations that could have caused the sample correlation.
* indicates p < .05.
** indicates p < .01.
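For readers curious where a bracketed confidence interval can come from, the sketch below uses the standard Fisher z-transformation. The helper `r_ci()` is a hypothetical name, and the table's own intervals were produced by the table software from unrounded correlations, possibly with a different method, so the result need not match the table exactly.

```r
# Approximate confidence interval for a correlation via Fisher's z
r_ci <- function(r, n, level = 0.95) {
  z    <- atanh(r)                      # transform r to the z scale
  se   <- 1 / sqrt(n - 3)               # standard error on the z scale
  crit <- qnorm(1 - (1 - level) / 2)    # about 1.96 for a 95% interval
  tanh(z + c(-1, 1) * crit * se)        # transform the bounds back to r
}

# Example using the rounded npi and exploit correlation from this lab
ci <- r_ci(0.35, 2150)
round(ci, 2)
```

An interval that excludes zero corresponds to a correlation that is significant at the matching alpha level.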

References

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. New York, NY: Routledge Academic.