Correlation in R

X! Finally! We’ve been spending so much time looking at response variables so far, and now we learn the first of the techniques that we’ll use for looking at explanatory and response variables together.

You’ll see that correlation and regression are both models that compare X and Y, but they look at different things:

Correlation looks at the strength of the relationship (the clustering) between two variables but does not assume that X causes the state of Y.
Does X change with Y?
Regression uses a linear model to predict the value of Y given a particular X. In this case we assume that the value of X causes the value of Y.
Does the value of X strongly predict the value of Y?
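To make the difference concrete, here is a minimal sketch using simulated data (the vectors x and y are made up for illustration and are not part of the course data). It shows that correlation doesn't care which variable you call X, while regression does:

```r
# Made-up data for illustration only (not from the module's data sets)
set.seed(1)
x <- rnorm(30, mean = 50, sd = 5)
y <- 10 + 0.8 * x + rnorm(30, sd = 4)

# Correlation is symmetric: swapping the variables changes nothing
r_xy <- cor(x, y)
r_yx <- cor(y, x)

# Regression is not symmetric: Y ~ X and X ~ Y give different slopes
slope_y_on_x <- unname(coef(lm(y ~ x))[2])
slope_x_on_y <- unname(coef(lm(x ~ y))[2])

# The two are linked: the slope of Y ~ X equals r * sd(y) / sd(x)
c(r = r_xy, slope_y_on_x = slope_y_on_x, slope_x_on_y = slope_x_on_y)
all.equal(slope_y_on_x, r_xy * sd(y) / sd(x))
```

The last line is why the two techniques feel so similar: the regression slope is just r rescaled by the spread of the two variables.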

Correlation tries to answer a few questions:

Are two measurement variables related in a consistent, linear form?
If they are related, what is the direction of the relationship?
What is the strength of their relationship?

The strength and direction of the relationship are described by a statistic called Pearson’s r, which estimates the parameter $\rho$ (the Greek letter rho). The values of r and $\rho$ range from –1 to +1. Pearson’s r is a sample estimate, like the sample mean or standard deviation. First we calculate r, and then we test r for significance. Simple linear correlation tests the null hypothesis H₀: $\rho = 0$; r is significant if it’s sufficiently different from zero.
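If you'd like to see what's under the hood, the sketch below calculates r and its t test by hand and checks them against R's built-in functions. The x and y values are made up for illustration; the formulas are the standard ones for Pearson's r and for the t test of r with n – 2 degrees of freedom.

```r
# Made-up measurements for illustration only
x <- c(2.1, 3.4, 3.9, 5.2, 6.0, 7.3, 8.1, 9.4)
y <- c(1.8, 2.9, 4.4, 4.1, 6.3, 6.0, 8.2, 8.8)
n <- length(x)

# Pearson's r: sum of cross-products divided by the
# square root of the product of the two sums of squares
dx <- x - mean(x)
dy <- y - mean(y)
r_by_hand <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# t statistic for the test that rho is zero, with n - 2 degrees of freedom
t_by_hand <- r_by_hand * sqrt((n - 2) / (1 - r_by_hand^2))
p_by_hand <- 2 * pt(-abs(t_by_hand), df = n - 2)

# Compare with R's built-in functions
ct <- cor.test(x, y)
all.equal(r_by_hand, cor(x, y))
all.equal(t_by_hand, unname(ct$statistic))
all.equal(p_by_hand, ct$p.value)
```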

Simple linear correlation has some assumptions, as usual:

bivariate random sampling from the population of interest
both variables are numerical (measured at the interval or ratio scale)
both variables are approximately normally-distributed
or if they aren’t, they can be transformed to make them normal
the relationship between the two variables is linear
but transformation may remove linearity sometimes
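One possible way to check the normality and linearity assumptions is sketched below. It uses simulated, right-skewed data (made up for illustration), and the choice of shapiro.test and normal quantile plots is an assumption on our part, not something this module prescribes.

```r
# Simulated, right-skewed data for illustration only
set.seed(42)
x <- rlnorm(40, meanlog = 3, sdlog = 0.5)
y <- x * rlnorm(40, meanlog = 0, sdlog = 0.3)

# Shapiro-Wilk test of normality: a small P suggests non-normality
sw_raw <- shapiro.test(x)
sw_log <- shapiro.test(log(x))
c(raw = sw_raw$p.value, logged = sw_log$p.value)

# Eyeball normality and linearity before and after a log transformation
op <- par(mfrow = c(2, 2))
qqnorm(x, main = "x (raw)"); qqline(x)
qqnorm(log(x), main = "log(x)"); qqline(log(x))
plot(y ~ x, main = "Raw scale")
plot(log(y) ~ log(x), main = "Log scale")
par(op)
```

Always look at the scatter plot again after transforming, because a transformation that fixes normality can change the shape of the relationship.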

Example of correlation in R

How birds fly (by powered flight or gliding or soaring, etc.) depends on the proportions of the parts of their wing and the size of their body. Consider a study of the ratios of bird wing and tail feathers. The data downloadable here show the relationship between the length of the feathers of the wing and the feathers of the tail (in mm) of 12 hummingbirds.

# Bring in the data

birds <- read.csv(url("https://raw.githubusercontent.com/nmccurtin/CSVfilesbiostats/master/birds.csv"))     ## use your data path

# Make a scatter plot. We set it up like Y ~ X.

plot(tail ~ wing, data = birds, pch = 16, cex = 1.5, col = "blue3",
         xlab ="Wing feather length (mm)", 
         ylab = "Tail feather length (mm)")

# pch is plot character 16, which is a filled-in circle

# cex is character expand by 1.5, making it 50% bigger than the default

Okay, so we see that generally what we call the “data ellipse” in the scatter plot has a positive slope, so we’d tentatively say that it has a positive correlation, but let’s calculate it now:

# Do the correlation test and make it an object

birdsCor <- cor.test(birds$tail, birds$wing)

# Inspect the results of the correlation test

birdsCor

## 
##  Pearson's product-moment correlation
## 
## data:  birds$tail and birds$wing
## t = 3.2907, df = 10, p-value = 0.008141
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2509972 0.9159244
## sample estimates:
##       cor 
## 0.7210354

This output tells us on the last line that r = 0.72. It does a t test for significance and finds a significant P = 0.008. Notice that it has 10 degrees of freedom: df = # of X–Y pairs – 2. It even helpfully gives the 95% CI, which doesn’t include 0, consistent with rejecting $H_0$.
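As a sanity check, the t statistic, P value, and confidence interval can all be rebuilt from just the correlation estimate and the degrees of freedom printed in the output above. This is a sketch rather than a peek at R's source code; the object names are ours, and the interval uses Fisher's z transformation.

```r
# Values copied from the cor.test output above
r  <- 0.7210354
df <- 10
n  <- df + 2          # 12 hummingbirds

# t statistic and two-sided P value for the test that rho is zero
t_stat <- r * sqrt(df / (1 - r^2))
p_val  <- 2 * pt(-abs(t_stat), df = df)

# 95% CI via Fisher's z transformation: z = atanh(r), SE(z) = 1 / sqrt(n - 3)
z  <- atanh(r)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) / sqrt(n - 3))

round(c(t = t_stat, p = p_val, lower = ci[1], upper = ci[2]), 4)
```

The rounded values should line up with the t, p-value, and confidence interval lines in the output.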

You can do cor.test without making an object, but sometimes we want to use such an object later. If you just typed cor.test(birds$tail, birds$wing) into the console, you’d get the same output without making an object.
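Here's a sketch of why keeping the object is handy: each piece of the printed output can be pulled out by name. The demo data frame below is a made-up stand-in so the example runs on its own; with the real data you would use birdsCor in exactly the same way.

```r
# Made-up stand-in for the birds data, for illustration only
demo <- data.frame(
  wing = c(10.2, 10.5, 10.9, 11.1, 11.4, 11.8, 12.0, 12.3),
  tail = c(7.1, 7.6, 7.4, 7.9, 8.3, 8.1, 8.6, 8.9)
)

demoCor <- cor.test(demo$tail, demo$wing)

names(demoCor)        # everything stored in the object
demoCor$estimate      # r
demoCor$statistic     # t
demoCor$parameter     # degrees of freedom
demoCor$p.value       # P value
demoCor$conf.int      # 95% confidence interval
```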

## If you just want the standard error of r, you have to work backward a little:

## get the correlation coefficient and make an object

r <- birdsCor$estimate

# calculate the SE

SE <- sqrt( (1-r^2)/(nrow(birds) - 2) )  ## This was edited from the original module

# SHOW ME WHAT YOU GOT

unname(SE)

## [1] 0.2191137

Ta da!
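The standard error also ties back to the test itself, since the t statistic is just the estimate divided by its standard error. A quick check using the two numbers printed above (copied by hand, so expect tiny rounding differences):

```r
# Values copied from the output above
r  <- 0.7210354
SE <- 0.2191137

# The t statistic is the estimate divided by its standard error
t_from_se <- r / SE
round(t_from_se, 4)
```

This should agree with the t value that cor.test reported for the birds data.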

Nonparametric rank correlation using Spearman’s $r_s$

Real life happens a lot and our data aren’t normally distributed. Consider these data comparing exam scores in chemistry courses and biology courses. Is there a significant relationship between the two?

# Bring in the data

exams <- read.csv(url("https://raw.githubusercontent.com/nmccurtin/CSVfilesbiostats/master/exams.csv"))

# What's the plot look like?

plot(bio ~ chem, data = exams, pch = 16, cex = 1.2, col = "red3",
        xlab = "Chemistry exam score", 
        ylab = "Biology exam score")

Umm…positive? Hard to say because there’s so much noise. Let’s do the test (here without making the object):

# add the Spearman argument to cor.test

cor.test(exams$bio, exams$chem, method = "spearman")

## 
##  Spearman's rank correlation rho
## 
## data:  exams$bio and exams$chem
## S = 72, p-value = 0.09579
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.5636364

Here we find that there is not a significant relationship between exam scores ($r_s$ = 0.56, P = 0.096).
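What is R actually doing with the ranks? The sketch below uses made-up exam scores (not the course data, and with no ties) to show that Spearman's coefficient is Pearson's r calculated on the ranks, and that it can also be found from S, the sum of squared rank differences.

```r
# Made-up exam scores for illustration only (no ties)
chem <- c(55, 62, 70, 48, 81, 90, 66, 74)
bio  <- c(60, 58, 75, 52, 70, 88, 79, 65)
n <- length(chem)

# Rank each variable, then measure how far the ranks disagree
d <- rank(chem) - rank(bio)
S <- sum(d^2)                     # the S statistic that cor.test reports

# With no ties, r_s = 1 - 6S / (n(n^2 - 1))
rs_formula <- 1 - 6 * S / (n * (n^2 - 1))

# Equivalent: Pearson's r computed on the ranks
rs_ranks <- cor(rank(chem), rank(bio))

all.equal(rs_formula, rs_ranks)
all.equal(rs_formula, cor(chem, bio, method = "spearman"))

# The output above (S = 72, rho = 0.5636364) is consistent with n = 10 pairs;
# n = 10 is our inference and is not stated in the module
rs_module <- 1 - 6 * 72 / (10 * (10^2 - 1))
rs_module
```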

By the way, having R do this is a huge improvement over how we did it in the old days. You should all pour one out for your homies who had to sort out all those ranks when they did the work by hand.

That’s it!

Now you’re set to go for correlation. Go to it!