X! Finally! We’ve been spending so much time looking at response variables so far, and now we learn the first of the techniques that we’ll use for looking at explanatory and response variables together.
You’ll see that correlation and regression are both models that compare X and Y, but they look at different things:
Correlation tries to answer a few questions:
The strength and direction of the relationship is described by a statistic called Pearson’s r, which estimates the parameter \(\rho\) (the Greek letter rho). Both r and \(\rho\) range from –1 to +1. Pearson’s r is a sample estimate, like the sample mean or standard deviation. First we calculate r and then we test its significance. Simple linear correlation tests the null hypothesis \(H_0\): \(\rho = 0\); r is significant if it’s sufficiently different from zero.
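To make the definition concrete, here is a minimal sketch of Pearson’s r computed by hand and checked against R’s built-in cor(). The vectors x and y are made-up illustration values, not data from this module.

```r
# Hypothetical illustration data (not from the module)
x <- c(10.2, 11.5, 9.8, 12.1, 10.9, 11.8)
y <- c(7.1, 7.9, 6.8, 8.4, 7.2, 8.0)

# Deviations of each variable from its own mean
dx <- x - mean(x)
dy <- y - mean(y)

# Pearson's r: sum of cross-products divided by the square root
# of the product of the two sums of squares
r_by_hand <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

r_by_hand
cor(x, y)  # the built-in function should print the same value
```

Because the numerator and denominator are in the same units, r is unitless and always falls between –1 and +1.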
Simple linear correlation has some assumptions, as usual:
How birds fly (by powered flight or gliding or soaring, etc.) depends on the proportions of the parts of their wing and the size of their body. Consider a study of the ratios of bird wing and tail feathers. The data downloadable here show the relationship between the length of the feathers of the wing and the feathers of the tail (in mm) of 12 hummingbirds.
# Bring in the data
birds <- read.csv(url("https://raw.githubusercontent.com/nmccurtin/CSVfilesbiostats/master/birds.csv")) ## use your data path
# Make a scatter plot. We set it up like Y ~ X.
plot(tail ~ wing, data = birds, pch = 16, cex = 1.5, col = "blue3",
xlab ="Wing feather length (mm)",
ylab = "Tail feather length (mm)")
# pch is plot character 16, which is a filled-in circle
# cex is character expand by 1.5, making it 50% bigger than the default
Okay, so we see that generally what we call the “data ellipse” in the scatter plot has a positive slope, so we’d tentatively say that it has a positive correlation, but let’s calculate it now:
# Do the correlation test and make it an object
birdsCor <- cor.test(birds$tail, birds$wing)
# Inspect the results of the correlation test
birdsCor
##
## Pearson's product-moment correlation
##
## data: birds$tail and birds$wing
## t = 3.2907, df = 10, p-value = 0.008141
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2509972 0.9159244
## sample estimates:
## cor
## 0.7210354
This output tells us on the last line that r = 0.72. It does a t test for significance and finds a significant P = 0.008. Notice that it has 10 degrees of freedom: df = # of X–Y pairs – 2 = 12 – 2. It even helpfully gives the 95% CI (0.25 to 0.92), which doesn’t include 0, consistent with rejecting \(H_0\).
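If you’re curious where that confidence interval comes from, cor.test bases it on Fisher’s z transformation. The sketch below rebuilds the interval from the r and df printed in the output above; the object names (r_obs, n_pairs, and so on) are just illustrative.

```r
# Values copied from the cor.test output above
r_obs   <- 0.7210354
n_pairs <- 12                      # df + 2

z_obs <- atanh(r_obs)              # Fisher's z transform of r
se_z  <- 1 / sqrt(n_pairs - 3)     # approximate standard error of z

# Build the interval on the z scale, then back-transform to the r scale
ci <- tanh(z_obs + c(-1, 1) * qnorm(0.975) * se_z)
ci  # compare with the 95 percent confidence interval printed by cor.test
```

Notice that the interval is not symmetric around r; that’s a side effect of back-transforming from the z scale.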
You can do cor.test without making an object, but sometimes we want to use such an object later. If you just typed cor.test(birds$tail, birds$wing) into the console, you’d get the same output without making an object.
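Here is a small sketch of why the object is handy: the result of cor.test is a list, and you can pull individual pieces out of it with $. The data here are made-up illustration values so the example runs on its own.

```r
# Hypothetical illustration data (not from the module)
x <- c(10.2, 11.5, 9.8, 12.1, 10.9, 11.8)
y <- c(7.1, 7.9, 6.8, 8.4, 7.2, 8.0)

res <- cor.test(y, x)

names(res)      # everything stored in the object
res$estimate    # the correlation coefficient, r
res$statistic   # the t statistic
res$parameter   # the degrees of freedom
res$p.value     # the P value
res$conf.int    # the 95% confidence interval
```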
## If you just want the standard error of r, you have to work backward a little:
## get the correlation coefficient and make an object
r <- birdsCor$estimate
# calculate the SE
SE <- sqrt( (1-r^2)/(nrow(birds) - 2) ) ## SE of r = sqrt((1 - r^2)/(n - 2))
# SHOW ME WHAT YOU GOT
unname(SE)
## [1] 0.2191137
Ta da!
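As a quick check, the t statistic in the cor.test output is just r divided by its standard error, and the P value is the two-tailed area beyond that t. This sketch uses the numbers printed above; the object names are illustrative.

```r
# Values copied from the output above
r_obs <- 0.7210354
se_r  <- 0.2191137
df    <- 10

t_stat <- r_obs / se_r              # t = r / SE
p_val  <- 2 * pt(-abs(t_stat), df)  # two-tailed P value

t_stat  # compare with t = 3.2907 in the cor.test output
p_val   # compare with p-value = 0.008141
```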
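Real life happens, and often our data aren’t normally distributed. When that’s the case, we can use Spearman’s rank correlation, which works on the ranks of the data instead of the raw values. Consider these data comparing exam scores in chemistry courses and biology courses. Is there a significant relationship between the two?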
# Bring in the data
exams <- read.csv(url("https://raw.githubusercontent.com/nmccurtin/CSVfilesbiostats/master/exams.csv"))
# What's the plot look like?
plot(bio ~ chem, data = exams, pch = 16, cex = 1.2, col = "red3",
xlab = "Chemistry exam score",
ylab = "Biology exam score")
Umm…positive? Hard to say because there’s so much noise. Let’s do the test (here without making the object):
# add the Spearman argument to cor.test
cor.test(exams$bio, exams$chem, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: exams$bio and exams$chem
## S = 72, p-value = 0.09579
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.5636364
Here we find that there is not a significant relationship between exam scores (\(r_s\) = 0.56, P = 0.096).
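Under the hood, Spearman’s \(r_s\) is just Pearson’s r calculated on the ranks. The sketch below shows that three routes give the same answer; the scores are made-up illustration values with no ties, not the exams data.

```r
# Hypothetical illustration scores with no ties (not the exams data)
chem_demo <- c(55, 62, 70, 48, 81, 90, 66, 73)
bio_demo  <- c(60, 58, 75, 52, 70, 88, 80, 65)

# Route 1: ask cor() for Spearman directly
rs_builtin <- cor(bio_demo, chem_demo, method = "spearman")

# Route 2: rank each variable, then compute Pearson's r on the ranks
rs_ranks <- cor(rank(bio_demo), rank(chem_demo))

# Route 3: the hand formula, valid when there are no ties
d <- rank(bio_demo) - rank(chem_demo)  # difference in ranks for each pair
n <- length(d)
S <- sum(d^2)                          # sum of squared rank differences
rs_formula <- 1 - 6 * S / (n * (n^2 - 1))

c(rs_builtin, rs_ranks, rs_formula)    # all three should match
```

When there are no ties, the S reported by cor.test with method = "spearman" is this same sum of squared rank differences.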
By the way, having R do this is a huge improvement over how we did it in the old days. You should all pour one out for your homies who had to sort out all those ranks when they did the work by hand.
Now you’re set to go for correlation. Go to it!