Alban Guillaumet, Troy University
When two numerical variables are associated, we say that they are correlated. For instance, brain size and body size are correlated across mammal species.
Definition: The (linear, or Pearson’s) correlation coefficient \( \rho \) measures the strength and direction of the association between two numerical (continuous or discrete) variables in a population.
The correlation coefficient measures the tendency of two numerical variables to co-vary in a linear way.
The symbol \( r \) denotes a sample estimate of \( \rho \).
\[ r = \frac{\sum_{i}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sqrt{\sum_{i}(X_{i}-\bar{X})^2}\sqrt{\sum_{i}(Y_{i}-\bar{Y})^2}} \]
\[ -1 \leq r \leq 1 \]
\[ r = \frac{\frac{1}{n-1}\sum_{i}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sqrt{\frac{1}{n-1}\sum_{i}(X_{i}-\bar{X})^2}\sqrt{\frac{1}{n-1}\sum_{i}(Y_{i}-\bar{Y})^2}} \]
\[ r = \frac{\mathrm{Covariance}(X,Y)}{s_{X}s_{Y}} \]
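As a quick check that the expressions above agree, the following sketch computes \( r \) from the sums of products and from the covariance and standard deviations, then compares both with R's built-in `cor()`. The vectors `x` and `y` are made-up values used only for illustration.

```r
# Illustrative (made-up) data, not taken from any study
x <- c(1.2, 2.5, 3.1, 4.8, 5.0, 6.7)
y <- c(2.0, 2.9, 3.5, 5.1, 4.6, 7.2)

# Deviations from the sample means
dx <- x - mean(x)
dy <- y - mean(y)

# r from the sum of products and the sums of squares
r_sums <- sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))

# r as the covariance divided by the product of the standard deviations
r_cov <- cov(x, y) / (sd(x) * sd(y))

# Both versions should match each other and the built-in function
all.equal(r_sums, r_cov)
all.equal(r_sums, cor(x, y))
```

The \( \frac{1}{n-1} \) factors cancel between numerator and denominator, which is why the two forms give the same value.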
Definition: A bias called attenuation occurs when the variables are measured with error, whereby \( r \) tends to underestimate the magnitude of \( \rho \) (it is closer to zero, on average, than the true correlation).
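Attenuation can be seen in a small simulation. The sketch below is only an illustration under assumed values: a true correlation of 0.8, standard normal true values, and independent measurement error with standard deviation 1 added to both variables. None of these numbers come from real data.

```r
set.seed(1)      # arbitrary seed, for reproducibility only
n   <- 10000     # large sample so that sampling noise is small
rho <- 0.8       # assumed true correlation

# True values: two standard normal variables with correlation rho
x_true <- rnorm(n)
y_true <- rho * x_true + sqrt(1 - rho^2) * rnorm(n)

# Observed values = true values + independent measurement error (sd = 1, assumed)
x_obs <- x_true + rnorm(n, sd = 1)
y_obs <- y_true + rnorm(n, sd = 1)

r_true <- cor(x_true, y_true)   # close to rho
r_obs  <- cor(x_obs, y_obs)     # noticeably closer to zero
c(true = r_true, observed = r_obs)
```

With these settings the error variance equals the variance of the true values, so the observed correlation is expected to be about half of the true one (near 0.4).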
Practice Problem #1
In their study of hyena laughter, or “giggling”, Mathevon et al. (2010) asked whether sound spectral properties of hyenas' giggles are associated with age. The accompanying figure and data show the giggle fundamental frequency (in hertz) and the age (in years) of 16 hyenas.
Read in the data and take a peek
hyenaData <- read.csv("C:/Alban/WM/TEACHING/Biostats/Alban/My class/Lectures/W&S data sets/chapter16/chap16q01HyenaGigglesAndAge.csv")
str(hyenaData)
'data.frame': 16 obs. of 2 variables:
$ Age_years : int 2 2 2 6 9 10 13 10 14 14 ...
$ Fundamental_frequency_Hz: int 840 670 580 470 540 660 510 520 500 480 ...
Display the data in a scatter plot and compute the correlation
plot(Fundamental_frequency_Hz ~ Age_years, data=hyenaData, cex =2.5, cex.lab=1.5, xlab = "Age (years)", ylab = "Fundamental Frequency (Hz)", pch = 19, col = "forestgreen")
cor(hyenaData$Age_years,hyenaData$Fundamental_frequency_Hz)
[1] -0.601798
hyenaCorr <- cor.test(hyenaData$Age_years, hyenaData$Fundamental_frequency_Hz)
(r <- hyenaCorr$estimate)
cor
-0.601798
How do we quantify the uncertainty of this estimate?
We need the standard error and confidence intervals.
\[ \mathrm{SE}_{r} = \sqrt{\frac{1-r^2}{n-2}} \]
n <- nrow(hyenaData)
sderr <- sqrt((1-r^2)/(n-2))
unname(sderr)
[1] 0.2134478
Long way: Don't need to learn it!
z <- 0.5*log((1+r)/(1-r)) # Compute z
zsderr <- sqrt(1/(n-3)) # Std err of z
zlb <- z - 1.96*zsderr # Lower bound for z
zub <- z + 1.96*zsderr # Upper bound for z
rlb <- (exp(2*zlb) - 1)/(exp(2*zlb) + 1) # Back-transform lower bound to r
rub <- (exp(2*zub) - 1)/(exp(2*zub) + 1) # Back-transform upper bound to r
CI <- c(rlb, rub)
names(CI) <- c("lower", "upper")
CI
lower upper
-0.8453322 -0.1511871
Short way
hyenaCorr$conf.int
[1] -0.8453293 -0.1511968
attr(,"conf.level")
[1] 0.95
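The long way and the short way differ slightly in the later decimals because the long way rounds the normal critical value to 1.96. A compact sketch of the same calculation, using `qnorm(0.975)` for the critical value and the built-in `atanh()` and `tanh()` for the transformation and its inverse, is shown here. The values of `r` and `n` are typed in from the output above so that the block runs on its own.

```r
r <- -0.601798   # sample correlation reported above
n <- 16          # number of hyenas

z      <- atanh(r)          # same as 0.5 * log((1 + r) / (1 - r))
z_se   <- 1 / sqrt(n - 3)   # standard error of z
z_crit <- qnorm(0.975)      # normal critical value, about 1.959964

# Interval on the z scale, back-transformed to the r scale
ci <- tanh(z + c(-1, 1) * z_crit * z_se)
names(ci) <- c("lower", "upper")
ci
```

The result should agree with `hyenaCorr$conf.int` to within the rounding of `r`.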
Testing for zero correlation
\[ H_{0}: \rho = 0\\ H_{A}: \rho \neq 0 \]
Test statistic \[ t = \frac{r}{\mathrm{SE}_{r}} \]
Under \( H_{0} \), the test statistic is \( t \)-distributed with \( df = n-2 \) (two parameters, \( \bar{X} \) and \( \bar{Y} \), were estimated from the data).
Long way: Don't need to learn it!
(tstat <- hyenaCorr$estimate/sderr)
cor
-2.819416
pval <- 2*pt(abs(tstat), df=n-2, lower.tail=FALSE)
unname(pval)
[1] 0.01364843
Short way
cor.test(hyenaData$Age_years, hyenaData$Fundamental_frequency_Hz)
Pearson's product-moment correlation
data: hyenaData$Age_years and hyenaData$Fundamental_frequency_Hz
t = -2.8194, df = 14, p-value = 0.01365
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.8453293 -0.1511968
sample estimates:
cor
-0.601798
As always, we assume the sample of individuals is a random sample from the population.
The measurements are also assumed to come from a joint probability distribution called a bivariate normal distribution, whose density \( p(X,Y) \) is a function of two variables.
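One way to get a feel for a bivariate normal distribution is to simulate from it. The following is a minimal base-R sketch, not a reproduction of any particular figure. The sample size, the correlation of -0.6, and the use of standard normal margins are all assumptions chosen for illustration.

```r
set.seed(42)     # arbitrary seed, for reproducibility only
n   <- 500       # assumed sample size
rho <- -0.6      # assumed population correlation

# Two standard normal variables constructed to have correlation rho
x <- rnorm(n)
y <- rho * x + sqrt(1 - rho^2) * rnorm(n)

# Scatter plot: the cloud of points should look roughly elliptical
plot(x, y, pch = 19, col = "forestgreen",
     xlab = "X", ylab = "Y", main = "Simulated bivariate normal sample")

# Marginal histograms: each variable on its own should look bell-shaped
hist(x, main = "Marginal distribution of X")
hist(y, main = "Marginal distribution of Y")

cor(x, y)   # should be near rho
```

Real data that depart strongly from this pattern (a funnel shape, pronounced outliers, or a curved relationship) suggest that the bivariate normal assumption may not hold.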