M. Drew LaMar
April 13, 2016
Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital. - Aaron Levenstein
Do not put your faith in what statistics say until you have carefully considered what they do not say. - William W. Watt
Variables: For a correlation, our data consist of two numerical variables (continuous or discrete).
Definition: The (linear)
correlation coefficient \( \rho \) measures the strength and direction of the association between two numerical variables in a population.
The linear (Pearson) correlation coefficient measures the tendency of two numerical variables to co-vary in a linear way.
The symbol \( r \) denotes a sample estimate of \( \rho \).
\[ r = \frac{\sum_{i}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sqrt{\sum_{i}(X_{i}-\bar{X})^2}\sqrt{\sum_{i}(Y_{i}-\bar{Y})^2}} \]
\[ -1 \leq r \leq 1 \]
\[ r = \frac{\frac{1}{n-1}\sum_{i}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sqrt{\frac{1}{n-1}\sum_{i}(X_{i}-\bar{X})^2}\sqrt{\frac{1}{n-1}\sum_{i}(Y_{i}-\bar{Y})^2}} \]
\[ r = \frac{\mathrm{Covariance}(X,Y)}{s_{X}s_{Y}} \]
Practice Problem #1
In their study of hyena laughter, or “giggling”, Mathevon et al. (2010) asked whether sound spectral properties of hyenas' giggles are associated with age. The accompanying figure and data show the giggle fundamental frequency (in hertz) and the age (in years) of 16 hyenas.
Read in the data and take a peek
hyenaData <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter16/chap16q01Hyena%20GigglesAndAge.csv")
str(hyenaData)
'data.frame': 16 obs. of 2 variables:
$ Age_years : int 2 2 2 6 9 10 13 10 14 14 ...
$ Fundamental_frequency_Hz: int 840 670 580 470 540 660 510 520 500 480 ...
Plot in a scatter plot
plot(Fundamental_frequency_Hz ~ Age_years, data=hyenaData, cex =1.5, cex.lab=1.5)
Compute correlation coefficient
hyenaCorr <- cor.test(hyenaData$Age_years, hyenaData$Fundamental_frequency_Hz)
(r <- hyenaCorr$estimate)
cor
-0.601798
How do we quantify the uncertainty of this estimate?
We need the standard error and confidence intervals.
\[ \mathrm{SE}_{r} = \sqrt{\frac{1-r^2}{n-2}} \]
n <- nrow(hyenaData)
sderr <- sqrt((1-r^2)/(n-2))
unname(sderr)
[1] 0.2134478
Note: Sampling distributions for \( r \) are not normally distributed, so we shouldn't use the standard error to compute a confidence interval.
Fisher (of course) discovered an approximate confidence interval:
Long way
z <- 0.5*log((1+r)/(1-r)) # Compute z
zsderr <- sqrt(1/(n-3)) # Std err of z
zlb <- z - 1.96*zsderr # Lower bound for z
zub <- z + 1.96*zsderr # Upper bound for z
rlb <- (exp(2*zlb) - 1)/(exp(2*zlb) + 1)
rub <- (exp(2*zub) - 1)/(exp(2*zub) + 1)
CI <- c(rlb, rub)
names(CI) <- c("lower", "upper")
CI
lower upper
-0.8453322 -0.1511871
Short way
hyenaCorr$conf.int
[1] -0.8453293 -0.1511968
attr(,"conf.level")
[1] 0.95
CI
lower upper
-0.8453322 -0.1511871
Testing for zero correlation
\[ H_{0}: \rho = 0\\ H_{A}: \rho \neq 0 \]
Test statistic \[ t = \frac{r}{\mathrm{SE}_{r}} \]
\( t \)-distributed with \( df = n-2 \) (estimated two parameters from data: \( \bar{X} \) and \( \bar{Y} \))
Long way
(tstat <- hyenaCorr$estimate/sderr)
cor
-2.819416
pval <- 2*pt(abs(tstat), df=n-2, lower.tail=FALSE)
unname(pval)
[1] 0.01364843
Short way
cor.test(hyenaData$Age_years, hyenaData$Fundamental_frequency_Hz)
Short way
Pearson's product-moment correlation
data: hyenaData$Age_years and hyenaData$Fundamental_frequency_Hz
t = -2.8194, df = 14, p-value = 0.01365
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.8453293 -0.1511968
sample estimates:
cor
-0.601798
As always, we assume the sample of individuals is a random sample from the population.
The measurements are assumed to come from a joint probability distribution \( p(X,Y) \) called a bivariate normal distribution (function of two variables).
Definition: The
Spearman’s correlation coefficient is the linear correlation coefficient computed on theranks of the data.
Practice Problem #3
As human populations became more urban from prehistory to the present, disease transmission between people likely increased. Over time, this might have led to the evolution of enhanced resistance to certain diseases in setted human populations. For example, a mutation in the SLC11A1 gene in humans causes resistance to tuberculosis. Barnes et al. (2011) examined the frequency of the resistant allele in different towns and villages in Europe and Asia and compared it to how long humans had been settled in the site (“duration of settlement”). If settlement led to the evolution of greater resistance to tuberculosis, there should be a positive association between the frequency of the resistant allele and the duration of settlement.
Load the data
'data.frame': 17 obs. of 4 variables:
$ settlement : Factor w/ 17 levels "Aksum","Angkor Borei",..: 6 14 8 13 11 5 15 2 16 7 ...
$ date : Factor w/ 17 levels "100 AD","100 BC",..: 14 12 10 9 4 17 16 11 2 13 ...
$ duration : int 8010 5260 4735 4010 3710 2810 2730 2310 2110 1955 ...
$ alleleFrequency: num 0.99 1 0.948 0.885 0.947 0.854 1 0.769 0.956 0.979 ...
Plot the data
Compute Spearman rank correlation coefficient
cor.test(TBData$duration, TBData$alleleFrequency, method="spearman")
Spearman's rank correlation rho
data: TBData$duration and TBData$alleleFrequency
S = 232.64, p-value = 0.001258
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.714899
Definition: A bias called
attenuation occurs with measurement error, whereby \( r \) tends to underestimate the magnitude of \( \rho \) (closer to zero on average than the true correlation).