Correlation between numerical variables

M. Drew LaMar
April 13, 2016

Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital. - Aaron Levenstein

Do not put your faith in what statistics say until you have carefully considered what they do not say. - William W. Watt

Linear correlation coefficient

Variables: For a correlation, our data consist of two numerical variables (continuous or discrete).

Definition: The (linear) correlation coefficient \( \rho \) measures the strength and direction of the association between two numerical variables in a population.

The linear (Pearson) correlation coefficient measures the tendency of two numerical variables to co-vary in a linear way.

The symbol \( r \) denotes a sample estimate of \( \rho \).

Sample correlation coefficient

\[ r = \frac{\sum_{i}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sqrt{\sum_{i}(X_{i}-\bar{X})^2}\sqrt{\sum_{i}(Y_{i}-\bar{Y})^2}} \]

\[ -1 \leq r \leq 1 \]

Sample correlation coefficient

\[ r = \frac{\frac{1}{n-1}\sum_{i}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sqrt{\frac{1}{n-1}\sum_{i}(X_{i}-\bar{X})^2}\sqrt{\frac{1}{n-1}\sum_{i}(Y_{i}-\bar{Y})^2}} \]

Sample correlation coefficient

\[ r = \frac{\mathrm{Covariance}(X,Y)}{s_{X}s_{Y}} \]

Visual examples

Example - Using R

Practice Problem #1

Example - Using R

In their study of hyena laughter, or “giggling”, Mathevon et al. (2010) asked whether sound spectral properties of hyenas' giggles are associated with age. The accompanying figure and data show the giggle fundamental frequency (in hertz) and the age (in years) of 16 hyenas.

Example - Using R

Read in the data and take a peek

hyenaData <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter16/chap16q01Hyena%20GigglesAndAge.csv")
str(hyenaData)

'data.frame':   16 obs. of  2 variables:
 $ Age_years               : int  2 2 2 6 9 10 13 10 14 14 ...
 $ Fundamental_frequency_Hz: int  840 670 580 470 540 660 510 520 500 480 ...

Example - Using R

Plot in a scatter plot

plot(Fundamental_frequency_Hz ~ Age_years, data=hyenaData, cex =1.5, cex.lab=1.5)

plot of chunk unnamed-chunk-2

Example - Using R

Compute correlation coefficient

hyenaCorr <- cor.test(hyenaData$Age_years, hyenaData$Fundamental_frequency_Hz)
(r <- hyenaCorr$estimate)

      cor 
-0.601798

How do we quantify the uncertainty of this estimate?

We need the standard error and confidence intervals.

Standard error for r

\[ \mathrm{SE}_{r} = \sqrt{\frac{1-r^2}{n-2}} \]

n <- nrow(hyenaData)
sderr <- sqrt((1-r^2)/(n-2))
unname(sderr)

[1] 0.2134478

Confidence intervals for r

Note: Sampling distributions for \( r \) are not normally distributed, so we shouldn't use the standard error to compute a confidence interval.

Fisher (of course) discovered an approximate confidence interval:

Compute Fisher's \( z \)-transformation: \( z = 0.5\ln\left(\frac{1+r}{1-r}\right) \)
Compute standard error for \( z \): \( \sigma_{z} = \sqrt{\frac{1}{n-3}} \)
Compute approximate 95% confidence interval: \[ z - 1.96\sigma_{z} < \zeta < z + 1.96\sigma_{z} \]
Transform lower and upper bounds back: \[ r = \frac{e^{2z}-1}{e^{2z}+1} \]

Confidence intervals for r

Long way

z <- 0.5*log((1+r)/(1-r)) # Compute z
zsderr <- sqrt(1/(n-3))   # Std err of z
zlb <- z - 1.96*zsderr    # Lower bound for z
zub <- z + 1.96*zsderr    # Upper bound for z
rlb <- (exp(2*zlb) - 1)/(exp(2*zlb) + 1)
rub <- (exp(2*zub) - 1)/(exp(2*zub) + 1)
CI <- c(rlb, rub)
names(CI) <- c("lower", "upper")
CI

     lower      upper 
-0.8453322 -0.1511871

Confidence intervals for r

Short way

hyenaCorr$conf.int

[1] -0.8453293 -0.1511968
attr(,"conf.level")
[1] 0.95

CI

     lower      upper 
-0.8453322 -0.1511871

Hypothesis testing

Testing for zero correlation

\[ H_{0}: \rho = 0\\ H_{A}: \rho \neq 0 \]

Test statistic \[ t = \frac{r}{\mathrm{SE}_{r}} \]

\( t \)-distributed with \( df = n-2 \) (estimated two parameters from data: \( \bar{X} \) and \( \bar{Y} \))

Hypothesis testing

Long way

(tstat <- hyenaCorr$estimate/sderr)

      cor 
-2.819416

pval <- 2*pt(abs(tstat), df=n-2, lower.tail=FALSE)
unname(pval)

[1] 0.01364843

Hypothesis testing

Short way

cor.test(hyenaData$Age_years, hyenaData$Fundamental_frequency_Hz)

Hypothesis testing

Short way


    Pearson's product-moment correlation

data:  hyenaData$Age_years and hyenaData$Fundamental_frequency_Hz
t = -2.8194, df = 14, p-value = 0.01365
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.8453293 -0.1511968
sample estimates:
      cor 
-0.601798

Assumptions

As always, we assume the sample of individuals is a random sample from the population.

The measurements are assumed to come from a joint probability distribution \( p(X,Y) \) called a bivariate normal distribution (function of two variables).

Relationship between \( X \) and \( Y \) is linear.
Cloud of points in a scatter plot has a circular or elliptical shape.
Frequency distributions for \( X \) and \( Y \) (called marginals) separately are normal.

Bivariate normal distributions

Bivariate normal distribution with \( \rho = 0.7 \)

Bivariate normal distributions

plot of chunk unnamed-chunk-10

Bivariate normal distribution with \( \rho = 0.7 \)

Departures from bivariate normal distributions

Handling violations of assumptions

Use a transformation on \( X \) alone, \( Y \) alone, or both.
Non-parametric techniques (Spearman rank correlation)

Spearman's rank correlation coefficient

Definition: The Spearman’s correlation coefficient is the linear correlation coefficient computed on the ranks of the data.

Practice Problem #3

Example: Spearman

As human populations became more urban from prehistory to the present, disease transmission between people likely increased. Over time, this might have led to the evolution of enhanced resistance to certain diseases in setted human populations. For example, a mutation in the SLC11A1 gene in humans causes resistance to tuberculosis. Barnes et al. (2011) examined the frequency of the resistant allele in different towns and villages in Europe and Asia and compared it to how long humans had been settled in the site (“duration of settlement”). If settlement led to the evolution of greater resistance to tuberculosis, there should be a positive association between the frequency of the resistant allele and the duration of settlement.

Example: Spearman

Load the data

'data.frame':   17 obs. of  4 variables:
 $ settlement     : Factor w/ 17 levels "Aksum","Angkor Borei",..: 6 14 8 13 11 5 15 2 16 7 ...
 $ date           : Factor w/ 17 levels "100 AD","100 BC",..: 14 12 10 9 4 17 16 11 2 13 ...
 $ duration       : int  8010 5260 4735 4010 3710 2810 2730 2310 2110 1955 ...
 $ alleleFrequency: num  0.99 1 0.948 0.885 0.947 0.854 1 0.769 0.956 0.979 ...

Example: Spearman

Plot the data

plot of chunk unnamed-chunk-12

Example: Spearman

Compute Spearman rank correlation coefficient

cor.test(TBData$duration, TBData$alleleFrequency, method="spearman")


    Spearman's rank correlation rho

data:  TBData$duration and TBData$alleleFrequency
S = 232.64, p-value = 0.001258
alternative hypothesis: true rho is not equal to 0
sample estimates:
     rho 
0.714899

Correlation coefficient depends on range

Effect of measurement error on correlation

Definition: A bias called attenuation occurs with measurement error, whereby \( r \) tends to underestimate the magnitude of \( \rho \) (closer to zero on average than the true correlation).

Left: No measurement; Middle: Error in \( X \); Right: Error in both