Correlation between numerical variables

Alban Guillaumet, Troy University

Intro

When two numerical variables are associated, we say that they are correlated. For instance, brain size and body size are correlated across mammal species.

Linear correlation coefficient

Definition: The (linear, or Pearson’s) correlation coefficient \( \rho \) measures the strength and direction of the association between two numerical (continuous or discrete) variables in a population.

The correlation coefficient measures the tendency of two numerical variables to co-vary in a linear way.

The symbol \( r \) denotes a sample estimate of \( \rho \).

Sample correlation coefficient

\[ r = \frac{\sum_{i}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sqrt{\sum_{i}(X_{i}-\bar{X})^2}\sqrt{\sum_{i}(Y_{i}-\bar{Y})^2}} \]

\[ -1 \leq r \leq 1 \]

Equivalently, dividing every sum by \( n-1 \):

\[ r = \frac{\frac{1}{n-1}\sum_{i}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sqrt{\frac{1}{n-1}\sum_{i}(X_{i}-\bar{X})^2}\sqrt{\frac{1}{n-1}\sum_{i}(Y_{i}-\bar{Y})^2}} \]

That is, the sample covariance scaled by the two sample standard deviations:

\[ r = \frac{\mathrm{Covariance}(X,Y)}{s_{X}s_{Y}} \]
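
The three expressions for \( r \) are algebraically the same quantity. A minimal R sketch on simulated data (the vectors `x` and `y` and the seed are illustrative assumptions, not data from the lecture) checks each form against the built-in `cor()`:

```r
# Simulated data for illustration only (not from the lecture)
set.seed(1)
x <- rnorm(30)
y <- 0.5 * x + rnorm(30)

dx <- x - mean(x)   # deviations of X from its mean
dy <- y - mean(y)   # deviations of Y from its mean
n  <- length(x)

# Form 1: sum of products over the square roots of the sums of squares
r_sums <- sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))

# Form 2: every sum divided by n - 1
r_scaled <- (sum(dx * dy) / (n - 1)) /
  (sqrt(sum(dx^2) / (n - 1)) * sqrt(sum(dy^2) / (n - 1)))

# Form 3: covariance over the product of the standard deviations
r_cov <- cov(x, y) / (sd(x) * sd(y))

# Each comparison should return TRUE (agreement up to floating-point error)
all.equal(r_sums,   cor(x, y))
all.equal(r_scaled, cor(x, y))
all.equal(r_cov,    cor(x, y))
```

The \( n-1 \) factors cancel between numerator and denominator, which is why the first two forms give the same value.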

Visual examples

Correlation coefficient depends on range
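
One way to see this is with a small simulation. The sketch below uses simulated data only; the population correlation of 0.7, the sample size, the seed, and the cut-off of 0.5 are arbitrary assumptions chosen for illustration. It compares \( r \) over the full range of \( X \) with \( r \) computed from a narrow band of \( X \) values:

```r
# Simulated data for illustration only (not from the lecture)
set.seed(2)
n   <- 5000
rho <- 0.7                                   # assumed population correlation
x   <- rnorm(n)
y   <- rho * x + sqrt(1 - rho^2) * rnorm(n)  # population correlation of X and Y is rho

# Correlation over the full range of X
r_full <- cor(x, y)

# Correlation using only observations from a narrow band of X values
keep         <- abs(x) < 0.5
r_restricted <- cor(x[keep], y[keep])

c(full = r_full, restricted = r_restricted)
```

With these settings the restricted-range estimate should be clearly closer to zero, even though both subsets come from the same population.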

Effect of measurement error on correlation

Definition: Attenuation is a bias caused by measurement error: \( r \) tends to underestimate the magnitude of \( \rho \), so it is closer to zero on average than the true correlation.

[Figure, three scatter plots. Left: no error; middle: error in \( X \); right: error in both \( X \) and \( Y \)]
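
Attenuation can be reproduced in a rough simulation. In the sketch below everything is an assumption made for illustration (simulated data, a true correlation of 0.8, and independent normal measurement error with standard deviation 1); it mirrors the three cases of no error, error in \( X \), and error in both variables:

```r
# Simulated data for illustration only (not from the lecture)
set.seed(3)
n   <- 5000
rho <- 0.8                                             # assumed true correlation
x_true <- rnorm(n)
y_true <- rho * x_true + sqrt(1 - rho^2) * rnorm(n)

# Add independent measurement error (an error SD of 1 is an arbitrary choice)
x_obs <- x_true + rnorm(n, sd = 1)
y_obs <- y_true + rnorm(n, sd = 1)

r_none <- cor(x_true, y_true)  # no measurement error
r_x    <- cor(x_obs,  y_true)  # error in X only
r_both <- cor(x_obs,  y_obs)   # error in both X and Y

c(none = r_none, x_only = r_x, both = r_both)
```

The estimates should shrink toward zero as error is added to one variable and then to both.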

Example - Using R

Practice Problem #1

In their study of hyena laughter, or “giggling”, Mathevon et al. (2010) asked whether sound spectral properties of hyenas' giggles are associated with age. The accompanying figure and data show the giggle fundamental frequency (in hertz) and the age (in years) of 16 hyenas.

Read in the data and take a peek

hyenaData <- read.csv("C:/Alban/WM/TEACHING/Biostats/Alban/My class/Lectures/W&S data sets/chapter16/chap16q01HyenaGigglesAndAge.csv")
str(hyenaData)

'data.frame':   16 obs. of  2 variables:
 $ Age_years               : int  2 2 2 6 9 10 13 10 14 14 ...
 $ Fundamental_frequency_Hz: int  840 670 580 470 540 660 510 520 500 480 ...

Draw a scatter plot

plot(Fundamental_frequency_Hz ~ Age_years, data = hyenaData,
     xlab = "Age (years)", ylab = "Fundamental Frequency (Hz)",
     pch = 19, col = "forestgreen", cex = 2.5, cex.lab = 1.5)

[Figure: scatter plot of fundamental frequency (Hz) against age (years) for the 16 hyenas]

Compute correlation coefficient

cor(hyenaData$Age_years,hyenaData$Fundamental_frequency_Hz)

[1] -0.601798

hyenaCorr <- cor.test(hyenaData$Age_years, hyenaData$Fundamental_frequency_Hz)
(r <- hyenaCorr$estimate)

      cor 
-0.601798

How do we quantify the uncertainty of this estimate?

We need the standard error and confidence intervals.

Standard error for r

\[ \mathrm{SE}_{r} = \sqrt{\frac{1-r^2}{n-2}} \]

n <- nrow(hyenaData)
sderr <- sqrt((1-r^2)/(n-2))
unname(sderr)

[1] 0.2134478

Confidence intervals for r

~~Long way: Don't need to learn it!~~

z <- 0.5*log((1+r)/(1-r)) # Compute z
zsderr <- sqrt(1/(n-3))   # Std err of z
zlb <- z - 1.96*zsderr    # Lower bound for z
zub <- z + 1.96*zsderr    # Upper bound for z
rlb <- (exp(2*zlb) - 1)/(exp(2*zlb) + 1)
rub <- (exp(2*zub) - 1)/(exp(2*zub) + 1)
CI <- c(rlb, rub)
names(CI) <- c("lower", "upper")
CI

     lower      upper 
-0.8453322 -0.1511871

Short way

hyenaCorr$conf.int

[1] -0.8453293 -0.1511968
attr(,"conf.level")
[1] 0.95
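
The long-way and short-way intervals differ slightly in the fifth decimal place. This is consistent with the long way using the rounded critical value 1.96, whereas `qnorm(0.975)` is about 1.959964. As a minimal sketch, the same calculation can be written with base R's `atanh()` and `tanh()`, starting from the values of \( r \) and \( n \) reported above (the object names `z_se`, `zcrit`, and `ci` are illustrative):

```r
# Values reported above for the hyena data
r <- -0.601798
n <- 16

z     <- atanh(r)        # Fisher's z; same as 0.5 * log((1 + r) / (1 - r))
z_se  <- 1 / sqrt(n - 3) # standard error of z
zcrit <- qnorm(0.975)    # exact two-sided 95% critical value instead of 1.96

# Interval on the z scale, back-transformed to the r scale with tanh()
ci <- tanh(z + c(-1, 1) * zcrit * z_se)
names(ci) <- c("lower", "upper")
ci
```

`tanh()` is the inverse of `atanh()`, so it plays the role of the `(exp(2*z) - 1)/(exp(2*z) + 1)` back-transformation in the long way.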

Hypothesis testing

Testing for zero correlation

\[ \begin{aligned} H_{0}&: \rho = 0 \\ H_{A}&: \rho \neq 0 \end{aligned} \]

Test statistic \[ t = \frac{r}{\mathrm{SE}_{r}} \]

Under \( H_{0} \), \( t \) follows a \( t \)-distribution with \( df = n-2 \) degrees of freedom (two are lost because \( \bar{X} \) and \( \bar{Y} \) were estimated from the data).

~~Long way: Don't need to learn it!~~

(tstat <- hyenaCorr$estimate/sderr)

      cor 
-2.819416

pval <- 2*pt(abs(tstat), df=n-2, lower.tail=FALSE)
unname(pval)

[1] 0.01364843

Short way

cor.test(hyenaData$Age_years, 
         hyenaData$Fundamental_frequency_Hz)


    Pearson's product-moment correlation

data:  hyenaData$Age_years and hyenaData$Fundamental_frequency_Hz
t = -2.8194, df = 14, p-value = 0.01365
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.8453293 -0.1511968
sample estimates:
      cor 
-0.601798

Assumptions

As always, we assume the sample of individuals is a random sample from the population.

The measurements are also assumed to come from a bivariate normal distribution, a joint probability distribution whose density \( p(X,Y) \) is a function of two variables. Under a bivariate normal distribution:

The relationship between \( X \) and \( Y \) is linear.
The cloud of points in a scatter plot has a circular or elliptical shape.
The frequency distributions of \( X \) and \( Y \) separately are normal.

Bivariate normal distributions

Bivariate normal distribution with \( \rho = 0.7 \)
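
A sample from such a distribution can be simulated in base R. The sketch below is an illustration under stated assumptions (standard normal margins, sample size of 5000, an arbitrary seed), not the code behind the original figure. It builds \( Y \) from \( X \) and independent noise so that the population correlation is 0.7:

```r
# Simulated data for illustration only (not from the lecture)
set.seed(4)
n   <- 5000
rho <- 0.7

# X and Y are each standard normal, with population correlation rho
x <- rnorm(n)
y <- rho * x + sqrt(1 - rho^2) * rnorm(n)

cor(x, y)  # sample estimate, expected to be close to 0.7

# The scatter plot should show a roughly elliptical cloud of points
plot(x, y, pch = 19, cex = 0.4, col = "forestgreen",
     xlab = "X", ylab = "Y")
```

Calling `hist(x)` and `hist(y)` afterwards gives a quick visual check that each variable separately looks approximately normal.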
