Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital. - Aaron Levenstein
Do not put your faith in what statistics say until you have carefully considered what they do not say. - William W. Watt
Course Announcements
Reading Assignments (no quiz):
Chapter 16: Correlation
Chapter 17: Regression
Non-Homework Posted (Chapters 16-17)
PLEASE COMPLETE COURSE EVALUATIONS!
Linear correlation coefficient
Variables: For a correlation, our data consist of two numerical variables (continuous or discrete).
Definition: The (linear) correlation coefficient \(\rho\) measures the strength and direction of the association between two numerical variables in a population.
The linear (Pearson) correlation coefficient measures the tendency of two numerical variables to co-vary in a linear way.
The symbol \(r\) denotes a sample estimate of \(\rho\).
Sample correlation coefficient
\[
r = \frac{\sum_{i}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sqrt{\sum_{i}(X_{i}-\bar{X})^2}\sqrt{\sum_{i}(Y_{i}-\bar{Y})^2}}
\]
\[-1 \leq r \leq 1\]
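As a rough illustration of the formula above, here is a minimal Python sketch that computes \(r\) directly from the sums of deviations. The function name `pearson_r` and the five data points are made up for illustration; they are not part of the course data.

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient from sums of deviations about the means."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: sum of products of deviations.
    sum_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator: square roots of the sums of squared deviations.
    sum_xx = sum((xi - x_bar) ** 2 for xi in x)
    sum_yy = sum((yi - y_bar) ** 2 for yi in y)
    return sum_xy / (math.sqrt(sum_xx) * math.sqrt(sum_yy))

# Hypothetical values chosen so the arithmetic is easy to check by hand.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
r = pearson_r(x, y)
print(round(r, 3))
```

Here the sum of products is 8 and both sums of squares are 10, so \(r = 8/10 = 0.8\). Reversing the sign of every \(Y\) value flips the sign of \(r\) without changing its magnitude.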
Sample correlation coefficient
\[
r = \frac{\frac{1}{n-1}\sum_{i}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sqrt{\frac{1}{n-1}\sum_{i}(X_{i}-\bar{X})^2}\sqrt{\frac{1}{n-1}\sum_{i}(Y_{i}-\bar{Y})^2}}
\]
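The two expressions for \(r\) above are equal because the \(\frac{1}{n-1}\) factors cancel between numerator and denominator. Written with shorthand that is assumed here rather than taken from the slides (sample covariance \(\operatorname{Cov}(X,Y)\) and sample standard deviations \(s_X\) and \(s_Y\)), the second form reads:
\[
r = \frac{\operatorname{Cov}(X,Y)}{s_X \, s_Y}, \qquad \operatorname{Cov}(X,Y) = \frac{1}{n-1}\sum_{i}(X_{i}-\bar{X})(Y_{i}-\bar{Y}), \qquad s_X = \sqrt{\frac{1}{n-1}\sum_{i}(X_{i}-\bar{X})^2}
\]
with \(s_Y\) defined in the same way for \(Y\).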
In their study of hyena laughter, or “giggling,” Mathevon et al. (2010) asked whether sound spectral properties of hyenas’ giggles are associated with age. The accompanying figure and data show the giggle fundamental frequency (in hertz) and the age (in years) of 16 hyenas.
Pearson's product-moment correlation
data: hyenaData$Age_years and hyenaData$Fundamental_frequency_Hz
t = -2.8194, df = 14, p-value = 0.01365
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.8453293 -0.1511968
sample estimates:
cor
-0.601798
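The output above does not show how the test statistic and interval are obtained. The sketch below uses the standard formulas (a \(t\) statistic with \(n-2\) degrees of freedom, and an approximate confidence interval from Fisher's \(z\)-transformation) and starts only from the reported \(r\) and the sample size of 16. It is a hand check in Python, not the course's own code, and the variable names are illustrative.

```python
import math
from statistics import NormalDist

r = -0.601798  # sample correlation reported in the output above
n = 16         # number of hyenas in the sample

# t statistic for testing H0: rho = 0, with n - 2 degrees of freedom.
df = n - 2
t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)

# Approximate 95% confidence interval via Fisher's z-transformation:
# z = atanh(r) is roughly normal with standard error 1 / sqrt(n - 3).
z = math.atanh(r)
se_z = 1 / math.sqrt(n - 3)
z_crit = NormalDist().inv_cdf(0.975)
ci_low = math.tanh(z - z_crit * se_z)
ci_high = math.tanh(z + z_crit * se_z)

print(f"t = {t:.4f}, df = {df}")
print(f"95% CI: ({ci_low:.4f}, {ci_high:.4f})")
```

These values agree with the reported \(t = -2.8194\), \(df = 14\), and the interval from about \(-0.845\) to \(-0.151\). The interval is asymmetric around \(r\) because it is computed on the transformed scale and then converted back.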
Assumptions
As always, we assume the sample of individuals is a random sample from the population.
The measurements are assumed to come from a joint probability distribution \(p(X,Y)\) called a bivariate normal distribution (function of two variables).
Relationship between \(X\) and \(Y\) is linear.
Cloud of points in a scatter plot has a circular or elliptical shape.
Frequency distributions for \(X\) and \(Y\) (called marginals) separately are normal.
Bivariate normal distributions
Bivariate normal distribution with \(\rho = 0.7\)
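To get a feel for what a bivariate normal sample with \(\rho = 0.7\) looks like numerically, the following sketch simulates one using only Python's standard library. It assumes standard normal marginals (mean 0, variance 1), which is a simplifying choice made here; the function names, sample size, and seed are arbitrary.

```python
import math
import random

def pearson_r(x, y):
    """Sample correlation coefficient."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sum_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sum_xx = sum((a - x_bar) ** 2 for a in x)
    sum_yy = sum((b - y_bar) ** 2 for b in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

def simulate_bivariate_normal(n, rho, seed=1):
    """Draw n pairs (X, Y) with standard normal marginals and correlation rho."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        xs.append(z1)
        # Mixing z1 with independent noise z2 gives Var(Y) = 1 and Cor(X, Y) = rho.
        ys.append(rho * z1 + math.sqrt(1 - rho ** 2) * z2)
    return xs, ys

x, y = simulate_bivariate_normal(n=5000, rho=0.7)
r = pearson_r(x, y)
print(round(r, 2))
```

With a sample this large, the estimate \(r\) should land close to the population value of 0.7; smaller samples will scatter more widely around it.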
Departures from bivariate normal distributions
Handling violations of assumptions
Use a transformation on \(X\) alone, \(Y\) alone, or both.
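Definition: Spearman’s rank correlation coefficient is the linear correlation coefficient computed on the ranks of the data. Because it uses only ranks, it measures monotonic association and does not require bivariate normality.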
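The definition of Spearman's coefficient can be sketched in a few lines: rank each variable (giving tied values their average rank), then apply the Pearson formula to the ranks. The helper names and the small data set below are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Sample (Pearson) correlation coefficient."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sum_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sum_xx = sum((a - x_bar) ** 2 for a in x)
    sum_yy = sum((b - y_bar) ** 2 for b in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_r(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data: Y increases with X, but not linearly.
x = [1, 2, 3, 4, 5, 6]
y = [xi ** 3 for xi in x]
r_pearson = pearson_r(x, y)
r_spearman = spearman_r(x, y)
print(round(r_pearson, 3), round(r_spearman, 3))
```

Because \(Y\) is a strictly increasing function of \(X\) here, the ranks agree perfectly and the Spearman coefficient equals 1, while the Pearson coefficient on the raw values is below 1.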
Practice Problem #3
Example: Spearman
As human populations became more urban from prehistory to the present, disease transmission between people likely increased. Over time, this might have led to the evolution of enhanced resistance to certain diseases in settled human populations. For example, a mutation in the SLC11A1 gene in humans causes resistance to tuberculosis. Barnes et al. (2011) examined the frequency of the resistant allele in different towns and villages in Europe and Asia and compared it to how long humans had been settled in the site (“duration of settlement”). If settlement led to the evolution of greater resistance to tuberculosis, there should be a positive association between the frequency of the resistant allele and the duration of settlement.
Spearman's rank correlation rho
data: TBData$duration and TBData$alleleFrequency
S = 232.64, p-value = 0.001258
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.714899
Correlation coefficient depends on range
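A small simulation can illustrate the dependence on range. The sketch below draws points from a population with \(\rho = 0.7\), then recomputes \(r\) using only the points whose \(X\) value falls in a narrow window. The window width, sample size, and seed are arbitrary choices for illustration.

```python
import math
import random

def pearson_r(x, y):
    """Sample correlation coefficient."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sum_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sum_xx = sum((a - x_bar) ** 2 for a in x)
    sum_yy = sum((b - y_bar) ** 2 for b in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

rng = random.Random(42)
rho = 0.7
x_all, y_all = [], []
for _ in range(5000):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    x_all.append(z1)
    y_all.append(rho * z1 + math.sqrt(1 - rho ** 2) * z2)

# Keep only the points whose X value lies in a narrow window around zero.
narrow = [(a, b) for a, b in zip(x_all, y_all) if abs(a) < 0.5]
x_narrow = [a for a, _ in narrow]
y_narrow = [b for _, b in narrow]

r_full = pearson_r(x_all, y_all)
r_narrow = pearson_r(x_narrow, y_narrow)
print(round(r_full, 2), round(r_narrow, 2))
```

The relationship between \(X\) and \(Y\) is the same in both subsets, yet the correlation over the restricted range is much weaker. Correlations from studies that sampled different ranges of \(X\) are therefore not directly comparable.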
Effect of measurement error on correlation
Definition: A bias called attenuation occurs with measurement error, whereby \(r\) tends to underestimate the magnitude of \(\rho\) (closer to zero on average than the true correlation).
Left: No measurement error; Middle: Error in \(X\); Right: Error in both \(X\) and \(Y\)
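Attenuation can be seen in a short simulation. The sketch below generates "true" values with \(\rho = 0.7\), then adds independent random measurement error to \(X\) alone and to both variables. The error standard deviation of 1 (equal to the spread of the true values), the sample size, and the seed are assumptions chosen for illustration.

```python
import math
import random

def pearson_r(x, y):
    """Sample correlation coefficient."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sum_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sum_xx = sum((a - x_bar) ** 2 for a in x)
    sum_yy = sum((b - y_bar) ** 2 for b in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

rng = random.Random(7)
rho = 0.7
error_sd = 1.0  # assumed size of the measurement error

x_true, y_true = [], []
for _ in range(5000):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    x_true.append(z1)
    y_true.append(rho * z1 + math.sqrt(1 - rho ** 2) * z2)

# Observed values = true values + independent measurement error.
x_obs = [a + rng.gauss(0.0, error_sd) for a in x_true]
y_obs = [b + rng.gauss(0.0, error_sd) for b in y_true]

r_none = pearson_r(x_true, y_true)  # no measurement error
r_x = pearson_r(x_obs, y_true)      # error in X only
r_both = pearson_r(x_obs, y_obs)    # error in both X and Y
print(round(r_none, 2), round(r_x, 2), round(r_both, 2))
```

Each added source of error pulls \(r\) closer to zero even though the underlying relationship has not changed, which is the pattern described by the three panels.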