Introduction to Correlation

Often of interest in analyzing data is measuring the strength of association between two variables. This allows the analyst to answer such questions as “Does X predict Y?” or “Is X related to Y?”. The most common measure of correlation is the Pearson product-moment correlation coefficient.

Correlation is often denoted by \(r\) and always ranges from -1 to 1. A coefficient close to 1 indicates a positive correlation, i.e., \(y\) increases as \(x\) increases. A coefficient near -1 represents a negative correlation, where \(y\) decreases as \(x\) increases. Values near 0 indicate weak or no linear correlation between the variables.
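
For a quick illustration, cor() can be applied to simulated variables whose relationship is known in advance (the seed, sample size, and coefficients below are arbitrary):

set.seed(1)
x <- rnorm(100)
cor(x, 2 * x + rnorm(100))   # near 1: strong positive correlation
cor(x, -2 * x + rnorm(100))  # near -1: strong negative correlation
cor(x, rnorm(100))           # near 0: little to no correlation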

\(r\) in the Pearson setting is calculated as:

\[ r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} \]

It should be noted that in simple linear regression, \(r\) is directly related to \(R^2\), the coefficient of determination: \(R^2\) is the square of \(r\). \(R^2\) is the proportion of variance in the \(y\) variable that is predictable from the \(x\) variable.
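
This relationship is easy to check numerically; the sketch below uses arbitrary simulated data to confirm that the \(R^2\) reported by lm() equals the squared correlation:

set.seed(2)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
# R^2 from a simple linear regression equals the squared correlation
all.equal(cor(x, y)^2, summary(lm(y ~ x))$r.squared)
## [1] TRUE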

Correlation will be explored with the cars dataset available in R, using base R functions alongside manual calculations to verify understanding of the measure.

Getting Started

Load the package and data that will be used in the example.

library(ggplot2)
data("cars")

The cars dataset contains 50 observations of the stopping distance (in feet) of cars traveling at various speeds (in mph). Since speed is independent of stopping distance, speed will be the \(x\) variable.
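
A quick look at the structure of the data confirms the two variables and 50 observations:

str(cars)
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...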

Measuring Correlation Between Two Variables

The relationship between speed and stopping distance can be visualized using a scatterplot. A linear model is drawn through the data to help represent the association of the two variables.

ggplot(cars, aes(x=speed, y=dist)) + 
  geom_point(color='blue') + 
  geom_smooth(method=lm, se=FALSE, fullrange=TRUE)

From the scatterplot, it’s readily apparent that stopping distance increases as speed increases, which should be expected on physical grounds: a faster car carries more kinetic energy and requires more distance to stop.

Although the scatterplot shows there is indeed a positive correlation between speed and stopping distance, the strength of that relationship is still unquantified. The strength of association can be found using the cor.test() function available in base R. I prefer this function over cor() as it provides additional information such as the p-value and confidence interval.

corr <- cor.test(x=cars$speed, y=cars$dist, method = 'pearson')
corr
## 
##  Pearson's product-moment correlation
## 
## data:  cars$speed and cars$dist
## t = 9.464, df = 48, p-value = 1.49e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6816422 0.8862036
## sample estimates:
##       cor 
## 0.8068949

The output shows a correlation coefficient of 0.807, confirming the strong positive linear relationship seen in the scatterplot. The p-value is also far below 0.05, so the null hypothesis of no correlation can be rejected in favor of the alternative.

Manually Calculating Correlation

As mentioned previously, Pearson’s \(r\) is calculated as:

\[ r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} \]

It’s easier to first define some recurring quantities that will be used in the calculations, namely the number of variables and observations, as well as the degrees of freedom.

vars <- ncol(cars)  # number of variables (2)
n <- nrow(cars)     # number of observations (50)
df <- n - vars      # degrees of freedom
df
## [1] 48

As in previous examples, the degrees of freedom are calculated as \(n - 2\), since two variables are evaluated.

First, find \(S_{xx}\) and \(S_{yy}\) (the sums of squared deviations of \(x\) and \(y\), respectively), as well as \(S_{xy}\) (the sum of cross-products of \(x\) and \(y\)).

The equations are defined as the following:

\[ S_{xy} = \sum_{i=1}^n x_i y_i - \frac{\sum_{i=1}^n x_i \sum_{i=1}^n y_i}{n} \] \[ S_{xx} = \sum_{i=1}^n x_i^2 - \frac{\left(\sum_{i=1}^n x_i\right)^2}{n} \] \[ S_{yy} = \sum_{i=1}^n y_i^2 - \frac{\left(\sum_{i=1}^n y_i\right)^2}{n} \]

After calculating these quantities, they can be substituted into the correlation equation.

# Compute the sums of squares and cross-products defined above, then r
sxy <- sum(cars$speed * cars$dist) - (sum(cars$speed) * sum(cars$dist)) / n
sxx <- sum(cars$speed^2) - sum(cars$speed)^2 / n
syy <- sum(cars$dist^2) - sum(cars$dist)^2 / n
r <- sxy / sqrt(sxx * syy)
r
## [1] 0.8068949
cor(cars)[[2]]
## [1] 0.8068949
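
Since these pieces may be reused, they can also be collected into a single function. Below is a minimal sketch (pearson_r is an arbitrary name, not a base R function):

pearson_r <- function(x, y) {
  # Sums of squares and cross-products, as derived above
  n <- length(x)
  sxy <- sum(x * y) - sum(x) * sum(y) / n
  sxx <- sum(x^2) - sum(x)^2 / n
  syy <- sum(y^2) - sum(y)^2 / n
  sxy / sqrt(sxx * syy)
}
pearson_r(cars$speed, cars$dist)
## [1] 0.8068949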

The t-value reported by cor.test() can be calculated as:

\[ t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} \]

Where \(r\) is equal to the calculated correlation coefficient.

The p-value can then be calculated as below. Note that cor.test() performs a two-sided test by default, so the one-tailed probability from pt() must be doubled to match its output. Either way, the result is far below 0.05 and leads to rejecting the null hypothesis.

t.value <- (r * sqrt(df)) / sqrt(1-r^2)
t.value
## [1] 9.46399
corr$statistic
##       t 
## 9.46399
p.value <- 2 * pt(q = t.value, df = df, lower.tail = FALSE)
p.value
## [1] 1.489836e-12
corr$p.value
## [1] 1.489836e-12

Correlation Confidence Intervals

Confidence intervals for Pearson’s correlation coefficient are based on Fisher’s r-to-z transformation. To form the interval, \(z\) is determined, the interval limits are computed in z-space, and the limits are then transformed back to \(r\)-space.

First, find the \(z\) value determined by:

\[ z = 0.5 \ln{\frac{1 + r}{1 - r}} \]

z <- .5 * log((1 + r) / (1 - r))

The expected value of \(z\), \(E[z]\), is found by:

\[ E[z] = 0.5 \ln{\frac{1 + \rho}{1 - \rho}} \]

where \(\rho\) is the population correlation, of which \(r\) is an estimate. The transformed \(z\) has a standard error of:

\[ \sigma_z = \sqrt{\frac{1}{n - 3}} \]

se.z <- sqrt(1 / (n - 3))  # standard error of the transformed z

With \(z\) and its standard error found, the upper and lower limits can be computed with the following:

\[ z \pm z_{\alpha/2} \, \sigma_z \]

Where \(z_{\alpha/2}\) is the critical value for the desired confidence level (1.96 for a 95% interval).

upper <- z + 1.96 * se.z
lower <- z - 1.96 * se.z

To get the confidence interval for \(r\), the limits must be transformed back. Therefore, the original equation for \(z\) must be solved for \(r\).

\[ z = 0.5 \ln{\frac{1 + r}{1 - r}} \]

After solving for \(r\), the equation takes the form (see below for a step-by-step solution):

\[ r = \frac{e^{2z} - 1}{1 + e^{2z}} \]

# Transform the z-space limits back to r-space
conf.upper <- (exp(2 * upper) - 1) / (exp(2 * upper) + 1)
conf.lower <- (exp(2 * lower) - 1) / (exp(2 * lower) + 1)
c(conf.lower, conf.upper)
## [1] 0.6816394 0.8862048
corr$conf.int
## [1] 0.6816422 0.8862036
## attr(,"conf.level")
## [1] 0.95
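
As with the coefficient itself, the interval calculation can be collected into one function. The sketch below assumes a 95% interval by default (fisher_ci is an arbitrary name, not part of base R):

fisher_ci <- function(x, y, crit = 1.96) {
  n <- length(x)
  r <- cor(x, y)
  z <- 0.5 * log((1 + r) / (1 - r))     # Fisher r-to-z transformation
  se.z <- sqrt(1 / (n - 3))             # standard error of z
  limits <- z + c(-1, 1) * crit * se.z  # interval limits in z-space
  (exp(2 * limits) - 1) / (exp(2 * limits) + 1)  # transform back to r-space
}
fisher_ci(cars$speed, cars$dist)
## [1] 0.6816394 0.8862048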

Aside: Moving from z-space to r-space

To get the equation:

\[ z = 0.5 \ln{\frac{1 + r}{1 - r}} \]

into the form:

\[ r = \frac{e^{2z} - 1}{1 + e^{2z}} \]

Multiply both sides by 2 and exponentiate to get:

\[ e^{2z} = \frac{1 + r}{1 - r} \]

Multiply both sides by \(1 - r\) to arrive at:

\[ 1 + r = e^{2z}(1 - r) \]

Expand the terms on the right side:

\[ 1 + r = e^{2z} - e^{2z}r \]

Subtract 1 from both sides and add \(e^{2z}r\) to both sides:

\[ r(1 + e^{2z}) = e^{2z} - 1 \]

Then solve for \(r\) by dividing by \(1 + e^{2z}\) to get:

\[ r = \frac{e^{2z} - 1}{1 + e^{2z}} \]
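
As a quick numerical check, applying the forward and inverse transformations in sequence recovers the original \(r\) calculated earlier:

z0 <- 0.5 * log((1 + r) / (1 - r))     # r-to-z
(exp(2 * z0) - 1) / (exp(2 * z0) + 1)  # z-to-r recovers r
## [1] 0.8068949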

Summary

Pearson’s correlation was explored using R’s base functions and manual calculations to verify the results. Correlation is a very useful technique for measuring the strength of linear association between two variables and is a common component of data analysis. There are other correlation statistics, such as Spearman’s rank correlation and Kendall’s tau, that will be examined in the future.

References

https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient

http://faculty.washington.edu/gloftus/P317-318/Useful_Information/r_to_z/PearsonrCIs.pdf