Reference
Field, A. P., Miles, J., & Field, Z. (2012). Discovering statistics using R. London: Sage.
Two methods: covariance and the correlation coefficient.
Covariance refers to how much two variables are associated (i.e., whether two variables covary). To understand covariance, you first need to understand the variance and standard deviation of a single variable. The variance represents the average squared amount by which the data vary from the mean; the standard deviation is its square root. The formula for variance (i.e., the square of the standard deviation \(\sigma\)) is:
\[\sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \overline{x})^2} {n-1}\]
which can equivalently be written with the squared deviation expanded as a product:
\[\sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \overline{x})(x_i - \overline{x})} {n-1}\]
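As a quick check, the variance formula can be computed by hand in base R and compared with the built-in var() function; the vector below uses the same values as the ads data in the example that follows:
x <- c(5, 4, 4, 6, 8) #same values as the ads vector below
sum((x - mean(x))^2) / (length(x) - 1) #variance by hand
## [1] 2.8
var(x) #built-in variance
## [1] 2.8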
When two variables are related, changes in one variable are met with similar changes in the other variable. Thus, when one variable deviates from its mean, the other variable should deviate in a similar way.
library(dplyr) #provides data_frame() and the %>% pipe used below
ads <- c(5, 4, 4, 6, 8) #number of ads on TV
sales <- c(8, 9, 10, 13, 15) #sales
salesDat <- data_frame(ads, ads - mean(ads), sales, sales - mean(sales))
salesDat
## Source: local data frame [5 x 4]
##
## ads ads - mean(ads) sales sales - mean(sales)
## 1 5 -0.4 8 -3
## 2 4 -1.4 9 -2
## 3 4 -1.4 10 -1
## 4 6 0.6 13 2
## 5 8 2.6 15 4
The two variables ads and sales clearly are related. They covary because as one variable deviates from its mean in one direction, the other variable deviates from its mean in the same direction.
When there is one variable, we square the deviations to get variance:
\[\sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \overline{x})(x_i - \overline{x})} {n-1}\]
When there are two variables, we multiply the deviation for one variable by the corresponding deviation for the second variable to get the cross-product deviations:
\[(x_i - \overline{x})(y_i - \overline{y})\]
then we sum the cross-product deviations,
\[\sum_{i=1}^{n}(x_i - \overline{x})(y_i - \overline{y})\]
and finally we average the sum of all cross-product deviations to get the covariance cov(x, y):
\[ cov(x, y) = \frac{\sum_{i=1}^{n}(x_i - \overline{x})(y_i - \overline{y})} {n-1} \]
To compute covariance in R:
salesDat %>%
  select(devAds = 2, devSales = 4) %>% #keep and rename the two deviation columns
  mutate(crossProdDev = devAds * devSales, #calculate cross-product deviations
         covariance = sum(crossProdDev) / (5 - 1)) #sum them and divide by n - 1
## Source: local data frame [5 x 4]
##
## devAds devSales crossProdDev covariance
## 1 -0.4 -3 1.2 4.25
## 2 -1.4 -2 2.8 4.25
## 3 -1.4 -1 1.4 4.25
## 4 0.6 2 1.2 4.25
## 5 2.6 4 10.4 4.25
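R's built-in cov() function returns the same value:
cov(ads, sales)
## [1] 4.25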
A positive covariance indicates that as one variable deviates from the mean, the other variable deviates in the same direction. A negative covariance indicates that as one variable deviates from the mean (e.g., increases), the other variable deviates in the opposite direction (e.g., decreases).
However, the size of the covariance depends on the scale of measurement: larger scale units lead to a larger covariance. To overcome this dependence on the measurement scale, we need to convert the covariance to a standard set of units through standardisation, dividing the covariance by the standard deviations (similar to how we compute z-scores).
With two variables, there are two standard deviations. We simply multiply the two standard deviations, \(\sigma_{x}\sigma_{y}\). Dividing the covariance by this product gives the standardised covariance, which is known as the correlation coefficient r:
\[ r = \frac{\sum_{i=1}^{n}(x_i - \overline{x})(y_i - \overline{y})} {(n-1)(\sigma_{x}\sigma_{y})} \]
r is also known as the Pearson product-moment correlation coefficient. To get r, simply divide the covariance (4.25) by the product of the two SDs:
(pearsonR <- 4.25 / (sd(salesDat$ads) * sd(salesDat$sales)))
## [1] 0.8711651
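The built-in cor() function confirms the result:
cor(salesDat$ads, salesDat$sales)
## [1] 0.8711651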
Properties of r: it ranges from -1 (a perfect negative relationship) through 0 (no linear relationship) to +1 (a perfect positive relationship).
z-scores are useful when the distribution of a variable is normal. However, r doesn't have a normally distributed sampling distribution. Fisher (1921) came up with a solution: an adjusted r whose sampling distribution is approximately normal:
\[z_{r} = \frac{1}{2} log_{e} (\frac{1 + r}{1 - r })\]
Thus, our adjusted r, which is approximately normally distributed, will be:
log( (1 + pearsonR) / (1 - pearsonR) ) / 2
## [1] 1.337892
with a standard error of:
\[SE_{z_{r}} = \frac{1}{\sqrt{N - 3}}\]
1 / sqrt((5 - 3))
## [1] 0.7071068
To calculate the z-score:
\[z = \frac{z_{r}}{SE_{z_{r}}}\]
(z <- (log( (1 + pearsonR) / (1 - pearsonR) ) / 2) / (1 / sqrt((5 - 3))))
## [1] 1.892065
To determine the p-value from the z-score:
pnorm(z, lower.tail = F) * 2
## [1] 0.05848227
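The steps above can be bundled into a small helper function; fisherZTest is a hypothetical name for this sketch, not a base R function:
fisherZTest <- function(r, n) {
  zr <- log((1 + r) / (1 - r)) / 2 #Fisher's z transformation
  se <- 1 / sqrt(n - 3) #standard error of z_r
  z <- zr / se
  p <- pnorm(z, lower.tail = F) * 2 #two-tailed p-value
  c(z = z, p = p)
}
fisherZTest(pearsonR, n = 5)
For the running example this returns z of about 1.8921 and p of about 0.0585, matching the values computed step by step above.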
R uses the t-statistic to determine the significance of Pearson’s r. The formula is:
\[t_{r} = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}\]
(tValue <- (pearsonR * sqrt(5 - 2)) / ( sqrt(1 - pearsonR ^ 2) ))
## [1] 3.073181
To determine the p-value:
pt(q = tValue, df = 3, lower.tail = F) * 2
## [1] 0.05442624
To determine the p-value with the cor.test function:
cor.test(salesDat$ads, salesDat$sales)$p.value
## [1] 0.05442624
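The object returned by cor.test also exposes the underlying t-statistic, which matches the hand computation above:
cor.test(salesDat$ads, salesDat$sales)$statistic
##        t
## 3.073181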
Two types: bivariate and partial correlation.