Reference
Field, A. P., Miles, J., & Field, Z. (2012). Discovering statistics using R. London: Sage.
Two methods: covariance and the correlation coefficient.
Covariance refers to how much two variables are associated (i.e., whether two variables covary). To understand covariance, you first need to understand the variance and standard deviation of a single variable. The variance represents the average squared amount by which the data vary from the mean; the standard deviation is its square root. The formula for variance (i.e., the square of the standard deviation \(\sigma\)) is:
\[\sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \overline{x})^2} {n-1}\]
which can equivalently be written with the squared deviation expanded as a product:
\[\sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \overline{x})(x_i - \overline{x})} {n-1}\]
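As a quick check, the variance formula can be computed by hand in base R and compared with the built-in var() function; the vector below uses the same values as the ads data in the example that follows:
x <- c(5, 4, 4, 6, 8) #same values as the ads vector below
sum((x - mean(x))^2) / (length(x) - 1) #variance by hand
## [1] 2.8
var(x) #built-in variance
## [1] 2.8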
When two variables are related, changes in one variable are met with similar changes in the other variable. Thus, when one variable deviates from its mean, the other variable should deviate in a similar way.
library(dplyr) #provides data_frame() and the %>% pipe used below
ads <- c(5, 4, 4, 6, 8) #number of ads on TV
sales <- c(8, 9, 10, 13, 15) #sales
salesDat <- data_frame(ads, ads - mean(ads), sales, sales - mean(sales))
salesDat
## Source: local data frame [5 x 4]
##
## ads ads - mean(ads) sales sales - mean(sales)
## 1 5 -0.4 8 -3
## 2 4 -1.4 9 -2
## 3 4 -1.4 10 -1
## 4 6 0.6 13 2
## 5 8 2.6 15 4
The two variables ads and sales clearly are related. They covary because as one variable deviates from its mean in one direction, the other variable deviates from its mean in the same direction.
When there is one variable, we square the deviations to get variance:
\[\sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \overline{x})(x_i - \overline{x})} {n-1}\]
When there are two variables, we multiply the deviation for one variable by the corresponding deviation for the second variable to get the cross-product deviations:
\[(x_i - \overline{x})(y_i - \overline{y})\]
then we sum the cross-product deviations,
\[\sum_{i=1}^{n}(x_i - \overline{x})(y_i - \overline{y})\]
and finally we average the sum of all cross-product deviations to get the covariance cov(x, y):
\[ cov(x, y) = \frac{\sum_{i=1}^{n}(x_i - \overline{x})(y_i - \overline{y})} {n-1} \]
To compute covariance in R:
salesDat %>%
  select(devAds = 2, devSales = 4) %>% #keep and rename the two deviation columns
  mutate(crossProdDev = devAds * devSales, #calculate cross-product deviations
         covariance = sum(crossProdDev) / (5 - 1)) #sum them and divide by n - 1
## Source: local data frame [5 x 4]
##
## devAds devSales crossProdDev covariance
## 1 -0.4 -3 1.2 4.25
## 2 -1.4 -2 2.8 4.25
## 3 -1.4 -1 1.4 4.25
## 4 0.6 2 1.2 4.25
## 5 2.6 4 10.4 4.25
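R's built-in cov() function returns the same value:
cov(ads, sales)
## [1] 4.25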
A positive covariance indicates that as one variable deviates from the mean, the other variable deviates in the same direction. A negative covariance indicates that as one variable deviates from the mean (e.g., increases), the other variable deviates in the opposite direction (e.g., decreases).
However, the size of the covariance depends on the scale of measurement: larger scale units lead to a larger covariance. To overcome this dependence on the measurement scale, we need to convert the covariance to a standard set of units through standardisation, dividing the covariance by the standard deviations (similar to how we compute z-scores).
With two variables, there are two standard deviations. We simply multiply the two standard deviations, \(\sigma_{x}\sigma_{y}\). Dividing the covariance by this product gives the standardised covariance, which is known as the correlation coefficient r:
\[ r = \frac{\sum_{i=1}^{n}(x_i - \overline{x})(y_i - \overline{y})} {(n-1)(\sigma_{x}\sigma_{y})} \]
r is also known as the Pearson product-moment correlation coefficient. To get r, simply divide the covariance (4.25) by the product of the two SDs:
(pearsonR <- 4.25 / (sd(salesDat$ads) * sd(salesDat$sales)))
## [1] 0.8711651
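The built-in cor() function confirms the result:
cor(salesDat$ads, salesDat$sales)
## [1] 0.8711651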
Properties of r: it ranges from -1 (a perfect negative relationship) through 0 (no linear relationship) to +1 (a perfect positive relationship).
z-scores are useful when the distribution of a variable is normal. However, r doesn't have a normally distributed sampling distribution. Fisher (1921) came up with a solution: an adjusted r whose sampling distribution is approximately normal:
\[z_{r} = \frac{1}{2} log_{e} (\frac{1 + r}{1 - r })\]
Thus, our adjusted r, which is approximately normally distributed, will be:
log( (1 + pearsonR) / (1 - pearsonR) ) / 2
## [1] 1.337892
with a standard error of:
\[SE_{z_{r}} = \frac{1}{\sqrt{N - 3}}\]
1 / sqrt((5 - 3))
## [1] 0.7071068
To calculate the z-score:
\[z = \frac{z_{r}}{SE_{z_{r}}}\]
(z <- (log( (1 + pearsonR) / (1 - pearsonR) ) / 2) / (1 / sqrt((5 - 3))))
## [1] 1.892065
To determine the p-value from the z-score:
pnorm(z, lower.tail = F) * 2
## [1] 0.05848227
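The steps above can be bundled into a small helper function; fisherZTest is a hypothetical name for this sketch, not a base R function:
fisherZTest <- function(r, n) {
  zr <- log((1 + r) / (1 - r)) / 2 #Fisher's z transformation
  se <- 1 / sqrt(n - 3) #standard error of z_r
  z <- zr / se
  p <- pnorm(z, lower.tail = F) * 2 #two-tailed p-value
  c(z = z, p = p)
}
fisherZTest(pearsonR, n = 5)
For the running example this returns z of about 1.8921 and p of about 0.0585, matching the values computed step by step above.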
R uses the t-statistic to determine the significance of Pearson’s r. The formula is:
\[t_{r} = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}\]
(tValue <- (pearsonR * sqrt(5 - 2)) / ( sqrt(1 - pearsonR ^ 2) ))
## [1] 3.073181
To determine the p-value:
pt(q = tValue, df = 3, lower.tail = F) * 2
## [1] 0.05442624
To determine the p-value with the cor.test function:
cor.test(salesDat$ads, salesDat$sales)$p.value
## [1] 0.05442624
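The object returned by cor.test also exposes the underlying t-statistic, which matches the hand computation above:
cor.test(salesDat$ads, salesDat$sales)$statistic
##        t
## 3.073181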
Two types: bivariate and partial correlation.