Covariance and Correlation

Rasim Muzaffer Musal

Goals

Discuss motivations for correlation and regression
Mathematical foundation for correlation
Interpret correlation
Mathematical foundation for fitting regression
Fit a simple linear regression
Interpret regression output

Motivation for correlation

Be able to interpret the strength and direction of the linear relationship between pairs of continuous variables.
- How strong is the linear relationship between X and Y and do I expect the values of
Correlation can be used to discuss possible causal pathways (carefully!!!)
Correlation can be used to formulate a regression analysis.

Motivation for regression

Does X cause (statistical) Y?
If I increase X by 1 unit, how much do I expect Y to change by?
What do I expect Y to be if X is fixed to a particular value?
If applying Multiple Regression:
- When I have \(X_{1}\) in the model, does \(X_{2}\) still have an effect on \(Y\)?
- What is the expected value of Y when I have \(X_{1},\cdots,X_{K}\) in the model?

Correlation

Population \(\rho\), sample \(r\): \[\rho=\frac{Cov(X,Y)}{\sigma_{X}\sigma_{Y}},\quad r=\frac{Cov(X,Y)}{s_{X}s_{Y}}\]
\(-\infty <Cov(X,Y) <+\infty\) represents the quantification and direction of the linear relationship between 2 continuous variables X and Y.
Hard to interpret beyond the sign because it will have units of X multiplied with units of Y.
Beyond its role in calculating correlation, covariance is useful in its own right in more complicated models, as it has fewer restrictions than correlation.

Covariance Equation

\[Cov(X,Y)=E(XY)-E(X)E(Y)\] \[Cov(X,Y)=E[(X-E(X))(Y-E(Y))] \]

These equations resemble the calculation of the variance of a random variable.

\[\sigma_{X}^{2}=E[(X-\mu_{X})^{2}]\]

Whereas variance cannot be negative, covariance can.
E is the expected value operator. For population calculations, \(E(X)\) and \(\mu_{X}\) are the same. The sample covariance, however, is calculated slightly differently: the sum is divided by \(n-1\) instead of \(n\).

Covariance calculation example

X	Y	XY	\(x-E(X)\)	\(y-E(Y)\)	\((x-E(X))(y-E(Y))\)
1	9	9	1-5=-4	9-5=4	-4 \(\times\) 4=-16
5	5	25	5-5=0	5-5=0	0 \(\times\) 0=0
9	1	9	9-5=4	1-5=-4	4\(\times\) -4=-16

\(Cov(X,Y)=E[(x-E(X))(y-E(Y))]\), population (divide by \(n=3\)) and sample (divide by \(n-1=2\)):

\[\frac{-16+0+(-16)}{3}=-10.67,\quad\frac{-16+0+(-16)}{(3-1)}=-16\]
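
The arithmetic in the table can be reproduced step by step in R. The sketch below is only an illustration: the variable names (`dev_products`, `cov_population`, `cov_sample`, `cov_shortcut`) are made up for this example, and it checks that the shortcut form \(E(XY)-E(X)E(Y)\) agrees with the population calculation.

```r
# Data from the worked example
X <- c(1, 5, 9)
Y <- c(9, 5, 1)
n <- length(X)

# Products of deviations from the means: the last column of the table
dev_products <- (X - mean(X)) * (Y - mean(Y))

# Population covariance divides by n, sample covariance by n - 1
cov_population <- sum(dev_products) / n
cov_sample     <- sum(dev_products) / (n - 1)

# Shortcut form E(XY) - E(X)E(Y), which is the population version
cov_shortcut <- mean(X * Y) - mean(X) * mean(Y)

print(c(population = cov_population,
        shortcut   = cov_shortcut,
        sample     = cov_sample))
```

Note that the built-in `cov()` returns the sample version, so it matches `cov_sample` rather than `cov_population`.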

Covariance Example in Excel

Covariance Example in R

#Setup X and Y
X=c(1,5,9)
Y=c(9,5,1)
#Sample covariance
cov(X,Y)

[1] -16

#Swapping whole (X,Y) pairs leaves the covariance unchanged; reordering X or Y alone changes it.
#Setup X and Y with the first pair of values swapped with the third.
X=c(9,5,1)
Y=c(1,5,9)
cov(X,Y)

[1] -16

#Setup X and Y 
X=c(5,9,1)
Y=c(9,5,1)
#Sample covariance does change
cov(X,Y)

[1] 8

Correlation

Now that we discussed covariance let us remember \(\rho\) and \(r\).

\[\begin{equation}\rho=\frac{Cov(X,Y)}{\sigma_{X}\sigma_{Y}},r=\frac{Cov(X,Y)}{s_{X}s_{Y}}\end{equation}\]

Correlation is a way to standardize the covariance of X and Y.
Correlation is a number between -1 and 1 quantifying the strength and direction of the linear relationship between X and Y.
You can compare correlations between pairs of variables since it is unit-less.
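
A small sketch of the standardization idea, using simulated data (the seed, the sample size, and the names `r_manual` and `x_cm` are arbitrary choices for illustration). It computes the correlation by hand from the covariance and then rescales x, as if changing its units, to show that the covariance changes while the correlation does not.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable
x <- rnorm(1000)
y <- 0.5 * x + rnorm(1000)

# Correlation is the covariance divided by both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Rescale x, e.g. from metres to centimetres
x_cm <- 100 * x

# Covariance carries the units of x and y, so it scales by 100
print(c(cov_original = cov(x, y), cov_rescaled = cov(x_cm, y)))

# Correlation is unit-less, so all three values agree
print(c(cor_builtin  = cor(x, y),
        cor_manual   = r_manual,
        cor_rescaled = cor(x_cm, y)))
```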

Correlation: Calculations, Formulas

Positive Correlation

N=100000
# Generate independent variable
  x <- rnorm(n=N, mean=0, sd=1)
# Generate the dependent variable
  y <- 0.5*x +rnorm(n=N, mean = 0, sd = 1)
cor(x,y)

[1] 0.448321

plot(x,y)

Negative Correlation

N=100000
# Generate independent variable
  x <- rnorm(n=N, mean=0, sd=1)
# Generate the dependent variable
  y <- -0.5*x +rnorm(n=N, mean = 0, sd = 1)
cor(x,y)

[1] -0.4481473

plot(x,y)

No Correlation

N=100000
# Generate independent variable
  x <- rnorm(n=N, mean=0, sd=1)
# Generate the dependent variable
  y <- 0*x +rnorm(n=N, mean = 0, sd = 1)
cor(x,y)

[1] 0.00105148

plot(x,y)

Curious Cats: Eqn

How does \(Y=0.5 \times X+Z\), where Z is a standard normal random variable, lead to a correlation of 0.45?
No, it is not an approximation of 0.5.

\[Y=0.5 \times X + Z\] \[Cor(X,Y)=\frac{Cov(X,Y)}{\sigma_{X}\sigma_{Y}} \] \[Cov(X,Y)=E(XY)-E(X)E(Y)\] \[E(XY)=?\]

Curious Cats: Eqn

\[\begin{align} E(XY)=E(X \times (0.5 \times X+Z)) \\ E(XY)=E(0.5X^{2}+X \times Z)\\ E(XY)=E(0.5X^{2})+E(X \times Z)\\ E(XY)=0.5E(X^{2})+E(X \times Z) \end{align}\]

How do we find \(E(X^{2})\)?

Curious Cats: Discussion

Those of you who are practically minded might want to simulate a large number of \(X^{2}\) and compute the mean. This would provide an approximate solution and we can revert to this if we do not have an exact solution.
Can we do \(E(X)*E(X)\) to calculate \(E(X^{2})\)?

Curious Cats: Discussion

NO! This would only work if you have 2 r.v.s that are independent.
Obviously r.v. X is not independent of X. \(p(X=x|X=x)=1 \neq p(X=x)\)
Ok now think about what we know about the random variable X. \[X \sim N(0,1)\]
The r.v. X has a normal distribution with mean 0 and standard deviation 1.
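
The simulation approach mentioned earlier might look like the sketch below. The seed and the names `mean_of_square` and `square_of_mean` are arbitrary choices for illustration. With a large sample from \(N(0,1)\), the average of \(x^{2}\) should land near 1 while the square of the average should land near 0, which shows that the two quantities are not interchangeable.

```r
set.seed(1)  # arbitrary seed, chosen only for repeatability
N <- 1000000
x <- rnorm(n = N, mean = 0, sd = 1)

# Monte Carlo estimate of E(X^2): square first, then average
mean_of_square <- mean(x^2)

# Estimate of E(X) * E(X): average first, then square
square_of_mean <- mean(x)^2

print(c(mean_of_square = mean_of_square,
        square_of_mean = square_of_mean))
```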

Curious Cats: Discussion

In that case X is a random variable whose variance is 1 as well.
Why is this helpful? Think about the variance calculation!
\[\begin{aligned} \sigma_{X}^{2}=\frac{\sum_{i=1}^{N}(x_{i}-\mu)^{2}}{N} =E(X^{2})-[E(X)]^{2} \end{aligned}\]
How does this help us? Try to work it out.

Curious Cats: Solution

\[\begin{gather*} \sigma_{X}^{2}=E(X^{2})-[E(X)]^{2}\\ \sigma_{X}=1, \sigma_{X}^{2} = 1^{2}=1\\ 1=E(X^{2})-0^{2} \Rightarrow E(X^{2})=1\\ E(XY)=0.5 \times E(X^{2})+E(X \times Z)\\ E(XY)=0.5 \times 1 + E(X) \times E(Z) \quad \text{since} \quad X \perp Z\\ E(XY)=0.5+0 \times 0 = 0.5\\ Cov(X,Y) = E(XY) - E(X) \times E(Y) = 0.5-0 \times 0 = 0.5 \end{gather*}\]

Curious Cats: Solution

\[\begin{gather*} \rho=\frac{Cov(X,Y)}{\sigma_{X}\sigma_{Y}} =\frac{0.5}{1 \times \sigma_{Y}} \\ \sigma_{Y}=\sqrt{\sigma_{Y}^{2}} \\ \sigma_{Y}^{2}= Var(0.5\times X + Z)\\ =0.5^{2}\times \sigma_{X}^{2}+\sigma_{Z}^{2}+2\times 0.5 \times Cov(X,Z) \\ =0.25 \times 1 + 1 + 0 = 1.25\\ \rho=\frac{0.5}{\sqrt{1.25}}=0.4472136 \end{gather*}\]
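
The derivation can be checked numerically. In the sketch below, `rho_linear` is a hypothetical helper (not from the slides) that applies the same steps to \(Y=b \times X+Z\) with X and Z independent standard normals, giving \(\rho=b/\sqrt{b^{2}+1}\). The simulated `r_sim` should land close to, but not exactly on, the exact value.

```r
# Closed form for Y = b*X + Z, assuming X and Z are independent N(0,1):
# Cov(X,Y) = b, sd(X) = 1, sd(Y) = sqrt(b^2 + 1)
rho_linear <- function(b) b / sqrt(b^2 + 1)

rho_exact <- rho_linear(0.5)
print(rho_exact)  # 0.4472136

# Compare with a simulated sample correlation
set.seed(1)  # arbitrary seed for repeatability
N <- 100000
x <- rnorm(n = N, mean = 0, sd = 1)
y <- 0.5 * x + rnorm(n = N, mean = 0, sd = 1)
r_sim <- cor(x, y)
print(r_sim)
```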

Discussion

1 \(\rho\) is 0.4472136, \(r\) is not.
- Why is this?
2 What does it mean for \(\rho_{X,Y}\) to equal 0.45?
3 If you have 3 sets of correlations \(\rho_{X,Y}\) =0.45, \(\rho_{A,Y}\) = 0.75 and \(\rho_{A,X}\)=0.8 what are the conclusions you can draw?

Some Answers: 1

\(\rho\) is a population parameter whose value is calculated from known population parameters. \(r\) is a statistic calculated from the sample values. As \(n \rightarrow \infty\), \(r\) converges to \(\rho\) because \(r\) is a consistent estimator of \(\rho\).
- tldr; if the sample size is large enough, you would expect the value of r to be close to \(\rho\). However, since r is calculated from a finite sample, it will generally not equal the population parameter \(\rho\).
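
A rough way to see this is to compute r for increasingly large samples from the same model, \(Y=0.5 \times X+Z\). The sample sizes, the seed, and the names `sample_sizes` and `r_by_n` are arbitrary choices for this sketch. The gap between r and \(\rho\) tends to shrink as n grows, although it need not shrink at every single step.

```r
set.seed(1)  # arbitrary seed for repeatability
rho <- 0.5 / sqrt(1.25)  # population correlation derived earlier
sample_sizes <- c(10, 100, 1000, 10000, 100000)

# One sample correlation per sample size
r_by_n <- sapply(sample_sizes, function(n) {
  x <- rnorm(n)
  y <- 0.5 * x + rnorm(n)
  cor(x, y)
})

print(data.frame(n = sample_sizes,
                 r = r_by_n,
                 abs_error = abs(r_by_n - rho)))
```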

Some Answers: 2

If \(\rho_{X,Y}=0.45\), this means that as the values of X increase (decrease) you would expect Y to also increase (decrease). Some like to label correlations weak, moderate, or strong based on cutoff points that an important statistician suggested. These cutoffs are subjective, not dogma, so we avoid assigning adjectives in that manner.
- tldr; I would say that X and Y have a moderate positive linear relationship but the word moderate is subjective.

Some Answers: 3

The r.v. A has a stronger correlation to Y compared to X. This does not necessarily mean A causes Y and X just is trivially correlated to Y. Data by itself can not determine causal relationships but it should drive our search in building explanations.
Y can be caused by both X and A. In fact, perhaps X has a stronger relationship to Y when X and A are both in the model. Perhaps the relationship of A to Y is trivial for the problem we are trying to solve (ex: city population size and number of fires in the city).

Some Answers: 3

A and X are positively correlated. The correlation \(\rho_{A,X}\) is stronger than the correlation of either variable with Y. This large value indicates that, if we try to build a model to explain how Y occurs, one of these variables (A, X) might be causing the other.
None of these are given in the data but they are questions you need to think about when investigating the statistics from the data.
In the environment of automation it becomes more important to ask good questions.

Table of Correlations:Excel 1

Table of Correlations:Excel 2

- We have by default New Worksheet Ply Selected. This will create the table of correlations in a new worksheet.

Table of Correlations:Excel 3

\(r_{X,Y}=0.69\),\(r_{X,A}=0.67\),\(r_{A,C}=0.11\)
Diagonal has the value 1, perfect correlation.

Curious Cats:

Why does this table have missing rows?
- \(r_{X,Y}=r_{Y,X}\)
Why?

\[r_{X,Y}=\frac{Cov(X,Y)}{s_{X}s_{Y}}=\frac{Cov(Y,X)}{s_{Y}s_{X}}=r_{Y,X} \] \[Cov(X,Y)=E[XY]-E[X]E[Y]=E[YX]-E[Y]E[X]=Cov(Y,X)\]

Table of Correlations: R

#Read the dataset which has header labels (X,Y,A), 
#columns separated by tab delimited '\t'
data=read.table('C:/Users/rm84/Desktop/teaching/2333/2333_Linear_Regression_files/data.txt',header=TRUE,sep='\t')

cor(data)

          X         Y         A
X 1.0000000 0.4219863 0.6999944
Y 0.4219863 1.0000000 0.6101665
A 0.6999944 0.6101665 1.0000000

The read.table function reads the tab-delimited values.
R returns the full table instead of just the diagonal and lower triangle, since the correlation table can be used as an object in algorithms that require the whole table.
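
To make the comparison with the Excel layout concrete, the sketch below uses simulated stand-in data rather than the original file. The column names X, Y, A follow the slides, but the values and the names `sim_data`, `cor_full`, and `cor_lower` are made up for illustration. It confirms that the full table is symmetric and shows one way to blank out the upper triangle.

```r
set.seed(1)  # arbitrary seed for repeatability

# Simulated stand-in for the data set; the values are invented
X <- rnorm(200)
A <- 0.7 * X + rnorm(200)
Y <- 0.5 * A + rnorm(200)
sim_data <- data.frame(X = X, Y = Y, A = A)

# Full correlation table, as cor() returns it
cor_full <- cor(sim_data)
print(isSymmetric(cor_full))  # r(X,Y) equals r(Y,X), so the table is symmetric

# Hide the redundant upper triangle to mimic the Excel output
cor_lower <- cor_full
cor_lower[upper.tri(cor_lower)] <- NA
print(round(cor_lower, 2))
```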