Correlation

Simple Correlation

Two variates X and Y are said be correlated if the increase or decrease of one variable affects the increase or decrease in another variable.

Positive Correlation:

X and Y are said to be positively correlated if Y increases if X increases and Y decreases if X decreases and vice versa. ie their both increases or both decreases.

Negative Correlation:

X and Y are said to be positively correlated if Y decreases if X increases and Y increases if X decreases and vice versa. ie their both increases or both decreases.

Zero Correlation:

If there is no change in Y with the increase of decrease in X

Scatter Diagram

If \((x_i,y_i),i=1,2,\ldots,N\) be n paired data of X and Y, by taking one variable on X axis and the other on Y axis, the point plot is a scatter diagram.

Merits: 1. It is simple to draw. 2. Give quick visualization. 3. Good starting point.

Demerits: 1. It is rough and not accurate.

x=c(84, 66, 68, 129, 90, 91, 74, 76, 70, 85, 122, 74, 104, 97, 104, 91)
y=c(86, 82, 92, 151, 96, 86, 99, 88, 87, 98, 125, 85, 127, 108, 104, 102)
plot(x,y,main='positive correlated data',pch=16)
abline(h=mean(y),col=2)
abline(v=mean(x),col=4)
points(mean(x),mean(y),pch=14)

Karl Pearson Coefficient of Correlation

The coefficient of correlation \(r_{xy}\) is \[ r_{xy} = \frac{COV(X,Y)}{\sqrt{Var(X)Var(Y)}} \]

A simple to compute formula

\[ r_{xy} = \frac{N\sum X\, Y-\sum X \, \sum Y}{ \sqrt{\left[ N\, \sum X^2 - (\sum X)^2 \right] \left[ N\, \sum Y^2 - (\sum Y)^2 \right] }} \]

Example 1 Compute the variances of X and Y. Also find covariance of X and Y to the following data

x 84 66 68 129 90 91 74 76 70 85 122 74 104 97 104 91

y 86 82 92 151 96 86 99 88 87 98 125 85 127 108 104 102

cov_xy=sum((x-mean(x))*(y-mean(y)))/length(x)

data.frame(mean(x),mean(y),cov_xy)
##   mean.x. mean.y. cov_xy
## 1 89.0625     101 291.75
# sample correlation coefficient divides (n-1) insteads of n
cov(x,y)
## [1] 311.2
## to find simple correlation

x=c(84, 66, 68, 129, 90, 91, 74, 76, 70, 85, 122, 74, 104, 97, 104, 91)
y=c(86, 82, 92, 151, 96, 86, 99, 88, 87, 98, 125, 85, 127, 108, 104, 102)

##sample size
N=length(x)


## shift the origin A=90 and B=110
A=90
B=110

u=x-A
v=y-B
uv=u*v
u2=u*u
v2=v*v


df=data.frame(x,y,u,v,uv,u2,v2)
srow=apply(df,2,sum)
df = rbind(df,srow)
df
##       x    y   u    v   uv   u2   v2
## 1    84   86  -6  -24  144   36  576
## 2    66   82 -24  -28  672  576  784
## 3    68   92 -22  -18  396  484  324
## 4   129  151  39   41 1599 1521 1681
## 5    90   96   0  -14    0    0  196
## 6    91   86   1  -24  -24    1  576
## 7    74   99 -16  -11  176  256  121
## 8    76   88 -14  -22  308  196  484
## 9    70   87 -20  -23  460  400  529
## 10   85   98  -5  -12   60   25  144
## 11  122  125  32   15  480 1024  225
## 12   74   85 -16  -25  400  256  625
## 13  104  127  14   17  238  196  289
## 14   97  108   7   -2  -14   49    4
## 15  104  104  14   -6  -84  196   36
## 16   91  102   1   -8   -8    1   64
## 17 1425 1616 -15 -144 4803 5217 6658
#product moment formula
r_uv=(N*sum(uv)-sum(u)*sum(v))/
(sqrt(N*sum(u2)-sum(u)^2)*sqrt(sum(N*v2)-sum(v)^2))

data.frame(r_uv=round(r_uv,3))  
##    r_uv
## 1 0.884
#pearsons formula

pearson_r=cov(u,v)/sqrt(var(u)*var(v))

data.frame(pearson_r=round(pearson_r,3))
##   pearson_r
## 1     0.884

The variables have strong positive correlation.

f<-function(x)  x
curve(f(x),-1,1,lwd=2,col=2,main='linear correlation')
text(0.65,0.5,'+ve r',col=2)
curve(-f(x),-1,1,lwd=2,col=4,add = TRUE,lty=2)
text(0.65,-0.50,'-ve r',col=4)

Properties of Correlation

  • Correlation coefficient is purely real no
  • It lies in -1 and +1
  • r +1 means strong positive correlation, -1 negative correlation and
  • zero means No linear correlation
  • Invariant in magnitude under linear transformation \(|r_{xy}|=|r_{uv}|\) where \(u=\frac{X-a}{b}\) and \(v=\frac{y-c}{d}\).

\[ r_{xy}=\frac{b\times d}{|b|\times |d|}\, r_{uv} \]

If X and Y are independent, then correlation of X and Y is zero and the converse not necessarily true. Below is counter example

x=c(-3, -2, -1, 1, 2,3)
y=c(9, 4, 1, 1, 4, 9) 
df=data.frame(x,y,xy=x*y,x2=x*x,y2=y*y)
df
##    x y  xy x2 y2
## 1 -3 9 -27  9 81
## 2 -2 4  -8  4 16
## 3 -1 1  -1  1  1
## 4  1 1   1  1  1
## 5  2 4   8  4 16
## 6  3 9  27  9 81
apply(df,2,sum)
##   x   y  xy  x2  y2 
##   0  28   0  28 196
print(c('correlation coefficient is 0 though y=x^2'))
## [1] "correlation coefficient is 0 though y=x^2"
rm(list=ls())

Example-2

Calculate the correlauon coefficient for ihe following heights (in inches) of father’s (X) and his sons (Y) :

X : 65 66 67 67 68 69 70 72

Y: 61 68 6S 68 12 12 69 71

Method-1

x = c(65, 66, 67, 67, 68, 69, 70, 72) 
y = c(67, 68, 65, 68, 72, 72, 69, 71)
df=data.frame(x,y,xy=x*y,x2=x*x,y2=y*y)
srow=apply(df,2,sum)
rbind(df,srow)
##     x   y    xy    x2    y2
## 1  65  67  4355  4225  4489
## 2  66  68  4488  4356  4624
## 3  67  65  4355  4489  4225
## 4  67  68  4556  4489  4624
## 5  68  72  4896  4624  5184
## 6  69  72  4968  4761  5184
## 7  70  69  4830  4900  4761
## 8  72  71  5112  5184  5041
## 9 544 552 37560 37028 38132
data.frame(srow)
##     srow
## x    544
## y    552
## xy 37560
## x2 37028
## y2 38132
N=length(x)
rxy=(N*srow['xy']-srow['x']*srow['y'])/
  sqrt((N*srow['x2']-srow['x']^2)*(N*srow['y2']-srow['y']^2))

names(rxy)<-'correlation x and y'
rxy
## correlation x and y 
##           0.6030227

Example 1

Show that the coeficient of correlation r is independent of a change of scale and origin of the variables. Also prove that for two independent variables r = O. Show by an example that the converse is not true. State the Iimits between which r lies and give its proof.

Example 2

Let r be the correlation coefficient between two jointly distributed random variables X and Y. Show that \(|r|<1\) and that \(|r|=1\) if and only if X and Y are linearly related.

Example 2

Calculate the coefficient of correlation between X and Y for the following:

X 1 3 4 5 7 8 10

Y 2 6 8 10 14 16 20

x=c(1, 3, 4, 5, 7, 8, 10)
y=c(2, 6, 8, 10, 14, 16, 20)
df=data.frame(x,y,xy=x*y,x2=x*x,y2=y*y)
df
##    x  y  xy  x2  y2
## 1  1  2   2   1   4
## 2  3  6  18   9  36
## 3  4  8  32  16  64
## 4  5 10  50  25 100
## 5  7 14  98  49 196
## 6  8 16 128  64 256
## 7 10 20 200 100 400
srow=apply(df,2,sum)
n=length(x)
r= (n*srow[3]-srow[1]*srow[2])/
sqrt((n*srow[4]-srow[1]^2)*(n*srow[5]-srow[2]^2))

names(r)<-'correlation'
r
## correlation 
##           1

Example

By effecting suitable change of origin and scale, compute product moment correlation coefficient for the following set of 5 observations on (X. Y) :

X: -10 -5 0 5 10

Y: 5 9 7 11 13

x=c(-10, -5, 0, 5, 10) 
y=c(5, 9, 7, 11, 13)
A=0
B=7
u=x-A
v=y-B
df=data.frame(x,y,u,v,uv=u*v,u2=u*u,v2=v*v)
srow=apply(df,2,sum)
df
##     x  y   u  v  uv  u2 v2
## 1 -10  5 -10 -2  20 100  4
## 2  -5  9  -5  2 -10  25  4
## 3   0  7   0  0   0   0  0
## 4   5 11   5  4  20  25 16
## 5  10 13  10  6  60 100 36
srow
##   x   y   u   v  uv  u2  v2 
##   0  45   0  10  90 250  60
n=length(x)
r= (n*srow[5]-srow[3]*srow[4])/
sqrt((n*srow[6]-srow[3]^2)*(n*srow[7]-srow[4]^2))

names(r)<-'correlation'
r
## correlation 
##         0.9
x=c(15.5, 16.5, 17.5, 18.5, 19.5, 20.5) 
y=c(75, 60, 50, 50, 45, 40) 
#Ans. r = 0·94.
data.frame(correlation=round(cor(x,y),2))
##   correlation
## 1       -0.94

Example 5

From the following data, compute the co-efficient of correlation between X and Y.

X Y
No. of items 15 15
Arithmetic mean 25 18
Sum of squared deviations from mean 136 138

Summation of product of deviations of X and Y series from respective arithmetic means = 122

Correlation coefficient is given by

\[ \begin{array}{rcl} r &=& \frac{COV(X,Y)}{\sqrt{VAR(X) \times VAR(Y)}}\\ &=& \frac{\frac{1}{N}\sum (X-\bar{X}) (Y-\bar{Y})} {\sqrt{ \frac{1}{N} \sum (X-\bar{X})^2 \frac{1}{N} \sum (Y-\bar{Y})^2 }}\\ &=& \frac{122}{\sqrt{136 \times 138}}\\ &=& 0.891 \end{array} \]

Example Bivariate correlation coeffient

d=c(4,2,2,0,5,4,6,4,6, 8, 10, 11,4,4,6,8,0,2, 4, 4,0,2, 3,1) 
fij=matrix(d,nrow=6,byrow=TRUE)
fij
##      [,1] [,2] [,3] [,4]
## [1,]    4    2    2    0
## [2,]    5    4    6    4
## [3,]    6    8   10   11
## [4,]    4    4    6    8
## [5,]    0    2    4    4
## [6,]    0    2    3    1
fxj=apply(fij,2,sum)
fxj
## [1] 19 22 31 28
fix=apply(fij,1,sum)
fix
## [1]  8 19 35 22 10  6
xi=seq(15,65,10)
xi
## [1] 15 25 35 45 55 65
yj=18:21
yj
## [1] 18 19 20 21
A=35
B=19
h=10

ui=(xi-A)/h
vj=yj-B

ui
## [1] -2 -1  0  1  2  3
vj
## [1] -1  0  1  2
N=sum(fix)

uifix=ui%*%fix

vjfxj=vj%*%fxj

ui2fix=ui^2%*%fix

vj2fxj=vj^2%*%fxj


uivjfij=t(ui)%*%(fij%*%vj)


res=data.frame(N,uifix,vjfxj,ui2fix,vj2fxj,uivjfij)


r=(N*uivjfij-uifix*vjfxj)/sqrt(
(N*ui2fix-uifix^2)*(N*vj2fxj-vjfxj^2))
res
##     N uifix vjfxj ui2fix vj2fxj uivjfij
## 1 100    25    68    167    162      52
names(r)<-'correlation coefficient'
r
##           [,1]
## [1,] 0.2565744
## attr(,"names")
## [1] "correlation coefficient"

\[ \begin{array}{rcl} r&=&\frac{N*\sum_i\sum_j x_iy_j f_{ij} - \left( \sum_ix_i f_{ix}\right) \left( \sum_jy_j f_{xj}\right) }{\sqrt{ \left[ N \sum_ix_i^2 f_{ix}-\left(\sum_ix_i f_{ix} \right)^2\right] \left[ N \sum_jy_j^2 f_{xj}-\left(\sum_jy_j f_{xj} \right)^2\right] }}\\ &=& \frac{100 \times 52}{\sqrt{ \left[100\times 167 -25^2\right] \left[100\times 162 -68^2\right] }}\\ &=& 0.257 \end{array} \]

  • [N] Size of the sample
  • [\(x_i\)] mid values of X class
  • [\(y_j\)] mid values of Y class
  • [\(f_{ij}\)] frequency ith class of X and jth class of Y
  • [\(f_{ix}\)] marginal frequency of ith class of X(row sum)
  • [\(f_{xj}\)] marginal frequency of jth class of Y(column sum)

Merits and Demerits of Pearson Product Moment Correlation

Merits - Accurate - Invariant in Location change - Bounded

De Merits - Computationlly intensive - Sensitive to changes in the data

Spearman Rank Correlation Coefficient

  • Rank Correlation is computed using the ranks instead of actual values.

Raw Formula for Rank Correlation

\[ rho = \frac{N\sum R_x \, R_y-\sum R_x \, \sum R_y}{ \sqrt{\left[ N\, \sum R_x^2 - (\sum R_x)^2 \right] \left[ N\, \sum R_y^2 - (\sum R_y)^2 \right] }} \]

where \(R_x\) and \(R_y\) are ranks.

Simple Formula for Non-Repeated Ranks

\[ rho = 1- \frac{6\times \sum_i d_i^2}{n(n^2-1)} \]

where \(d_i\) is the difference between the ranks of X and Y values of the ith observation.

Example 6

Find the rank correlatio of

x=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
y=c(1, 10, 3, 4, 5, 7, 2, 6, 8, 11, 15, 9, 14, 12, 16, 13)
n=length(x)

df=data.frame(x,y,di2=(x-y)^2)
srow=apply(df,2,sum)
srow
##   x   y di2 
## 136 136 136
rho=1-(6*sum(df$di2)/(n*(n^2-1)))
  
print(rho)
## [1] 0.8

Example 6

## ex2 rank correlation - Repeated ranks
X=c( 68, 64, 75, 50, 64, 80, 75, 40, 55, 64) 
Y=c( 62, 58, 68, 45, 81, 60, 68, 48, 50, 70)

round(cor(rank(X),rank(Y)),3)
## [1] 0.556
x=rank(X)
y=rank(Y)
n=length(x)

df=data.frame(x,y,di2=(x-y)^2)
srow=apply(df,2,sum)
df
##       x    y di2
## 1   7.0  6.0   1
## 2   5.0  4.0   1
## 3   8.5  7.5   1
## 4   2.0  1.0   1
## 5   5.0 10.0  25
## 6  10.0  5.0  25
## 7   8.5  7.5   1
## 8   1.0  2.0   1
## 9   3.0  3.0   0
## 10  5.0  9.0  16
srow
##   x   y di2 
##  55  55  72
rho=1-(6*sum(df$di2)/(n*(n^2-1)))
  
print(rho)
## [1] 0.5636364

Example 7

X=c(10, 15, 12, 17, 13, 16, 24,14, 22, 20) 
Y=c(30, 42, 45, 46, 33, 34, 40, 35, 39, 38)
u=X-16
v=Y-34
df=data.frame(X,Y,u,v,u*v,u^2,v^2)
df
##     X  Y  u  v u...v u.2 v.2
## 1  10 30 -6 -4    24  36  16
## 2  15 42 -1  8    -8   1  64
## 3  12 45 -4 11   -44  16 121
## 4  17 46  1 12    12   1 144
## 5  13 33 -3 -1     3   9   1
## 6  16 34  0  0     0   0   0
## 7  24 40  8  6    48  64  36
## 8  14 35 -2  1    -2   4   1
## 9  22 39  6  5    30  36  25
## 10 20 38  4  4    16  16  16
apply(df,2,sum)
##     X     Y     u     v u...v   u.2   v.2 
##   163   382     3    42    79   183   424

Regression