Two variates X and Y are said be correlated if the increase or decrease of one variable affects the increase or decrease in another variable.
X and Y are said to be positively correlated if Y increases if X increases and Y decreases if X decreases and vice versa. ie their both increases or both decreases.
X and Y are said to be positively correlated if Y decreases if X increases and Y increases if X decreases and vice versa. ie their both increases or both decreases.
If there is no change in Y with the increase of decrease in X
If \((x_i,y_i),i=1,2,\ldots,N\) be n paired data of X and Y, by taking one variable on X axis and the other on Y axis, the point plot is a scatter diagram.
Merits: 1. It is simple to draw. 2. Give quick visualization. 3. Good starting point.
Demerits: 1. It is rough and not accurate.
x=c(84, 66, 68, 129, 90, 91, 74, 76, 70, 85, 122, 74, 104, 97, 104, 91)
y=c(86, 82, 92, 151, 96, 86, 99, 88, 87, 98, 125, 85, 127, 108, 104, 102)
plot(x,y,main='positive correlated data',pch=16)
abline(h=mean(y),col=2)
abline(v=mean(x),col=4)
points(mean(x),mean(y),pch=14)
The coefficient of correlation \(r_{xy}\) is \[ r_{xy} = \frac{COV(X,Y)}{\sqrt{Var(X)Var(Y)}} \]
A simple to compute formula
\[ r_{xy} = \frac{N\sum X\, Y-\sum X \, \sum Y}{ \sqrt{\left[ N\, \sum X^2 - (\sum X)^2 \right] \left[ N\, \sum Y^2 - (\sum Y)^2 \right] }} \]
x 84 66 68 129 90 91 74 76 70 85 122 74 104 97 104 91
y 86 82 92 151 96 86 99 88 87 98 125 85 127 108 104 102
cov_xy=sum((x-mean(x))*(y-mean(y)))/length(x)
data.frame(mean(x),mean(y),cov_xy)
## mean.x. mean.y. cov_xy
## 1 89.0625 101 291.75
# sample correlation coefficient divides (n-1) insteads of n
cov(x,y)
## [1] 311.2
## to find simple correlation
x=c(84, 66, 68, 129, 90, 91, 74, 76, 70, 85, 122, 74, 104, 97, 104, 91)
y=c(86, 82, 92, 151, 96, 86, 99, 88, 87, 98, 125, 85, 127, 108, 104, 102)
##sample size
N=length(x)
## shift the origin A=90 and B=110
A=90
B=110
u=x-A
v=y-B
uv=u*v
u2=u*u
v2=v*v
df=data.frame(x,y,u,v,uv,u2,v2)
srow=apply(df,2,sum)
df = rbind(df,srow)
df
## x y u v uv u2 v2
## 1 84 86 -6 -24 144 36 576
## 2 66 82 -24 -28 672 576 784
## 3 68 92 -22 -18 396 484 324
## 4 129 151 39 41 1599 1521 1681
## 5 90 96 0 -14 0 0 196
## 6 91 86 1 -24 -24 1 576
## 7 74 99 -16 -11 176 256 121
## 8 76 88 -14 -22 308 196 484
## 9 70 87 -20 -23 460 400 529
## 10 85 98 -5 -12 60 25 144
## 11 122 125 32 15 480 1024 225
## 12 74 85 -16 -25 400 256 625
## 13 104 127 14 17 238 196 289
## 14 97 108 7 -2 -14 49 4
## 15 104 104 14 -6 -84 196 36
## 16 91 102 1 -8 -8 1 64
## 17 1425 1616 -15 -144 4803 5217 6658
#product moment formula
r_uv=(N*sum(uv)-sum(u)*sum(v))/
(sqrt(N*sum(u2)-sum(u)^2)*sqrt(sum(N*v2)-sum(v)^2))
data.frame(r_uv=round(r_uv,3))
## r_uv
## 1 0.884
#pearsons formula
pearson_r=cov(u,v)/sqrt(var(u)*var(v))
data.frame(pearson_r=round(pearson_r,3))
## pearson_r
## 1 0.884
The variables have strong positive correlation.
f<-function(x) x
curve(f(x),-1,1,lwd=2,col=2,main='linear correlation')
text(0.65,0.5,'+ve r',col=2)
curve(-f(x),-1,1,lwd=2,col=4,add = TRUE,lty=2)
text(0.65,-0.50,'-ve r',col=4)
\[ r_{xy}=\frac{b\times d}{|b|\times |d|}\, r_{uv} \]
x=c(-3, -2, -1, 1, 2,3)
y=c(9, 4, 1, 1, 4, 9)
df=data.frame(x,y,xy=x*y,x2=x*x,y2=y*y)
df
## x y xy x2 y2
## 1 -3 9 -27 9 81
## 2 -2 4 -8 4 16
## 3 -1 1 -1 1 1
## 4 1 1 1 1 1
## 5 2 4 8 4 16
## 6 3 9 27 9 81
apply(df,2,sum)
## x y xy x2 y2
## 0 28 0 28 196
print(c('correlation coefficient is 0 though y=x^2'))
## [1] "correlation coefficient is 0 though y=x^2"
rm(list=ls())
Calculate the correlauon coefficient for ihe following heights (in inches) of father’s (X) and his sons (Y) :
X : 65 66 67 67 68 69 70 72
Y: 61 68 6S 68 12 12 69 71
Method-1
x = c(65, 66, 67, 67, 68, 69, 70, 72)
y = c(67, 68, 65, 68, 72, 72, 69, 71)
df=data.frame(x,y,xy=x*y,x2=x*x,y2=y*y)
srow=apply(df,2,sum)
rbind(df,srow)
## x y xy x2 y2
## 1 65 67 4355 4225 4489
## 2 66 68 4488 4356 4624
## 3 67 65 4355 4489 4225
## 4 67 68 4556 4489 4624
## 5 68 72 4896 4624 5184
## 6 69 72 4968 4761 5184
## 7 70 69 4830 4900 4761
## 8 72 71 5112 5184 5041
## 9 544 552 37560 37028 38132
data.frame(srow)
## srow
## x 544
## y 552
## xy 37560
## x2 37028
## y2 38132
N=length(x)
rxy=(N*srow['xy']-srow['x']*srow['y'])/
sqrt((N*srow['x2']-srow['x']^2)*(N*srow['y2']-srow['y']^2))
names(rxy)<-'correlation x and y'
rxy
## correlation x and y
## 0.6030227
Show that the coeficient of correlation r is independent of a change of scale and origin of the variables. Also prove that for two independent variables r = O. Show by an example that the converse is not true. State the Iimits between which r lies and give its proof.
Let r be the correlation coefficient between two jointly distributed random variables X and Y. Show that \(|r|<1\) and that \(|r|=1\) if and only if X and Y are linearly related.
Calculate the coefficient of correlation between X and Y for the following:
X 1 3 4 5 7 8 10
Y 2 6 8 10 14 16 20
x=c(1, 3, 4, 5, 7, 8, 10)
y=c(2, 6, 8, 10, 14, 16, 20)
df=data.frame(x,y,xy=x*y,x2=x*x,y2=y*y)
df
## x y xy x2 y2
## 1 1 2 2 1 4
## 2 3 6 18 9 36
## 3 4 8 32 16 64
## 4 5 10 50 25 100
## 5 7 14 98 49 196
## 6 8 16 128 64 256
## 7 10 20 200 100 400
srow=apply(df,2,sum)
n=length(x)
r= (n*srow[3]-srow[1]*srow[2])/
sqrt((n*srow[4]-srow[1]^2)*(n*srow[5]-srow[2]^2))
names(r)<-'correlation'
r
## correlation
## 1
By effecting suitable change of origin and scale, compute product moment correlation coefficient for the following set of 5 observations on (X. Y) :
X: -10 -5 0 5 10
Y: 5 9 7 11 13
x=c(-10, -5, 0, 5, 10)
y=c(5, 9, 7, 11, 13)
A=0
B=7
u=x-A
v=y-B
df=data.frame(x,y,u,v,uv=u*v,u2=u*u,v2=v*v)
srow=apply(df,2,sum)
df
## x y u v uv u2 v2
## 1 -10 5 -10 -2 20 100 4
## 2 -5 9 -5 2 -10 25 4
## 3 0 7 0 0 0 0 0
## 4 5 11 5 4 20 25 16
## 5 10 13 10 6 60 100 36
srow
## x y u v uv u2 v2
## 0 45 0 10 90 250 60
n=length(x)
r= (n*srow[5]-srow[3]*srow[4])/
sqrt((n*srow[6]-srow[3]^2)*(n*srow[7]-srow[4]^2))
names(r)<-'correlation'
r
## correlation
## 0.9
x=c(15.5, 16.5, 17.5, 18.5, 19.5, 20.5)
y=c(75, 60, 50, 50, 45, 40)
#Ans. r = 0·94.
data.frame(correlation=round(cor(x,y),2))
## correlation
## 1 -0.94
From the following data, compute the co-efficient of correlation between X and Y.
| X | Y | |
|---|---|---|
| No. of items | 15 | 15 |
| Arithmetic mean | 25 | 18 |
| Sum of squared deviations from mean | 136 | 138 |
Summation of product of deviations of X and Y series from respective arithmetic means = 122
Correlation coefficient is given by
\[ \begin{array}{rcl} r &=& \frac{COV(X,Y)}{\sqrt{VAR(X) \times VAR(Y)}}\\ &=& \frac{\frac{1}{N}\sum (X-\bar{X}) (Y-\bar{Y})} {\sqrt{ \frac{1}{N} \sum (X-\bar{X})^2 \frac{1}{N} \sum (Y-\bar{Y})^2 }}\\ &=& \frac{122}{\sqrt{136 \times 138}}\\ &=& 0.891 \end{array} \]
d=c(4,2,2,0,5,4,6,4,6, 8, 10, 11,4,4,6,8,0,2, 4, 4,0,2, 3,1)
fij=matrix(d,nrow=6,byrow=TRUE)
fij
## [,1] [,2] [,3] [,4]
## [1,] 4 2 2 0
## [2,] 5 4 6 4
## [3,] 6 8 10 11
## [4,] 4 4 6 8
## [5,] 0 2 4 4
## [6,] 0 2 3 1
fxj=apply(fij,2,sum)
fxj
## [1] 19 22 31 28
fix=apply(fij,1,sum)
fix
## [1] 8 19 35 22 10 6
xi=seq(15,65,10)
xi
## [1] 15 25 35 45 55 65
yj=18:21
yj
## [1] 18 19 20 21
A=35
B=19
h=10
ui=(xi-A)/h
vj=yj-B
ui
## [1] -2 -1 0 1 2 3
vj
## [1] -1 0 1 2
N=sum(fix)
uifix=ui%*%fix
vjfxj=vj%*%fxj
ui2fix=ui^2%*%fix
vj2fxj=vj^2%*%fxj
uivjfij=t(ui)%*%(fij%*%vj)
res=data.frame(N,uifix,vjfxj,ui2fix,vj2fxj,uivjfij)
r=(N*uivjfij-uifix*vjfxj)/sqrt(
(N*ui2fix-uifix^2)*(N*vj2fxj-vjfxj^2))
res
## N uifix vjfxj ui2fix vj2fxj uivjfij
## 1 100 25 68 167 162 52
names(r)<-'correlation coefficient'
r
## [,1]
## [1,] 0.2565744
## attr(,"names")
## [1] "correlation coefficient"
\[ \begin{array}{rcl} r&=&\frac{N*\sum_i\sum_j x_iy_j f_{ij} - \left( \sum_ix_i f_{ix}\right) \left( \sum_jy_j f_{xj}\right) }{\sqrt{ \left[ N \sum_ix_i^2 f_{ix}-\left(\sum_ix_i f_{ix} \right)^2\right] \left[ N \sum_jy_j^2 f_{xj}-\left(\sum_jy_j f_{xj} \right)^2\right] }}\\ &=& \frac{100 \times 52}{\sqrt{ \left[100\times 167 -25^2\right] \left[100\times 162 -68^2\right] }}\\ &=& 0.257 \end{array} \]
Merits - Accurate - Invariant in Location change - Bounded
De Merits - Computationlly intensive - Sensitive to changes in the data
\[ rho = \frac{N\sum R_x \, R_y-\sum R_x \, \sum R_y}{ \sqrt{\left[ N\, \sum R_x^2 - (\sum R_x)^2 \right] \left[ N\, \sum R_y^2 - (\sum R_y)^2 \right] }} \]
where \(R_x\) and \(R_y\) are ranks.
\[ rho = 1- \frac{6\times \sum_i d_i^2}{n(n^2-1)} \]
where \(d_i\) is the difference between the ranks of X and Y values of the ith observation.
Find the rank correlatio of
x=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
y=c(1, 10, 3, 4, 5, 7, 2, 6, 8, 11, 15, 9, 14, 12, 16, 13)
n=length(x)
df=data.frame(x,y,di2=(x-y)^2)
srow=apply(df,2,sum)
srow
## x y di2
## 136 136 136
rho=1-(6*sum(df$di2)/(n*(n^2-1)))
print(rho)
## [1] 0.8
## ex2 rank correlation - Repeated ranks
X=c( 68, 64, 75, 50, 64, 80, 75, 40, 55, 64)
Y=c( 62, 58, 68, 45, 81, 60, 68, 48, 50, 70)
round(cor(rank(X),rank(Y)),3)
## [1] 0.556
x=rank(X)
y=rank(Y)
n=length(x)
df=data.frame(x,y,di2=(x-y)^2)
srow=apply(df,2,sum)
df
## x y di2
## 1 7.0 6.0 1
## 2 5.0 4.0 1
## 3 8.5 7.5 1
## 4 2.0 1.0 1
## 5 5.0 10.0 25
## 6 10.0 5.0 25
## 7 8.5 7.5 1
## 8 1.0 2.0 1
## 9 3.0 3.0 0
## 10 5.0 9.0 16
srow
## x y di2
## 55 55 72
rho=1-(6*sum(df$di2)/(n*(n^2-1)))
print(rho)
## [1] 0.5636364
X=c(10, 15, 12, 17, 13, 16, 24,14, 22, 20)
Y=c(30, 42, 45, 46, 33, 34, 40, 35, 39, 38)
u=X-16
v=Y-34
df=data.frame(X,Y,u,v,u*v,u^2,v^2)
df
## X Y u v u...v u.2 v.2
## 1 10 30 -6 -4 24 36 16
## 2 15 42 -1 8 -8 1 64
## 3 12 45 -4 11 -44 16 121
## 4 17 46 1 12 12 1 144
## 5 13 33 -3 -1 3 9 1
## 6 16 34 0 0 0 0 0
## 7 24 40 8 6 48 64 36
## 8 14 35 -2 1 -2 4 1
## 9 22 39 6 5 30 36 25
## 10 20 38 4 4 16 16 16
apply(df,2,sum)
## X Y u v u...v u.2 v.2
## 163 382 3 42 79 183 424