Two variates X and Y are said to be correlated if an increase or decrease in one variable is accompanied by an increase or decrease in the other.
X and Y are said to be positively correlated if Y increases when X increases and Y decreases when X decreases, i.e., both variables move in the same direction.
X and Y are said to be negatively correlated if Y decreases when X increases and Y increases when X decreases, i.e., the variables move in opposite directions.
If there is no change in Y with an increase or decrease in X, the variables are said to be uncorrelated.
If \((x_i,y_i),i=1,2,\ldots,N\) are N paired observations of X and Y, then plotting one variable on the X axis and the other on the Y axis gives a scatter diagram.
Merits: 1. It is simple to draw. 2. It gives a quick visualization of the relationship. 3. It is a good starting point for further analysis.
Demerits: 1. It is rough and not accurate.
x=c(84, 66, 68, 129, 90, 91, 74, 76, 70, 85, 122, 74, 104, 97, 104, 91)
y=c(86, 82, 92, 151, 96, 86, 99, 88, 87, 98, 125, 85, 127, 108, 104, 102)
plot(x,y,main='positively correlated data',pch=16)
abline(h=mean(y),col=2)
abline(v=mean(x),col=4)
points(mean(x),mean(y),pch=14)
The coefficient of correlation \(r_{xy}\) is \[ r_{xy} = \frac{COV(X,Y)}{\sqrt{Var(X)Var(Y)}} \]
A simpler formula for computation is
\[ r_{xy} = \frac{N\sum X\, Y-\sum X \, \sum Y}{ \sqrt{\left[ N\, \sum X^2 - (\sum X)^2 \right] \left[ N\, \sum Y^2 - (\sum Y)^2 \right] }} \]
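As a quick check, this formula can be applied directly to the scatter-plot data above and compared with R's built-in cor() (a minimal sketch):

N=length(x)
r=(N*sum(x*y)-sum(x)*sum(y))/
sqrt((N*sum(x^2)-sum(x)^2)*(N*sum(y^2)-sum(y)^2))
# both values agree (about 0.884)
round(c(formula=r,builtin=cor(x,y)),3)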
x 84 66 68 129 90 91 74 76 70 85 122 74 104 97 104 91
y 86 82 92 151 96 86 99 88 87 98 125 85 127 108 104 102
cov_xy=sum((x-mean(x))*(y-mean(y)))/length(x)
data.frame(mean(x),mean(y),cov_xy)
## mean.x. mean.y. cov_xy
## 1 89.0625 101 291.75
# R's cov() computes the sample covariance, dividing by (n-1) instead of n
cov(x,y)
## [1] 311.2
## to find simple correlation
x=c(84, 66, 68, 129, 90, 91, 74, 76, 70, 85, 122, 74, 104, 97, 104, 91)
y=c(86, 82, 92, 151, 96, 86, 99, 88, 87, 98, 125, 85, 127, 108, 104, 102)
##sample size
N=length(x)
## shift the origin A=90 and B=110
A=90
B=110
u=x-A
v=y-B
uv=u*v
u2=u*u
v2=v*v
df=data.frame(x,y,u,v,uv,u2,v2)
srow=apply(df,2,sum)
df = rbind(df,srow)
df
## x y u v uv u2 v2
## 1 84 86 -6 -24 144 36 576
## 2 66 82 -24 -28 672 576 784
## 3 68 92 -22 -18 396 484 324
## 4 129 151 39 41 1599 1521 1681
## 5 90 96 0 -14 0 0 196
## 6 91 86 1 -24 -24 1 576
## 7 74 99 -16 -11 176 256 121
## 8 76 88 -14 -22 308 196 484
## 9 70 87 -20 -23 460 400 529
## 10 85 98 -5 -12 60 25 144
## 11 122 125 32 15 480 1024 225
## 12 74 85 -16 -25 400 256 625
## 13 104 127 14 17 238 196 289
## 14 97 108 7 -2 -14 49 4
## 15 104 104 14 -6 -84 196 36
## 16 91 102 1 -8 -8 1 64
## 17 1425 1616 -15 -144 4803 5217 6658
#product moment formula
r_uv=(N*sum(uv)-sum(u)*sum(v))/
(sqrt(N*sum(u2)-sum(u)^2)*sqrt(N*sum(v2)-sum(v)^2))
data.frame(r_uv=round(r_uv,3))
## r_uv
## 1 0.884
# Pearson's formula (covariance over the product of standard deviations)
pearson_r=cov(u,v)/sqrt(var(u)*var(v))
data.frame(pearson_r=round(pearson_r,3))
## pearson_r
## 1 0.884
The variables have a strong positive correlation.
f<-function(x) x
curve(f(x),-1,1,lwd=2,col=2,main='linear correlation')
text(0.65,0.5,'+ve r',col=2)
curve(-f(x),-1,1,lwd=2,col=4,add = TRUE,lty=2)
text(0.65,-0.50,'-ve r',col=4)
If u = (x-a)/b and v = (y-c)/d are linear transformations of X and Y, the correlation coefficient is unchanged except possibly for its sign:
\[ r_{xy}=\frac{b\times d}{|b|\times |d|}\, r_{uv} \]
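A quick numerical check of this invariance (a small sketch with arbitrarily chosen shifts and scales; note the sign flip when a scale factor is negative):

x1=c(84, 66, 68, 129, 90, 91)
y1=c(86, 82, 92, 151, 96, 86)
# same r after a change of origin/scale; negating x flips the sign
round(c(cor(x1,y1),cor((x1-90)/10,(y1-110)/5),cor(-x1,y1)),3)

The next example shows that r captures only linear association: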
x=c(-3, -2, -1, 1, 2,3)
y=c(9, 4, 1, 1, 4, 9)
df=data.frame(x,y,xy=x*y,x2=x*x,y2=y*y)
df
## x y xy x2 y2
## 1 -3 9 -27 9 81
## 2 -2 4 -8 4 16
## 3 -1 1 -1 1 1
## 4 1 1 1 1 1
## 5 2 4 8 4 16
## 6 3 9 27 9 81
apply(df,2,sum)
## x y xy x2 y2
## 0 28 0 28 196
print('correlation coefficient is 0 though y=x^2')
## [1] "correlation coefficient is 0 though y=x^2"
Thus a zero correlation coefficient does not rule out a (non-linear) relationship; it only means there is no linear relationship.
rm(list=ls())
Calculate the correlation coefficient for the following heights (in inches) of fathers (X) and their sons (Y):
X : 65 66 67 67 68 69 70 72
Y : 67 68 65 68 72 72 69 71
Method-1
x = c(65, 66, 67, 67, 68, 69, 70, 72)
y = c(67, 68, 65, 68, 72, 72, 69, 71)
df=data.frame(x,y,xy=x*y,x2=x*x,y2=y*y)
srow=apply(df,2,sum)
rbind(df,srow)
## x y xy x2 y2
## 1 65 67 4355 4225 4489
## 2 66 68 4488 4356 4624
## 3 67 65 4355 4489 4225
## 4 67 68 4556 4489 4624
## 5 68 72 4896 4624 5184
## 6 69 72 4968 4761 5184
## 7 70 69 4830 4900 4761
## 8 72 71 5112 5184 5041
## 9 544 552 37560 37028 38132
data.frame(srow)
## srow
## x 544
## y 552
## xy 37560
## x2 37028
## y2 38132
N=length(x)
rxy=(N*srow['xy']-srow['x']*srow['y'])/
sqrt((N*srow['x2']-srow['x']^2)*(N*srow['y2']-srow['y']^2))
names(rxy)<-'correlation x and y'
rxy
## correlation x and y
## 0.6030227
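Method-2 (a minimal alternative): R's built-in function reproduces the same value.

# Method-2: built-in correlation
cor(x,y)  # gives 0.6030227, matching Method-1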
Show that the coefficient of correlation r is independent of a change of scale and origin of the variables. Also prove that for two independent variables r = 0. Show by an example that the converse is not true. State the limits between which r lies and give a proof.
Let r be the correlation coefficient between two jointly distributed random variables X and Y. Show that \(|r|<1\) and that \(|r|=1\) if and only if X and Y are linearly related.
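A sketch of the standard argument, using the Cauchy-Schwarz inequality:
\[ r^2 = \frac{\left[\sum_i (x_i-\bar{x})(y_i-\bar{y})\right]^2}{\sum_i (x_i-\bar{x})^2 \, \sum_i (y_i-\bar{y})^2} \le 1, \]
so \(|r|\le 1\), with equality if and only if the deviations \(y_i-\bar{y}\) are proportional to \(x_i-\bar{x}\), i.e., Y is a linear function of X.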
Calculate the coefficient of correlation between X and Y for the following:
X 1 3 4 5 7 8 10
Y 2 6 8 10 14 16 20
x=c(1, 3, 4, 5, 7, 8, 10)
y=c(2, 6, 8, 10, 14, 16, 20)
df=data.frame(x,y,xy=x*y,x2=x*x,y2=y*y)
df
## x y xy x2 y2
## 1 1 2 2 1 4
## 2 3 6 18 9 36
## 3 4 8 32 16 64
## 4 5 10 50 25 100
## 5 7 14 98 49 196
## 6 8 16 128 64 256
## 7 10 20 200 100 400
srow=apply(df,2,sum)
n=length(x)
r= (n*srow[3]-srow[1]*srow[2])/
sqrt((n*srow[4]-srow[1]^2)*(n*srow[5]-srow[2]^2))
names(r)<-'correlation'
r
## correlation
## 1
Here r = 1 exactly, since Y = 2X is a perfect linear relationship.
By effecting a suitable change of origin and scale, compute the product moment correlation coefficient for the following set of 5 observations on (X, Y):
X: -10 -5 0 5 10
Y: 5 9 7 11 13
x=c(-10, -5, 0, 5, 10)
y=c(5, 9, 7, 11, 13)
A=0
B=7
u=x-A
v=y-B
df=data.frame(x,y,u,v,uv=u*v,u2=u*u,v2=v*v)
srow=apply(df,2,sum)
df
## x y u v uv u2 v2
## 1 -10 5 -10 -2 20 100 4
## 2 -5 9 -5 2 -10 25 4
## 3 0 7 0 0 0 0 0
## 4 5 11 5 4 20 25 16
## 5 10 13 10 6 60 100 36
srow
## x y u v uv u2 v2
## 0 45 0 10 90 250 60
n=length(x)
r= (n*srow[5]-srow[3]*srow[4])/
sqrt((n*srow[6]-srow[3]^2)*(n*srow[7]-srow[4]^2))
names(r)<-'correlation'
r
## correlation
## 0.9
Calculate the coefficient of correlation for the following data:
x=c(15.5, 16.5, 17.5, 18.5, 19.5, 20.5)
y=c(75, 60, 50, 50, 45, 40)
# Ans. r = -0.94 (note the negative sign: y decreases as x increases)
data.frame(correlation=round(cor(x,y),2))
## correlation
## 1 -0.94
From the following data, compute the co-efficient of correlation between X and Y.
| | X | Y |
|---|---|---|
| No. of items | 15 | 15 |
| Arithmetic mean | 25 | 18 |
| Sum of squared deviations from mean | 136 | 138 |
Summation of product of deviations of X and Y series from respective arithmetic means = 122
Correlation coefficient is given by
\[ \begin{array}{rcl} r &=& \frac{COV(X,Y)}{\sqrt{VAR(X) \times VAR(Y)}}\\ &=& \frac{\frac{1}{N}\sum (X-\bar{X}) (Y-\bar{Y})} {\sqrt{ \frac{1}{N} \sum (X-\bar{X})^2 \frac{1}{N} \sum (Y-\bar{Y})^2 }}\\ &=& \frac{122}{\sqrt{136 \times 138}}\\ &=& 0.891 \end{array} \]
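The same arithmetic as a one-line check in R:

round(122/sqrt(136*138),3)  # gives 0.891, matching the hand computation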
The following computes r for a bivariate frequency distribution \(f_{ij}\), with class marks \(x_i\) for X and values \(y_j\) for Y:
d=c(4,2,2,0, 5,4,6,4, 6,8,10,11, 4,4,6,8, 0,2,4,4, 0,2,3,1)
fij=matrix(d,nrow=6,byrow=TRUE)
fij
## [,1] [,2] [,3] [,4]
## [1,] 4 2 2 0
## [2,] 5 4 6 4
## [3,] 6 8 10 11
## [4,] 4 4 6 8
## [5,] 0 2 4 4
## [6,] 0 2 3 1
fxj=apply(fij,2,sum)
fxj
## [1] 19 22 31 28
fix=apply(fij,1,sum)
fix
## [1] 8 19 35 22 10 6
xi=seq(15,65,10)
xi
## [1] 15 25 35 45 55 65
yj=18:21
yj
## [1] 18 19 20 21
A=35
B=19
h=10
ui=(xi-A)/h
vj=yj-B
ui
## [1] -2 -1 0 1 2 3
vj
## [1] -1 0 1 2
N=sum(fix)
uifix=ui%*%fix
vjfxj=vj%*%fxj
ui2fix=ui^2%*%fix
vj2fxj=vj^2%*%fxj
uivjfij=t(ui)%*%(fij%*%vj)
res=data.frame(N,uifix,vjfxj,ui2fix,vj2fxj,uivjfij)
r=(N*uivjfij-uifix*vjfxj)/sqrt(
(N*ui2fix-uifix^2)*(N*vj2fxj-vjfxj^2))
res
## N uifix vjfxj ui2fix vj2fxj uivjfij
## 1 100 25 68 167 162 52
names(r)<-'correlation coefficient'
r
## [,1]
## [1,] 0.2565744
## attr(,"names")
## [1] "correlation coefficient"
\[ \begin{array}{rcl} r&=&\frac{N\sum_i\sum_j u_i v_j f_{ij} - \left( \sum_i u_i f_{ix}\right) \left( \sum_j v_j f_{xj}\right) }{\sqrt{ \left[ N \sum_i u_i^2 f_{ix}-\left(\sum_i u_i f_{ix} \right)^2\right] \left[ N \sum_j v_j^2 f_{xj}-\left(\sum_j v_j f_{xj} \right)^2\right] }}\\ &=& \frac{100 \times 52 - 25 \times 68}{\sqrt{ \left[100\times 167 -25^2\right] \left[100\times 162 -68^2\right] }}\\ &=& 0.257 \end{array} \]
Merits: 1. Accurate. 2. Invariant under change of origin and scale. 3. Bounded between -1 and +1.
Demerits: 1. Computationally intensive. 2. Sensitive to extreme values in the data.
Spearman's rank correlation coefficient is Pearson's formula applied to the ranks:
\[ \rho = \frac{N\sum R_x \, R_y-\sum R_x \, \sum R_y}{ \sqrt{\left[ N\, \sum R_x^2 - (\sum R_x)^2 \right] \left[ N\, \sum R_y^2 - (\sum R_y)^2 \right] }} \]
where \(R_x\) and \(R_y\) are the ranks. For untied ranks this simplifies to
\[ \rho = 1- \frac{6 \sum_i d_i^2}{n(n^2-1)} \]
where \(d_i\) is the difference between the ranks of the X and Y values of the \(i\)th observation.
Find the rank correlation of the following data:
x=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
y=c(1, 10, 3, 4, 5, 7, 2, 6, 8, 11, 15, 9, 14, 12, 16, 13)
n=length(x)
df=data.frame(x,y,di2=(x-y)^2)
srow=apply(df,2,sum)
srow
## x y di2
## 136 136 136
rho=1-(6*sum(df$di2)/(n*(n^2-1)))
print(rho)
## [1] 0.8
## Example 2: rank correlation with repeated (tied) ranks
X=c( 68, 64, 75, 50, 64, 80, 75, 40, 55, 64)
Y=c( 62, 58, 68, 45, 81, 60, 68, 48, 50, 70)
round(cor(rank(X),rank(Y)),3)
## [1] 0.556
x=rank(X)
y=rank(Y)
n=length(x)
df=data.frame(x,y,di2=(x-y)^2)
srow=apply(df,2,sum)
df
## x y di2
## 1 7.0 6.0 1
## 2 5.0 4.0 1
## 3 8.5 7.5 1
## 4 2.0 1.0 1
## 5 5.0 10.0 25
## 6 10.0 5.0 25
## 7 8.5 7.5 1
## 8 1.0 2.0 1
## 9 3.0 3.0 0
## 10 5.0 9.0 16
srow
## x y di2
## 55 55 72
rho=1-(6*sum(df$di2)/(n*(n^2-1)))
print(rho)
## [1] 0.5636364
This differs slightly from the 0.556 obtained above because the simplified formula assumes untied ranks; applying Pearson's formula directly to the mid-ranks, as cor(rank(X), rank(Y)) does, handles the ties correctly.
Another example, shifting the origins to A = 16 and B = 34:
X=c(10, 15, 12, 17, 13, 16, 24, 14, 22, 20)
Y=c(30, 42, 45, 46, 33, 34, 40, 35, 39, 38)
u=X-16
v=Y-34
df=data.frame(X,Y,u,v,u*v,u^2,v^2)
df
## X Y u v u...v u.2 v.2
## 1 10 30 -6 -4 24 36 16
## 2 15 42 -1 8 -8 1 64
## 3 12 45 -4 11 -44 16 121
## 4 17 46 1 12 12 1 144
## 5 13 33 -3 -1 3 9 1
## 6 16 34 0 0 0 0 0
## 7 24 40 8 6 48 64 36
## 8 14 35 -2 1 -2 4 1
## 9 22 39 6 5 30 36 25
## 10 20 38 4 4 16 16 16
apply(df,2,sum)
## X Y u v u...v u.2 v.2
## 163 382 3 42 79 183 424
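Completing the computation from these sums (a short sketch; the result agrees with cor(X, Y), since r is invariant under the change of origin):

n=length(X)
r=(n*sum(u*v)-sum(u)*sum(v))/
sqrt((n*sum(u^2)-sum(u)^2)*(n*sum(v^2)-sum(v)^2))
round(c(r_uv=r,direct=cor(X,Y)),3)  # both are about 0.313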
The word regression, meaning "going backwards", was first introduced by the British biometrician Sir Francis Galton (1822-1911) in his study of the heights of offspring in connection with inheritance.
Regression analysis is a mathematical measure of the average relationship between two or more variables, expressed in the original units of the data.
In regression there is one variable Y to be predicted from the values of another variable X. Y is called the dependent variable and X the independent variable.
If there is more than one independent variable, the regression is called multiple regression.
Y is also called the regressed, explained, or target variable.
If the data points cluster around some curve, that curve is called the curve of regression.
If the curve is a polynomial, or otherwise non-linear, the regression is called curvilinear regression: \[ Y=b_0+b_1 x+b_2 x^2+\cdots \]
Power regression (the gas equation): \[ Y=b_0 x^{b_1} \]
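A power curve can be fitted by linearization, since \(\log Y = \log b_0 + b_1 \log X\); a minimal sketch with synthetic data (the constants 2.5 and 1.4 are assumptions for illustration only):

X=c(1,2,3,4,5)
Y=2.5*X^1.4                 # synthetic data following Y = b0 * X^b1
fit=lm(log(Y)~log(X))
c(b0=exp(coef(fit)[[1]]),b1=coef(fit)[[2]])  # recovers b0 = 2.5, b1 = 1.4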
### Simple Linear Regression

- Let \((x_i,y_i),i=1,2,\ldots,N\) be the data
- To fit the model \(Y=b_0+b_1\times x\)
- Write the normal equations
\[ \begin{array}{rcl} \sum y_i &=& N\, b_0 +b_1 \sum x_i \\ \sum x_i y_i &=& b_0 \sum x_i+b_1 \sum x_i^2 \\ \end{array} \]
### Example 1
Suppose the observations on X and Y are given as :
| X | 59 | 65 | 45 | 52 | 60 | 62 | 70 |
|---|---|---|---|---|---|---|---|
| Y | 75 | 70 | 55 | 65 | 60 | 69 | 80 |
where N = 7 students, Y = Marks in Maths, and X = Marks in Economics. Compute the least squares regression equations of Y on X and of X on Y.
x=c(59, 65, 45, 52, 60, 62, 70)
y=c(75, 70, 55, 65, 60, 69, 80)
n=length(x)
u=x-52
v=y-65
df=data.frame(x,y,u,v,uv=u*v,u2=u*u,v2=v*v)
df
## x y u v uv u2 v2
## 1 59 75 7 10 70 49 100
## 2 65 70 13 5 65 169 25
## 3 45 55 -7 -10 70 49 100
## 4 52 65 0 0 0 0 0
## 5 60 60 8 -5 -40 64 25
## 6 62 69 10 4 40 100 16
## 7 70 80 18 15 270 324 225
apply(df,2,sum)
## x y u v uv u2 v2
## 413 474 49 19 475 755 491
A=matrix(c(n,sum(x),sum(x),sum(x^2)),nrow = 2)
A
## [,1] [,2]
## [1,] 7 413
## [2,] 413 24779
b=c(sum(y),sum(x*y))
print(b)
## [1] 474 28308
solve(A,b)
## [1] 18.7385576 0.8300971
We can use R's built-in lm() function directly:
model=lm(y~x)
model
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 18.7386 0.8301
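The problem also asks for the regression of X on Y; a short sketch of that second equation:

# regression of X on Y: coefficients come to about 6.299 and 0.778
lm(x~y)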
If a student gets 61 marks in Economics, what would you estimate his marks in Maths to be?
# model2=lm(v~u,df)
# model2
model_61=predict(model,newdata = data.frame(x=61))
print(model_61)
## 1
## 69.37448
The line of regression of Y on X can also be written in terms of r:
\[ Y-\bar{Y}=r\, \frac{\sigma_y}{\sigma_x} (X-\bar{X}) \]
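A quick check of this identity on the marks data above (the (n-1) factors in the sample standard deviations cancel in the ratio):

b1=cor(x,y)*sd(y)/sd(x)
b0=mean(y)-b1*mean(x)
round(c(b0,b1),4)  # matches the lm coefficients 18.7386 and 0.8301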
### Example 9.1

Fit a straight line to the following data.

X: 1 2 3 4 6 8
Y: 2.4 3 3.6 4 5 6
X=c(1,2,3,4,6,8)
Y=c(2.4,3,3.6,4,5,6)
lm(Y~X)
##
## Call:
## lm(formula = Y ~ X)
##
## Coefficients:
## (Intercept) X
## 1.9765 0.5059
df=data.frame(X,Y,X*Y,X^2)
df
## X Y X...Y X.2
## 1 1 2.4 2.4 1
## 2 2 3.0 6.0 4
## 3 3 3.6 10.8 9
## 4 4 4.0 16.0 16
## 5 6 5.0 30.0 36
## 6 8 6.0 48.0 64
srow=apply(df,2,sum)
srow
## X Y X...Y X.2
## 24.0 24.0 113.2 130.0
sx=sum(X)
sy=sum(Y)
sxy=sum(X*Y)
sx2=sum(X*X)
n=length(X)
xmat=matrix(c(n,sx,sx,sx2),nrow = 2)
xmat
## [,1] [,2]
## [1,] 6 24
## [2,] 24 130
b=c(sy,sxy)
b
## [1] 24.0 113.2
solve(xmat,b)
## [1] 1.9764706 0.5058824
### Example 9.2

Fit a parabola of second degree to the following data:

X: 0 1 2 3 4
Y: 1 1.8 1.3 2.5 6.3
X=c(0,1,2,3,4)
Y=c(1,1.8,1.3,2.5,6.3)
df=data.frame(X,Y,X^2,X^3,X^4,X*Y,X^2*Y)
df
## X Y X.2 X.3 X.4 X...Y X.2...Y
## 1 0 1.0 0 0 0 0.0 0.0
## 2 1 1.8 1 1 1 1.8 1.8
## 3 2 1.3 4 8 16 2.6 5.2
## 4 3 2.5 9 27 81 7.5 22.5
## 5 4 6.3 16 64 256 25.2 100.8
apply(df,2,sum)
## X Y X.2 X.3 X.4 X...Y X.2...Y
## 10.0 12.9 30.0 100.0 354.0 37.1 130.3
n=length(X)
sx=sum(X)
sy=sum(Y)
sx2=sum(X^2)
sx3=sum(X^3)
sx4=sum(X^4)
sxy=sum(X*Y)
sx2y=sum(X^2*Y)
dmat=matrix(c(n,sx,sx2,sx,sx2,sx3,sx2,sx3,sx4),nrow = 3)
dmat
## [,1] [,2] [,3]
## [1,] 5 10 30
## [2,] 10 30 100
## [3,] 30 100 354
b=c(sy,sxy,sx2y)
b
## [1] 12.9 37.1 130.3
solve(dmat,b)
## [1] 1.42 -1.07 0.55
A simple R command gives the same coefficients:
lm(Y~X+I(X^2))
##
## Call:
## lm(formula = Y ~ X + I(X^2))
##
## Coefficients:
## (Intercept) X I(X^2)
## 1.42 -1.07 0.55
### Example 9.4

Fit a second degree parabola to the following data:

X: 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0
Y: 1.1, 1.3, 1.6, 2.6, 2.7, 3.4, 4.1
X=c(1.0,1.5,2.0,2.5,3.0,3.5,4.0)
Y=c(1.1,1.3,1.6,2.6,2.7,3.4,4.1)
lm(Y~X+I(X^2))
##
## Call:
## lm(formula = Y ~ X + I(X^2))
##
## Coefficients:
## (Intercept) X I(X^2)
## 0.5214 0.3786 0.1286
### Example 9.5

Fit an exponential curve of the form \(Y=a\, b^x\) to the following data:

X: 1 2 3 4 5 6 7 8
Y: 1.0 1.2 1.8 2.5 3.6 4.7 6.6 9.1
X=c(1,2,3,4,5,6,7,8)
Y=c(1.0,1.2,1.8,2.5,3.6,4.7,6.6,9.1)
y=round(log10(Y),3)
df=data.frame(X,Y,y)
df
## X Y y
## 1 1 1.0 0.000
## 2 2 1.2 0.079
## 3 3 1.8 0.255
## 4 4 2.5 0.398
## 5 5 3.6 0.556
## 6 6 4.7 0.672
## 7 7 6.6 0.820
## 8 8 9.1 0.959
fit=lm(y~X)
# y = log10(Y), so back-transform the coefficients with 10^
b=10^coef(fit)[2]
a=10^coef(fit)[1]
round(data.frame(a,b),3)
## a b
## (Intercept) 0.682 1.383
#alternatively
model_expo<-nls(Y~a*(b^X),
start = list(a = 0.5, b = 0.2))
round(coef(model_expo),3)
## a b
## 0.690 1.381
Fit an equation of the form \(Y = ab^X\) to the following data:

X: 2 3 4 5 6
Y: 144 172.8 207.4 248.8 298.6

Ans. \(Y= 101.3\, (1.1961)^X\)
X=c(2,3,4,5,6)
Y=c(144,172.8,207.4,248.8,298.6)
# direct non-linear least squares fit with nls()
model_expo<-nls(Y~a*(b^X),
start = list(a = 0.5, b = 1.0))
round(coef(model_expo),4)
## a b
## 100.0089 1.2000
X=c(2,3,4,5,6)
Y=c(144,172.8,207.4,248.8,298.6)
y=log(Y)
x2=X^2
xy=X*y
n=length(X)
df=data.frame(X,Y,y,x2,xy)
df
## X Y y x2 xy
## 1 2 144.0 4.969813 4 9.939627
## 2 3 172.8 5.152135 9 15.456405
## 3 4 207.4 5.334649 16 21.338597
## 4 5 248.8 5.516649 25 27.583247
## 5 6 298.6 5.699105 36 34.194629
srow=round(apply(df,2,sum),3)
srow
## X Y y x2 xy
## 20.000 1071.600 26.672 90.000 108.513
# normal equations for the log-linear fit: note x2 (not xy) in the matrix
xmat=matrix(c(n,sum(X),sum(X),sum(x2)),nrow = 2)
rhs=c(sum(y),sum(xy))
rhs
## [1] 26.67235 108.51250
xmat
##      [,1] [,2]
## [1,]    5   20
## [2,]   20   90
# y = log(Y) uses natural logarithms, so back-transform with exp(), not 10^
round(exp(solve(xmat,rhs)),3)
## [1] 100.006   1.200