Correlation and Regression

V.Rajaraman

There is a close relationship between correlation and linear regression coefficients. Let us illustrate this with a simple example:

First, generate some data in which x and y are related by the linear equation y = a * x + b:

x = rnorm(100, 8)
y = 3 * x + 10 + rnorm(100, 0, 0.5)

I have chosen a slope a = 3 and an intercept b = 10 for this example, and added some random noise to make the data look 'natural'.

Let us plot the data points and the regression line:

plot(x, y)
# fit a linear regression line:
lm1 = lm(y ~ x)
abline(lm1, col = "red")

[Figure: scatter plot of x and y with the fitted regression line in red]

Examine the fitted linear model:

lm1
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##        9.76         3.03

So the estimated slope is 3.03 and the estimated intercept is 9.76, both close to the true values of 3 and 10.

Save the means of x and y for later use:

(mx = mean(x))
## [1] 8.197
(my = mean(y))
## [1] 34.59

Computing the intercept

The first insight we get is:

mean(y) = a * mean(x) + b

# this is what we estimated for intercept and slope:
round(coef(lm1), 2)
## (Intercept)           x 
##        9.76        3.03
# compute a*mx + b
computed.mean = coef(lm1)[2] * mx + coef(lm1)[1]
as.numeric(computed.mean)
## [1] 34.59
my  # compare with the actual mean(y)
## [1] 34.59
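
The numerical check above can also be explained algebraically. The following is a short sketch, assuming the line is fitted by ordinary least squares (which is what lm() does); the symbols S, x_i, y_i, n and the bars for means are notation introduced here, not part of the original code. Setting the derivative of the sum of squared residuals with respect to the intercept to zero gives the relation directly:

```latex
\begin{align*}
S(a, b) &= \sum_{i=1}^{n} (y_i - a x_i - b)^2 \\
\frac{\partial S}{\partial b} &= -2 \sum_{i=1}^{n} (y_i - a x_i - b) = 0 \\
\sum_{i=1}^{n} y_i &= a \sum_{i=1}^{n} x_i + n b \\
\bar{y} &= a \bar{x} + b
\end{align*}
```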

A direct corollary of this: The regression line always passes through the point (mx, my).

plot(x, y)
abline(lm1, col = "red")
abline(v = mx, col = "gray", lty = 2)
abline(h = my, col = "gray", lty = 2)
points(mx, my, pch = 19, cex = 2, col = "red")

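[Figure: the same scatter plot and regression line, with dashed gray lines at mean(x) and mean(y) and a red dot where they cross]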

The gray lines mark the means of x and y.

Relation with Correlation

The slope of the regression line, which we have called a, is closely related to the correlation between x and y:

# print the correlation and standard deviations of x and y
cor(x, y)
## [1] 0.9845
sd(x)
## [1] 0.9278
sd(y)
## [1] 2.855
# the formula for computing the slope is:
cor(x, y) * sd(y)/sd(x)
## [1] 3.029
coef(lm1)[2]  # compare with our slope
##     x 
## 3.029

So the formula for the slope is this:
slope of regression line = cor(x,y) * sd(y)/ sd(x)
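
For readers who want to see where this formula comes from, here is a brief derivation sketch under the assumption of an ordinary least squares fit. The notation (S, r for cor(x,y), s_x and s_y for the standard deviations, bars for means) is introduced here for illustration. It substitutes b = mean(y) - a * mean(x) into the sum of squared residuals and minimizes over a:

```latex
\begin{align*}
S(a) &= \sum_{i=1}^{n} \bigl( (y_i - \bar{y}) - a (x_i - \bar{x}) \bigr)^2 \\
\frac{dS}{da} &= -2 \sum_{i=1}^{n} (x_i - \bar{x}) \bigl( (y_i - \bar{y}) - a (x_i - \bar{x}) \bigr) = 0 \\
a &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
   = \frac{\operatorname{cov}(x, y)}{s_x^2} \\
r &= \frac{\operatorname{cov}(x, y)}{s_x s_y}
\quad\Longrightarrow\quad
a = r \, \frac{s_y}{s_x}
\end{align*}
```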

Once you have found the slope a, it is easy to calculate the intercept b using the formula we verified earlier:
my = a * mx + b
This can be rearranged as:
b = my - a * mx
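
The two formulas can be combined to recover both coefficients without calling lm() at all. The following is a minimal sketch on freshly generated data; the seed and the names x2, y2, a, b and fit are assumptions made for this illustration and do not appear in the example above.

```r
# Hypothetical data, generated the same way as in the example above.
# The seed is arbitrary and only makes the sketch reproducible.
set.seed(42)
x2 = rnorm(100, 8)
y2 = 3 * x2 + 10 + rnorm(100, 0, 0.5)

# slope from the correlation and the two standard deviations
a = cor(x2, y2) * sd(y2) / sd(x2)
# intercept from the two means
b = mean(y2) - a * mean(x2)

# compare with the coefficients estimated by lm();
# coef() returns the intercept first, then the slope
fit = lm(y2 ~ x2)
all.equal(c(b, a), as.numeric(coef(fit)))
```

The comparison uses all.equal() rather than == because the two routes to the coefficients can differ by floating-point rounding.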

Is this 'normal'?

Normalization involves dividing a variable by its standard deviation. If x is any random variable and you divide every value of x by sd(x), the resulting new variable will have a standard deviation of 1.

This technique is often combined with shifting the mean to the origin. To shift the mean of x to the origin, you simply subtract mean(x) from every value of x.

Let us normalize our x and y now. We call the normalized variables norx and nory.

norx = (x - mean(x))/sd(x)
nory = (y - mean(y))/sd(y)
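
To make the two steps of normalization concrete, here is a small sketch on a hypothetical variable v (the seed and the names v, scaled.only and normalized are assumptions for illustration). It shows that dividing by the standard deviation alone fixes the spread but not the mean, and that base R's scale() function performs the same shift-and-divide operation.

```r
set.seed(1)               # arbitrary seed, only for reproducibility
v = rnorm(100, 8, 2)      # hypothetical variable with mean near 8 and sd near 2

scaled.only = v / sd(v)             # divide by sd only
normalized = (v - mean(v)) / sd(v)  # subtract the mean, then divide by sd

sd(scaled.only)      # 1, up to floating-point error
mean(scaled.only)    # not zero: equals mean(v) / sd(v)
sd(normalized)       # 1, up to floating-point error
mean(normalized)     # zero, up to floating-point error

# scale() centers and scales by default; it returns a one-column matrix,
# so convert it back to a plain vector before comparing
all.equal(normalized, as.numeric(scale(v)))
```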

Plot the new variables, along with their regression line:

plot(norx, nory)
lm1 = lm(nory ~ norx)
abline(lm1, col = "red")

# as expected, the new means are approximately zero:
(mx = mean(norx))
## [1] 7.006e-16
(my = mean(nory))
## [1] -5.043e-16
abline(v = mx, col = "gray", lty = 2)
abline(h = my, col = "gray", lty = 2)
points(mx, my, pch = 19, cex = 2, col = "red")

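[Figure: scatter plot of norx and nory with the regression line in red, dashed gray lines at the means, and a red dot at their intersection]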

The plot looks very similar to the previous plot of the non-normalized x and y, but look at the axes and the mean values: both variables are now centered at zero and measured in units of their own standard deviation.

Now compare the slope of the new regression line with the correlation coefficient of x and y:

lm1  # slope is the second element
## 
## Call:
## lm(formula = nory ~ norx)
## 
## Coefficients:
## (Intercept)         norx  
##   -1.12e-15     9.85e-01
cor(x, y)  # correlation coefficient of the original variables
## [1] 0.9845

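We took the slope of the regression line of the normalized variables norx and nory, and then computed the correlation coefficient of the original variables x and y. They are identical (0.985 and 0.9845 are the same value at different print precision). This follows from the slope formula: normalizing does not change the correlation, and it makes both standard deviations equal to 1, so cor * sd(nory)/sd(norx) reduces to the correlation itself.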
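
The same comparison can be run end to end on a fresh sample. This is a hedged sketch rather than part of the original analysis: the seed and the names u, w, noru, norw, r.original, r.normalized and slope.normalized are all assumptions chosen for the illustration.

```r
# Hypothetical data with the same structure as the example above
set.seed(7)
u = rnorm(100, 8)
w = 3 * u + 10 + rnorm(100, 0, 0.5)

# normalize both variables: mean 0, standard deviation 1
noru = (u - mean(u)) / sd(u)
norw = (w - mean(w)) / sd(w)

r.original = cor(u, w)          # correlation of the original variables
r.normalized = cor(noru, norw)  # correlation after normalization
slope.normalized = as.numeric(coef(lm(norw ~ noru))[2])  # slope on normalized data

# all three quantities should agree up to floating-point error
all.equal(r.original, r.normalized)
all.equal(r.original, slope.normalized)
```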