In memoriam: Karl Pearson (27/03/1857 - 27/04/1936)
Introduction
Today (27/03/2017), we celebrate the 160th birthday of Karl Pearson by demonstrating the link between his famous product-moment correlation coefficient (r) and the linear regression model. This note belongs to our tutorial series on linear models in the ESMS project. Last time, we introduced the least squares algorithm and the syntax of the lm() function in base R. The present document shows how to explore the output of a linear model.
Study problem
In this chapter, we continue our experiment on the cardiac catheterisation dataset. Our study question focuses on a simple linear regression model that estimates the catheter length (Y) from the patient's height (X).
Data preparation, Simple linear regression analysis and exploration
library(tidyverse)

# Load the cardiac catheterisation data from the robustbase package
data(heart, package = "robustbase")
heart = heart %>% as_tibble()
heart$clength = as.numeric(heart$clength)

# Means, raw variables and sample size
muX = mean(heart$height)    # mean height
muY = mean(heart$clength)   # mean catheter length
X = heart$height            # predictor (cm)
Y = heart$clength           # response (cm)
N = 12                      # sample size

# Deviation scores
x = X - muX
y = Y - muY
We introduce two vectors: the response variable (Y = length of the catheter, in cm) and the predictor (X = height, in cm); the sample size N is 12. The vectors x and y hold the deviation scores of X and Y, respectively:
\[ x_{i}=X_{i}-\overline{X} \quad \text{and} \quad y_{i}=Y_{i}-\overline{Y} \]
The product of the two deviations x_i and y_i is called a cross-product.
The mean of all cross-products gives the covariance between X and Y:
\[ Cov(XY)=\frac{\sum_{i=1}^{N}(x_{i}y_{i})}{N} \]
The covariance of a variable with itself is its variance. For example, the variance of Y can be estimated as:
\[ Var(Y)= \sum_{i=1}^{N}\frac{y_{i}^{2}}{N} = \frac{N\sum_{i=1}^{N}(Y_{i}^{2})-(\sum_{i=1}^{N}Y_{i})^2 }{N^2} \]
The standard deviation (Sd) of a variable is the square root of its variance:
\[ Sd(Y) = \sqrt{Var(Y)} \]
Pearson's product-moment correlation coefficient (r) between X and Y, also known as the correlation coefficient, is defined by:
\[ r_{X,Y}=\frac{Cov(XY)}{Sd(X) \cdot Sd(Y)} \]
This can be verified by comparing a manual calculation of the r coefficient with the output of the cor() function:
cor(X,Y)
## [1] 0.8927873
cov(X,Y)/(sd(X)*sd(Y))
## [1] 0.8927873
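As a cross-check, here is a minimal sketch that computes r directly from the deviation vectors x and y defined above, following the formulas with N in the denominator (R's built-in cov() and sd() divide by N - 1 instead, but the factor cancels out in r); the helper names cov_xy, sd_x and sd_y are ours, for illustration only:
# Manual computation from the deviation scores (denominator N, as in the formulas above)
cov_xy = sum(x * y) / N        # covariance of X and Y
sd_x   = sqrt(sum(x^2) / N)    # standard deviation of X
sd_y   = sqrt(sum(y^2) / N)    # standard deviation of Y
cov_xy / (sd_x * sd_y)         # Pearson's r, identical to cor(X, Y)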
Pearson's r was 0.89, indicating a strong, positive correlation between X and Y.
As mentioned in the 3rd tutorial (http://rpubs.com/ledongnhatnam/254581), Pearson's r measures the strength and direction of the linear relationship between two variables.
Now we return to our main topic: a simple linear regression analysis. Our objective is to determine the regression equation, i.e. the optimal values of the two regression coefficients b0 (intercept) and b1 (slope), such that the estimation error for Y is minimised.
As mentioned before, the regression model summarises the original 12 observations by a single line in the two-dimensional (X, Y) space. Among all the lines that could be drawn through this space, the regression line is the only one with the smallest sum of squared errors.
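To make the "smallest sum of squared errors" claim concrete, here is a minimal sketch; the helper sse() and the candidate coefficient values are ours, chosen only for illustration:
# Sum of squared errors of Y around a candidate line with intercept b0 and slope b1
sse = function(b0, b1) sum((Y - (b0 + b1 * X))^2)
sse(0, 0.8)          # an arbitrary line
sse(20, 0.5)         # another arbitrary line
sse(11.479, 0.612)   # approximately the lm() estimates shown below: the smallest of the three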
plot(y=Y,x=X,pch=16)
The figure below shows the total sum of squares (SST, in blue), which is defined as:
\[ SST = \sum_{i=1}^{N}(Y_{i}-\overline{Y})^2 \]
SST=sum(y^2)
SST
## [1] 735.6667
plot(y = Y, x = X, pch = 16)
segments(x0 = X, y0 = Y, x1 = X, y1 = muY, col = "blue", lty = 1)   # deviations of Y from its mean
abline(h = muY, col = "blue", lty = 3)                              # horizontal line at the mean of Y
The total sum of squares is 735.6667.
Now we build a simple linear model using the lm() function:
mod=lm(Y~X)
summary(mod)
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.0299 -0.6908 -0.2175 1.1908 6.3345
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.47898 4.09405 2.804 0.0187 *
## X 0.61171 0.09761 6.267 9.3e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.864 on 10 degrees of freedom
## Multiple R-squared: 0.7971, Adjusted R-squared: 0.7768
## F-statistic: 39.28 on 1 and 10 DF, p-value: 9.295e-05
The model summary output provides the following information:
The regression coefficients: the intercept b0 and the slope b1 for predictor X, together with their standard errors, t values and p-values for testing the hypothesis that b0 and b1 are zero. We will explain these elements later.
The residual standard error, where a residual is the difference between an observed value and the corresponding predicted value.
R-squared, or the coefficient of determination, a statistic for the model's goodness of fit. Both the original value (multiple R-squared) and the adjusted value are reported; we will explain these statistics later in this chapter.
The F-statistic: this element will be explained in the next chapter.
In this document we will focus on two elements: the regression coefficients b0 and b1, and the coefficient of determination (R-squared). We will also show the link between these elements and Pearson's r.
First, we generate a vector of predictions, i.e. the estimated values of Y for the observed X:
pred = predict(mod)   # fitted values of the model on the original data
The following plot shows the first component of our sum of squares: the residual sum of squares (in red), i.e. the part of the variation that cannot be explained by the model.
plot(y = Y, x = X, pch = 16)
abline(mod, col = "red")                                            # fitted regression line
abline(h = muY, col = "blue", lty = 3)                              # mean of Y
segments(x0 = X, y0 = Y, x1 = X, y1 = pred, col = "red", lty = 1)   # residuals: observed minus predicted
The residual sum of squares is defined as:
\[ SSR = \sum_{i=1}^{N}(Y_{i}-\hat{Y}_{i})^2 \]
The second component is the model sum of squares (SSM), i.e. the part of the variation that is explained by the model:
plot(y = Y, x = X, pch = 16)
abline(mod, col = "red")                                                 # fitted regression line
abline(h = muY, col = "blue", lty = 3)                                   # mean of Y
segments(x0 = X, y0 = pred, x1 = X, y1 = muY, col = "purple", lty = 1)   # predicted values minus the mean
\[ SSM = \sum_{i=1}^{N}(\hat{Y}_{i}-\overline{Y})^2 \]
We can now compute SSR, SSM and the residual mean square (RMS):
SSR = sum((Y - pred)^2)            # residual sum of squares
SSM = sum((mean(pred) - pred)^2)   # model sum of squares (mean(pred) equals mean(Y))
RMS = (SST - SSM) / (N - 2)        # residual mean square: SSR divided by its degrees of freedom
rbind(SST,SSR,SSM,RMS)
## [,1]
## SST 735.66667
## SSR 149.28940
## SSM 586.37727
## RMS 14.92894
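As a quick consistency check (a minimal sketch reusing the objects computed above), SST decomposes into SSM + SSR; the ratio SSM/SST reproduces the multiple R-squared, and the square root of RMS reproduces the residual standard error reported by summary(mod):
all.equal(SST, SSM + SSR)   # TRUE: the total variation splits into model and residual parts
SSM / SST                   # multiple R-squared (0.7971 in the summary above)
sqrt(RMS)                   # residual standard error (3.864 in the summary above)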
These results can also be found in the ANOVA table of the model:
summary.aov(mod)
## Df Sum Sq Mean Sq F value Pr(>F)
## X 1 586.4 586.4 39.28 9.3e-05 ***
## Residuals 10 149.3 14.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Link between Pearson's r and the linear regression model
There are close relationships between Pearson's r coefficient and the simple linear regression model:
\[ b_{1} = \frac{Cov(XY)}{Var(X)} = r_{X,Y}\frac{Sd(Y)}{Sd(X)} \]
\[ b_{0} = \overline{Y} - b_{1}\overline{X} \]
\[ |r_{X,Y}| = \sqrt{R^{2}} \]
(The sign of r is the same as the sign of b1; here both are positive.)
We can easily verify these relationships as follows:
cor_r    = cor(X, Y)                                     # r from cor()
manual_r = cov(X, Y) / (sd(X) * sd(Y))                   # r from the covariance definition
sqrt_R2  = mod %>% summary() %>% .$r.squared %>% sqrt()  # square root of R-squared
b1       = cor(X, Y) * (sd(Y) / sd(X))                   # slope recovered from r
b1_r     = b1 / (sd(Y) / sd(X))                          # r recovered from the slope
b0       = mean(Y) - b1 * mean(X)                        # intercept from the means
rbind(cor_r,manual_r,sqrt_R2,b1_r)
## [,1]
## cor_r 0.8927873
## manual_r 0.8927873
## sqrt_R2 0.8927873
## b1_r 0.8927873
rbind(b0,b1)
## [,1]
## b0 11.4789756
## b1 0.6117124
mod%>%summary()%>%.$coefficients%>%.[,1]
## (Intercept) X
## 11.4789756 0.6117124
Discussion
Both r and the regression coefficient b1 can be used to evaluate the association between X and Y. Since b1 = r·Sd(Y)/Sd(X) and standard deviations are positive, r and b1 always have the same sign, so both are good indicators of the direction (positive or negative) of the relationship between Y and X.
When b1 is truly zero, so is r. The statistical significance of both r and b1 can therefore be tested with the same t test. Indeed, the t values are identical for b1 and for r (t = 6.267208), and so are the p-values:
cor.test(X,Y)
##
## Pearson's product-moment correlation
##
## data: X and Y
## t = 6.2672, df = 10, p-value = 9.295e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6539530 0.9697937
## sample estimates:
## cor
## 0.8927873
mod%>%summary()%>%.$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.4789756 4.09405157 2.803818 1.867185e-02
## X 0.6117124 0.09760525 6.267208 9.295117e-05
The regression coefficient carries more information and is easier to interpret than Pearson's r. Therefore, linear regression can be considered the better tool for investigating a potential causal relationship between a response variable Y and a factor/covariate X. For instance, only b1 can be interpreted as "Y increases by b1 units for each one-unit increase in X", while the r value only measures the strength and direction of the correlation.
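To illustrate this interpretation, here is a minimal sketch; the two heights (40 and 41 cm) are hypothetical values chosen only to differ by one unit:
# Predicted catheter length for two hypothetical heights one unit apart
predict(mod, newdata = data.frame(X = c(40, 41)))
diff(predict(mod, newdata = data.frame(X = c(40, 41))))   # equals b1, approximately 0.6117
Next, we compare the output of cor.test() with a regression fitted on the standardised variables: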
cor.test(X,Y)
##
## Pearson's product-moment correlation
##
## data: X and Y
## t = 6.2672, df = 10, p-value = 9.295e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6539530 0.9697937
## sample estimates:
## cor
## 0.8927873
lm(scale(Y)~scale(X))%>%summary()%>%.$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.275046e-16 0.1363891 3.134449e-15 1.000000e+00
## scale(X) 8.927873e-01 0.1424538 6.267208e+00 9.295117e-05
As we can see, Pearson's r is now equal to the model's b1 value. Unlike b, Pearson's r is not influenced by the scale or unit of measurement. Be careful when applying a transformation (Box-Cox, for example) to the response variable or the predictor, because the b value can then no longer be interpreted on the original scale.
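A quick check of this point, as a minimal sketch using a hypothetical change of unit (cm to inches; the name Y_inch is ours):
# Rescaling Y changes b1 but leaves Pearson's r untouched
Y_inch = Y / 2.54
cor(X, Y_inch)               # same r as before: 0.8927873
coef(lm(Y_inch ~ X))["X"]    # b1 is divided by 2.54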
It has been shown that the r value is more sensitive to restriction of range in X, i.e. to the range of measurement. This occurs in experiments where the values of X are controlled and manipulated by the study design and therefore do not represent the full range of X in the real-world population. The regression coefficient is much less affected by such restriction, so a linear regression analysis is more appropriate than Pearson's r for designed experiments. Regression analysis is also better suited to assessing the potential effect of X on the estimation of Y, or to interpretation as a causal relationship.
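The restriction-of-range effect can be illustrated with a small simulation (a sketch with assumed parameters, not part of the original data): restricting X attenuates r markedly, while the slope estimate stays close to its true value.
set.seed(1)
x_full = runif(1000, 0, 10)               # X over its full range
y_full = 2 + 0.5 * x_full + rnorm(1000)   # true slope = 0.5
keep = x_full > 4 & x_full < 6            # keep only a restricted range of X
c(r_full = cor(x_full, y_full), r_restricted = cor(x_full[keep], y_full[keep]))
c(b1_full = coef(lm(y_full ~ x_full))[2], b1_restricted = coef(lm(y_full[keep] ~ x_full[keep]))[2])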
When Y depends on more than one factor/covariate, the Pearson's r between Y and a single predictor is reduced by each additional source of variation, whereas the regression coefficient of that predictor is not affected in the same way. In this sense, Pearson's r measures the relationship relative to the other cofactors or covariates, while the b coefficient measures the absolute size of the relationship, isolating it from the other factors.
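A small simulation sketch of this point (assumed parameters, not part of the original data): when Y also depends on a second covariate, the simple Pearson's r between Y and x1 is diluted by the extra variation, while the regression coefficient of x1 in the multiple model still recovers the size of its effect.
set.seed(2)
x1 = rnorm(500); x2 = rnorm(500)
y_sim = 1 + 2 * x1 + 2 * x2 + rnorm(500)   # Y depends on two covariates
cor(x1, y_sim)                             # clearly below 1, diluted by the variation due to x2
coef(lm(y_sim ~ x1 + x2))["x1"]            # close to the true coefficient of 2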
References:
Richard B. Darlington and Andrew F. Hayes. Regression Analysis and Linear Models, 2nd Ed. The Guilford Press (2017).
Julian J. Faraway. Linear Models with R, 2nd Ed. CRC Press (2017).
See you next time, and thank you for joining us!