Case study 5.2: Linear models, Part 2: Anatomy of a simple linear model

In memoriam to Karl Pearson (27/03/1857 - 27/04/1936)

Introduction

Today (27/03/2017), we celebrate the 160th birthday of Karl Pearson by demonstrating the link between his famous product-moment correlation coefficient (r) and the linear regression model. This note belongs to our tutorial series on linear models in the ESMS project. Last time, we introduced the least squares algorithm and the syntax of the lm( ) function in base R. The present document will show you how to explore the output of a linear model.

Study problem

In this chapter, we continue our experiment on the cardiac catheterisation dataset. Our study question focuses on a simple linear regression model that estimates the catheter length (Y) from the patient's height (X).

Data preparation, Simple linear regression analysis and exploration

library(tidyverse)

# load the cardiac catheterisation dataset from the robustbase package
data(heart,package="robustbase")

heart=heart%>%as_tibble()
heart$clength=as.numeric(heart$clength)

muX=mean(heart$height)    # mean height
muY=mean(heart$clength)   # mean catheter length
X=heart$height
Y=heart$clength
N=12                      # sample size
x=X-muX                   # deviation scores of X
y=Y-muY                   # deviation scores of Y

We define two vectors: the response variable (Y = length of the catheter, in cm) and the predictor (X = height of the patient, in cm); the sample size N is 12. The vectors x and y contain the deviation scores of X and Y, respectively:

\[ x_{i}=X_{i}-\overline{X} \quad \text{and} \quad y_{i}=Y_{i}-\overline{Y} \]

The product of the two deviations xi and yi is called the cross-product.

The mean of all cross-products gives the covariance between X and Y:

\[ Cov(X,Y)=\frac{\sum_{i=1}^{N}x_{i}y_{i}}{N} \]
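
As a quick sanity check, the covariance can be computed directly from the deviation scores. Note that the formula above divides by N (the population form), while R's built-in cov( ) divides by N-1, so the two values differ by a factor of (N-1)/N. A minimal sketch:

sum(x*y)/N           # covariance from the deviation scores (divides by N)
cov(X,Y)             # R's cov() divides by N-1
cov(X,Y)*(N-1)/N     # rescaled to match the manual value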

The covariance of a variable with itself is its variance. For example, the variance of Y can be estimated as:

\[ Var(Y)= \frac{\sum_{i=1}^{N} y_{i}^{2}}{N} = \frac{N\sum_{i=1}^{N}Y_{i}^{2}-\left(\sum_{i=1}^{N}Y_{i}\right)^{2}}{N^{2}} \]

The standard deviation (Sd) of a variable is the square root of its variance:

\[ Sd(Y) = \sqrt{Var(Y)} \]
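
The same N versus N-1 caveat applies here: the formulas above use N, while R's var( ) and sd( ) use N-1. We can also verify that the two expressions for Var(Y) given above agree. A minimal sketch:

sum(y^2)/N                    # variance from the deviation scores
(N*sum(Y^2)-sum(Y)^2)/N^2     # computational formula, same value
sqrt(sum(y^2)/N)              # population standard deviation
sd(Y)*sqrt((N-1)/N)           # R's sd() rescaled to the population form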

The Pearson product-moment correlation coefficient (r) between X and Y, also known as the correlation coefficient, is defined by:

\[ r_{X,Y}=\frac{Cov(X,Y)}{Sd(X)\,Sd(Y)} \]

This can be verified by comparing the manual calculation of the r coefficient with the output of the cor( ) function:

cor(X,Y)
## [1] 0.8927873
cov(X,Y)/(sd(X)*sd(Y))
## [1] 0.8927873

The Pearson's r coefficient was 0.89, indicating a strong, positive correlation between X and Y.

As mentioned in the 3rd tutorial (http://rpubs.com/ledongnhatnam/254581), the Pearson's r coefficient measures the strength and direction of the linear relationship between two variables.

Now we return to our main topic: a simple linear regression analysis. Our objective is to determine the regression equation, i.e. the optimal values of the two regression coefficients b0 (intercept) and b1 (slope), such that the estimation error for Y is minimised.

As mentioned before, our regression model allows us to reduce the original 12-dimensional data down to a two-dimensional space. Among all the lines that could be drawn through this two-dimensional space, our regression line is the one with the smallest sum of squared errors.
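
To illustrate this least-squares property, the short sketch below fits the model (the same model is built formally as mod further down) and compares its sum of squared errors with two hypothetical alternative lines; the fitted line always gives the smallest value.

sse=function(b0,b1) sum((Y-(b0+b1*X))^2)   # sum of squared errors for the line Y = b0 + b1*X

fit=lm(Y~X)
b=coef(fit)

sse(b[1],b[2])          # SSE of the least-squares line
sse(b[1]+1,b[2])        # shifting the intercept increases the SSE
sse(b[1],b[2]+0.05)     # changing the slope also increases the SSE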

plot(y=Y,x=X,pch=16)

The figure below shows the total sum of squares (SST, blue segments), which is defined by:

\[ SST = \sum_{i=1}^{N}(Y_{i}-\overline{Y})^2 \]

SST=sum(y^2)

SST
## [1] 735.6667
plot(y=Y,x=X,pch=16)
segments(x0=X,y0=Y,x1=X,y1=muY,col="blue",lty=1)
abline(h=muY,col="blue",lty=3)

We obtain a total sum of squares of 735.6667.

Now we build a simple linear model using the lm( ) function:

mod=lm(Y~X)

summary(mod)
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.0299 -0.6908 -0.2175  1.1908  6.3345 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.47898    4.09405   2.804   0.0187 *  
## X            0.61171    0.09761   6.267  9.3e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.864 on 10 degrees of freedom
## Multiple R-squared:  0.7971, Adjusted R-squared:  0.7768 
## F-statistic: 39.28 on 1 and 10 DF,  p-value: 9.295e-05

The model summary output provides the following information (a short extraction sketch follows this list):

  1. The regression coefficients: the intercept b0 and the slope b1 for predictor X, together with their standard errors, t values and p-values for testing the null hypothesis that b0 or b1 equals zero. We will explain these elements later.

  2. The residual standard error, where a residual is the difference between the observed value and the predicted value.

  3. The R-squared, or coefficient of determination, a statistic for the model's goodness of fit. Both the original value (Multiple R-squared) and the adjusted value are reported. We will explain these statistics later in this chapter.

  4. The F-statistic: this element will be explained in the next chapter.
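
For reference, each of these elements can be extracted from the summary object programmatically. A minimal sketch (the component names below are the standard ones stored in a summary.lm object):

s=summary(mod)
s$coefficients   # estimates, standard errors, t values and p-values
s$sigma          # residual standard error
s$r.squared      # multiple R-squared
s$adj.r.squared  # adjusted R-squared
s$fstatistic     # F statistic with its degrees of freedom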

In this document we will focus on two elements: the regression coefficients (b0 and b1) and the coefficient of determination (R-squared). We will also show how these elements are linked to the Pearson's r coefficient.

First, we generate a new vector of predictions, i.e. the estimated Y for each observed X:

pred=predict(mod)   # fitted values for the observed X

The following plot illustrates the first component of our sum of squares: the residual sum of squares (in red), i.e. the part of the variation that cannot be explained by the model.

plot(y=Y,x=X,pch=16)
abline(mod,col="red")
abline(h=muY,col="blue",lty=3)
segments(x0=X,y0=Y,x1=X,y1=pred,col="red",lty=1)

The residual sum of squares is defined by:

\[ SSR = \sum_{i=1}^{N}(Y_{i}-\hat{Y}_{i})^2 \]

The second component of the sum of squares is the model sum of squares (SSM), the part of the variation that is explained by the model:

plot(y=Y,x=X,pch=16)
abline(mod,col="red")
abline(h=muY,col="blue",lty=3)
segments(x0=X,y0=pred,x1=X,y1=muY,col="purple",lty=1)

\[ SSM = \sum_{i=1}^{N}(\hat{Y}_{i}-\overline{Y})^2 \]

We can now estimate the SSR, the SSM and the residual mean square (RMS):

SSR=sum((Y-pred)^2)           # residual sum of squares
SSM=sum((mean(pred)-pred)^2)  # model sum of squares (mean(pred) equals mean(Y))
RMS=(SST-SSM)/(N-2)           # residual mean square, i.e. SSR/(N-2)

rbind(SST,SSR,SSM,RMS)
##          [,1]
## SST 735.66667
## SSR 149.28940
## SSM 586.37727
## RMS  14.92894

These results can also be found in the ANOVA table of our model:

summary.aov(mod)
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## X            1  586.4   586.4   39.28 9.3e-05 ***
## Residuals   10  149.3    14.9                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
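
The F statistic shown in this table (and in summary(mod)) follows directly from these sums of squares: it is the model mean square divided by the residual mean square, with 1 and N-2 degrees of freedom. A quick check:

(SSM/1)/(SSR/(N-2))                 # manual F statistic, about 39.28
mod%>%summary()%>%.$fstatistic      # F statistic stored in the model summary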

LINK BETWEEN PEARSON’S r AND LINEAR REGRESSION MODEL

There are close relationships between the Pearson’s r coefficient and the linear regression model:

  1. First, the regression coefficient for our predictor X can be derived from the Pearson's r as follows:

\[ b_{1} =\frac{Cov(X,Y)}{Var(X)} = r_{X,Y}\,\frac{Sd(Y)}{Sd(X)} \]

\[ b_{0} = \overline{Y}- b_{1}\overline{X} \]

  2. The Pearson's r coefficient can also be obtained as the square root of the coefficient of determination (R-squared), taking the sign of b1 (positive here):

\[ r_{X,Y} = \sqrt{R^{2}} \]

We can easily verify this as follows:

cor_r=cor(X,Y)                                   # r from cor()

manual_r=cov(X,Y)/(sd(X)*sd(Y))                  # r computed by hand

sqrt_R2=mod%>%summary()%>%.$r.squared%>%sqrt()   # square root of R-squared

b1=cor(X,Y)*(sd(Y)/sd(X))                        # slope derived from r

b1_r=b1/(sd(Y)/sd(X))                            # r recovered from the slope

b0=mean(Y)-b1*mean(X)                            # intercept

rbind(cor_r,manual_r,sqrt_R2,b1_r)
##               [,1]
## cor_r    0.8927873
## manual_r 0.8927873
## sqrt_R2  0.8927873
## b1_r     0.8927873
rbind(b0,b1)
##          [,1]
## b0 11.4789756
## b1  0.6117124
mod%>%summary()%>%.$coefficients%>%.[,1]
## (Intercept)           X 
##  11.4789756   0.6117124

Discussion

  1. Both r and the regression coefficient b1 evaluate the association between X and Y. It is easy to see that r and b1 always have the same sign, so either one indicates the direction of the relationship between Y and X (positive or negative).

  2. When b1 is truly zero, so is r. Therefore the statistical significance of both r and b1 can be tested with the same t test. Indeed, the t value is identical for b1 and for r (t = 6.267), and so is the p-value.

cor.test(X,Y)
## 
##  Pearson's product-moment correlation
## 
## data:  X and Y
## t = 6.2672, df = 10, p-value = 9.295e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6539530 0.9697937
## sample estimates:
##       cor 
## 0.8927873
mod%>%summary()%>%.$coefficients
##               Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 11.4789756 4.09405157 2.803818 1.867185e-02
## X            0.6117124 0.09760525 6.267208 9.295117e-05

  3. Although the two null hypotheses appear equivalent, these statistics should be interpreted differently: the null hypothesis for b indicates a null effect, while the null hypothesis for r corresponds to a null correlation.

The regression coefficient carries more information and is easier to interpret than the Pearson's r. Therefore linear regression can be considered a better way to investigate a possible causal relationship between a response variable Y and a factor/covariate X. For instance, only b1 can be interpreted as "Y increases by b1 units for each one-unit increase in X", while the r value only measures the strength and direction of the correlation.
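
To make this interpretation concrete, we can compare predictions for two hypothetical patients whose heights differ by exactly 1 cm (the heights 100 and 101 below are illustrative values, not observations from the dataset); the difference between the two predictions equals b1:

newX=data.frame(X=c(100,101))   # two hypothetical heights, 1 cm apart
p=predict(mod,newdata=newX)
diff(p)                         # equals b1: about 0.61 cm of extra catheter length per cm of height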

  4. The regression coefficient b and the Pearson's r are only equivalent when both Y and X are standardised (as we can see, they are linked through Sd(X) and Sd(Y)):

cor.test(X,Y)
## 
##  Pearson's product-moment correlation
## 
## data:  X and Y
## t = 6.2672, df = 10, p-value = 9.295e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6539530 0.9697937
## sample estimates:
##       cor 
## 0.8927873
lm(scale(Y)~scale(X))%>%summary()%>%.$coefficients
##                 Estimate Std. Error      t value     Pr(>|t|)
## (Intercept) 4.275046e-16  0.1363891 3.134449e-15 1.000000e+00
## scale(X)    8.927873e-01  0.1424538 6.267208e+00 9.295117e-05

As we can see, the Pearson's r is now equal to the model's b1 value. Unlike b, the Pearson's r is not influenced by the scale or unit of measurement. Be careful when applying a transformation (Box-Cox, for example) to the response variable or the predictor, because b can then no longer be interpreted on the original scale.
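
A small sketch of this scale effect: re-expressing the height in metres (a hypothetical rescaling, X/100) leaves the Pearson's r unchanged but multiplies b1 by 100, so b1 must always be read in the units used to fit the model.

X_m=X/100                   # height re-expressed in metres
cor(X,Y); cor(X_m,Y)        # r is identical after the change of units
coef(lm(Y~X))["X"]          # slope per cm of height
coef(lm(Y~X_m))["X_m"]      # slope per metre, 100 times larger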

  5. It has been shown that the r value is more sensitive to restriction of the range of X. This can be seen in experiments where the value of X is controlled and manipulated by the study design and does not represent the full range of X in the real world (population). The regression coefficient is much less affected by such restriction; therefore a linear regression analysis is more appropriate than Pearson's r for experiments. Regression analysis is also better for assessing the potential effect of X on the estimation of Y, or for interpretation as a causal relationship (a small simulation sketch follows this point).
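
A simulated illustration of this point (purely hypothetical data, not the catheter dataset): restricting X to a narrow window shrinks the Pearson's r noticeably, while the regression slope stays close to its full-range value.

set.seed(123)
x_sim=runif(500,0,100)                # hypothetical predictor over its full range
y_sim=5+0.5*x_sim+rnorm(500,sd=10)    # linear signal plus noise

keep=x_sim>40 & x_sim<60              # restrict X to a narrow window
c(full=cor(x_sim,y_sim), restricted=cor(x_sim[keep],y_sim[keep]))
c(full=unname(coef(lm(y_sim~x_sim))[2]),
  restricted=unname(coef(lm(y_sim[keep]~x_sim[keep]))[2]))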

  6. When Y depends on more than one factor/covariate, these partial relationships are reflected in the Pearson's r (r is reduced by each partial effect), whereas the regression coefficient is not affected. The Pearson's r measures the relationship relative to the other cofactors or covariates, while the b coefficient measures the absolute size of the relationship, isolating the other factors.

Reference:

  1. Richard B. Darlington and Andrew F. Hayes. Regression Analysis and Linear Models. 2nd Ed. The Guilford Press (2017)

  2. Julian J Faraway. Linear models with R. 2nd Ed. CRC Press (2017)

See you next time, and thank you for joining us.