Case study 5.1: Linear models, Part 1: Algorithm, Estimation and Syntax

Foreword: The Elementary Statistics for Medical Students (ESMS) project

The ESMS project was initiated in February 2017 by a group of young Vietnamese statisticians. Our goal is to help Vietnamese medical students develop their skills in the R statistical programming language through problem-based tutorials.

Introduction

Have you ever heard that regression models are the bread and butter of scientific research? Linear models provide a scalable, robust and versatile tool that practitioners can adapt to real-world data with efficiency and flexibility. A quick search on Google Scholar or PubMed suggests that a large share of the articles published in medical journals rely on regression models.

Basically, linear regression is used to predict the value of a numeric quantity from input data (a predictive study). However, modelling can also help us make statistical inferences, evaluate the association between variables, assess a factor's role, or interpret a result in a causal way.

Most conventional methods involve elements of linear regression, including analysis of variance and covariance, Student's t-test, correlation analysis and repeated measures. Although these topics are usually introduced as separate tools, they all derive from the linear model. By learning linear models, we can better understand the mechanism behind these traditional methods. The linear model can also be considered a key that unlocks the door to other interesting domains, like Bayesian analysis or machine learning.
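For example, a two-sample Student's t-test is just a linear model with one binary predictor. Here is a minimal sketch using simulated data (the variables y and g are hypothetical, not from the heart study used below):

set.seed(123)
g = rep(c("A", "B"), each = 10)
y = rnorm(20, mean = ifelse(g == "A", 10, 12))
t.test(y ~ g, var.equal = TRUE)$p.value             # classical two-sample t-test
summary(lm(y ~ g))$coefficients["gB", "Pr(>|t|)"]   # identical p-value from lm()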

Unfortunately, many physicians find regression analysis too difficult for their daily work, due to a lack of explanation and practice. After carefully considering the difficulties students face when practising linear models, we decided that the core of our project would be devoted to introducing the different aspects of linear models. This complex topic will be explored in 10 chapters, each focusing on a specific subject, including: estimation, statistical inference, prediction, performance evaluation, model diagnostics, resampling, specific distributions, link functions and data transformation.

“The journey of a thousand miles begins with one step.” - Lao Tzu

The present tutorial is the first step of our long journey to discover linear models. We begin with a study question, then introduce the syntax of R's lm() function, and finally we will see how the least squares algorithm works.

Study problem

In this chapter, we will use a simple study. The original dataset was provided by Weisberg (1980) and Chambers et al. (1983). The study question is: how can we estimate the length of a heart catheter?

Cardiac catheterisation consists of introducing a catheter into a major vein or artery at the femoral region and advancing it into the heart. The proper length of the catheter has to be estimated by the physician. Four doctors each suggested a solution to this problem:

Doctor A proposed using the patient's height to predict the catheter's length.

Doctor B would like to predict the catheter's length from the patient's weight.

Doctor C suggested that the prediction could be more accurate if both height and weight were combined as predictors.

Doctor D does not like prediction, so he attempts to measure the catheter's length in several patients in order to determine the mean of this variable.

To verify these hypotheses, the four doctors recorded the true catheter length, height and weight of 12 patients who underwent heart catheterisation.

Who made the right decision?

Data preparation and exploration

library(tidyverse)

# Load the dataset from the robustbase package
data(heart,package="robustbase")

heart=heart%>%as_tibble()
heart$clength=as.numeric(heart$clength)

# Derive BMI; the factor 0.0254 converts height from inches to metres
heart$BMI=heart$weight/(0.0254*heart$height)^2

psych::describe(heart)
##         vars  n  mean    sd median trimmed   mad   min   max range  skew
## height     1 12 40.36 11.94  39.00   39.83  7.41 22.50 63.50 41.00  0.35
## weight     2 12 38.12 26.06  34.25   35.55 22.61  8.50 93.50 85.00  0.83
## clength    3 12 36.17  8.18  36.50   36.40  6.67 20.00 50.00 30.00 -0.12
## BMI        4 12 32.00  6.34  33.06   32.71  5.67 17.78 39.13 21.35 -0.74
##         kurtosis   se
## height     -0.66 3.45
## weight     -0.48 7.52
## clength    -0.64 2.36
## BMI        -0.45 1.83
shapiro.test(heart$clength)
## 
##  Shapiro-Wilk normality test
## 
## data:  heart$clength
## W = 0.97152, p-value = 0.926

The dataset can be loaded from the robustbase package. We performed a short descriptive analysis. The Shapiro-Wilk test (p = 0.926) gives no evidence against the normality of our response variable (clength).

What is a model?

Our goal is to predict the catheter's length from one or more predictors, such as weight, height or BMI.

The simplest solution consists of approximating clength by a constant, fixed value, as suggested by Doctor D. This constant could be the mean of clength in the 12 patients. This approximation does not take the random variation of clength into account, so it could be quite wrong for some individuals. It can be considered a null model, i.e. predicting with no predictors:

Model D (null model): clength ~ const (this model only includes an intercept)

The suggestions by Doctors A and B can be translated into two simple linear models:

Model A: clength ~ function(Height) + error
Model B: clength ~ function(Weight) + error

The predictive model suggested by Doctor C could be expressed as:

Model C: clength ~ function(Height, Weight) + error

All four doctors have the same goal: they must find a function F such that F(X) approximates the value of clength, whatever the variables in X may be. For now, this function is unknown. The error term represents the prediction error of each model.

Since F could in principle be any function, describing it correctly is very difficult. However, model C can also be presented in a restricted form:

clength ~ B0 + B1*Height + B2*Weight + error

where B0, B1 and B2 are regression coefficients, and B0 is called the intercept.

By presenting our model in this form, we have converted the problem of determining an arbitrary continuous function F into the simpler problem of estimating the values of B (the beta parameters). Our goal is now to determine the beta coefficients that bring the predicted values as close to the true values as possible.

The beta coefficient of a specific predictor Xi describes a linear association between Xi and the predicted clength: the predicted clength changes by Beta_i units for each one-unit change in Xi, while holding all other predictors constant.
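As a quick check of this interpretation (a sketch; the model with height and weight is fitted formally later in this chapter, here it is refitted inline):

m = lm(clength ~ height + weight, data = heart)
new1 = data.frame(height = 40, weight = 30)
new2 = data.frame(height = 41, weight = 30)  # one more unit of height, weight held fixed
predict(m, new2) - predict(m, new1)          # equals coef(m)["height"], about 0.21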

The term “linear model” refers neither to the linearity of the covariates (predictors) in the model, nor to the linear shape of the regression curve. It simply means that the beta coefficients enter the model linearly (each beta directly multiplies a covariate).
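To see that "linear" refers to the betas, here is a sketch of a quadratic model: the fitted curve is not a straight line, yet this is still a linear model because each beta simply multiplies a (transformed) covariate:

mQuad = lm(clength ~ height + I(height^2), data = heart)
coef(mQuad)  # B0, B1 and B2 still enter the model linearly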

How to build a model?

A model's content can be inspired by a physical law; for example, cardiac output can be modelled by Fick's principle. The measurement scale of the response variable might also suggest what kind of variable could be used as a predictor; in our experiment, height is a natural candidate for predicting catheter length, since both are measured as lengths and a subject's height should be proportional to the length of his or her veins.

We can also build a model by adopting or extending a prior model. This approach is frequently used in lung physiology, where predictive models for lung volumes usually include age, height and gender as the main predictors. Another reason for adopting those parameters is that they can easily be measured in any subject without specific instruments.

In the worst case, when we have a complex data structure and no prior hypothesis about the model's content, different methods can be applied to select the best-fitting model based on performance (prediction accuracy).

For interpretive purposes, the goal is no longer to predict the response variable; a multivariable model is instead used as a map to evaluate the patterns linking the response and the explanatory variables.

How does the least squares algorithm work?

Let's return to our dataset: this dataframe is presented as a matrix of C columns and N rows, where each column represents a variable and each row represents an observation (also called a case or instance).

heart%>%knitr::kable()
height  weight  clength       BMI
  42.8    40.0       37  33.84582
  63.5    93.5       50  35.94154
  37.5    35.5       34  39.12897
  39.5    30.0       36  29.80298
  45.5    52.0       43  38.93257
  38.5    17.0       28  17.77706
  43.0    38.5       37  32.27427
  22.5     8.5       20  26.02474
  37.0    33.0       34  37.36311
  23.5     9.5       30  26.66370
  33.0    21.0       38  29.88987
  58.0    79.0       47  36.40019

clength is the column that will be treated as the response variable (we recommend avoiding the term "dependent variable", so that students can more easily adapt their knowledge to machine learning methods later, as sooner or later they will go there). The remaining columns represent the potential predictors (not all of them have to be used for modelling).

Given an observed case (where i indexes the individual observations), the predicted value of clength(i) can be estimated by:

clength(i) = B0 + B1*Height(i) + B2*Weight(i) + error(i)

This estimation can be generalised using matrices and vectors as follows:

y = XB + E

where y is the vector of the response variable, X is the matrix of predictors, XB is the component that explains MOST of y's variance, and E is a random error term corresponding to the residual part of y's variance that XB fails to explain.

# This is the response
Y=heart$clength
head(Y)
## [1] 37 50 34 36 43 28

# Y is a vector
is.vector(Y)
## [1] TRUE

# This is the matrix of predictors; model.matrix() adds the intercept column
X=model.matrix(~height+weight,data=heart)
is.matrix(X)
## [1] TRUE
head(X)
##   (Intercept) height weight
## 1           1   42.8   40.0
## 2           1   63.5   93.5
## 3           1   37.5   35.5
## 4           1   39.5   30.0
## 5           1   45.5   52.0
## 6           1   38.5   17.0

Both Y and the rows of X live in an N-dimensional space, whilst B lies in a space of P dimensions, where P is the number of estimated coefficients (note that B also includes the intercept B0, so P equals the number of predictors plus 1; for example, if our model includes only Height, P will be 2).
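A quick check of these dimensions, using the X and Y objects created above:

dim(X)     # 12 rows (N) and 3 columns (P: intercept, height, weight)
length(Y)  # 12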

Our problem is now to determine the optimal value of B so that XB is as close to Y as possible. This optimal value is called B_hat, the vector of estimated regression coefficients.

The fitted (predicted) values are called Y_hat:

Y_hat = X*B_hat

and the residual error is E_hat = Y - Y_hat.
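We can verify these relations numerically (a sketch; lm() is used here only as a cross-check, its mechanics are explained below):

m = lm(clength ~ height + weight, data = heart)
B_hat = coef(m)
Y_hat = as.vector(X %*% B_hat)          # fitted values
E_hat = Y - Y_hat                       # residuals
all.equal(Y_hat, unname(fitted(m)))     # TRUE
all.equal(E_hat, unname(residuals(m)))  # TRUE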

We could say that the linear regression algorithm reduces the dimensionality of the data structure, from N dimensions to P dimensions, by determining an optimal value of B so that most of the variation of Y (which originally lives in an N-dimensional space) is captured in the smaller P-dimensional space.

The remaining random variation (the residual error E) is the part of Y's variation that falls outside the fitted space; it can be thought of as living in the complementary space of (N - P) dimensions. (N - P) is called the model's residual degrees of freedom.
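For our full model this gives (a sketch, refitting the model inline):

m = lm(clength ~ height + weight, data = heart)
df.residual(m)                 # N - P = 12 - 3 = 9
nrow(heart) - length(coef(m))  # the same value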

Below is a density curve of our response variable (clength). The observed values range from 20 to 50 cm, and the curve suggests an approximately Gaussian distribution. We use an interactive graph, so you can verify this by moving the cursor along the X axis.

library(plotly)

p1=heart %>%
  ggplot(aes(x = clength)) +
  geom_histogram(aes(y = ..density..), alpha = 0.6, fill = "yellow", color = "black") +
  geom_density(fill = "purple", color = "black", alpha = 0.6) +
  theme_bw()

p1 %>% ggplotly()

In our dataset, the data live in a 12-dimensional space (one dimension per observation).

A simple linear model, clength ~ B0 + B1*Height, reduces the predicted data to a 2-dimensional space (intercept and height), as demonstrated in this figure:

Although we only have 12 observed values (purple points), the model's regression line gives the impression of a continuous variation of clength as a function of height. Again, you can check this with the cursor.

p2=heart %>%
  ggplot(aes(x = height, y = clength)) +
  geom_point(shape = 21, fill = "purple", color = "black", alpha = 0.6, size = 3) +
  geom_smooth(method = "lm", color = "red", alpha = 0.6, se = FALSE, linetype = "longdash") +
  theme_bw()

p2 %>% ggplotly()

In a similar way, the model clength ~ B0 + B1*Weight reduces the dimensionality of the data to 2:

p3=heart %>%
  ggplot(aes(x = weight, y = clength)) +
  geom_point(shape = 21, fill = "gold", color = "black", alpha = 0.6, size = 3) +
  geom_smooth(method = "lm", color = "red", alpha = 0.6, se = FALSE, linetype = "longdash") +
  theme_bw()

p3 %>% ggplotly()

The model with BMI does the same thing; however, BMI does not seem to be a good predictor, as the points are scattered widely around the regression line.

p4=heart %>%
  ggplot(aes(x = BMI, y = clength)) +
  geom_point(shape = 21, fill = "red", color = "black", alpha = 0.6, size = 3) +
  geom_smooth(method = "lm", color = "red", alpha = 0.6, se = FALSE, linetype = "longdash") +
  theme_bw()

p4 %>% ggplotly()

Finally, the model that includes both height and weight as predictors reduces the distribution of the data to a 3-dimensional space, as follows:

You can rotate the 3D scatter plot to detect a linear pattern in the points.

plot_ly(data=heart, x = ~height, y = ~weight, z = ~clength,sizes=1)%>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'Height'),
                      yaxis = list(title = 'Weight'),
                      zaxis = list(title = 'Cath.Length')))

As mentioned above, our goal is now to determine the B values. To explain how the least squares algorithm does this, we borrow some code from Faraway's book, an excellent monograph on linear regression, available via the author's website:

http://people.bath.ac.uk/jjf23/LMR/index.html

The B values can be estimated through matrix calculations. The least squares estimate B_hat is the solution of the normal equations:

X^T X B_hat = X^T y, hence B_hat = (X^T X)^-1 X^T y

The hat matrix H = X (X^T X)^-1 X^T maps the observed y onto the fitted values (Y_hat = H y), and this choice of B_hat makes the residual sum of squares RSS = y^T (I - H) y as small as possible.
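We can translate these formulas directly into R (a sketch using the X and Y objects created earlier):

H = X %*% solve(t(X) %*% X) %*% t(X)  # hat matrix
Y_hat = H %*% Y                       # fitted values
RSS = as.numeric(t(Y) %*% (diag(nrow(X)) - H) %*% Y)
RSS                                   # the minimised residual sum of squares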

# Solution 1: invert the cross-product matrix: B_hat = (X'X)^-1 X'y
XtX_inv=solve(t(X)%*%X)
XtX_inv %*% t(X) %*% Y
##                   [,1]
## (Intercept) 20.3757645
## height       0.2107473
## weight       0.1910949

# Solution 2: solve the normal equations directly with crossprod()
solve(crossprod(X,X),crossprod(X,Y))
##                   [,1]
## (Intercept) 20.3757645
## height       0.2107473
## weight       0.1910949

Both solutions provide the same values of the B coefficients, including the intercept. However, the same job is done more easily by calling the lm() function directly:

modFull=heart%>%lm(data=.,clength~height+weight)

modFull$coefficients
## (Intercept)      height      weight 
##  20.3757645   0.2107473   0.1910949

We can see that the output of the lm() function is exactly the same as the result obtained by manual estimation.

lm() function

Linear regression models are fitted in R with the lm() function. It requires two main arguments, a formula and a dataframe; a generic call looks like this (Y, X1, X2, D and M are placeholder names):

M = lm(Y ~ X1 + X2, data = D)

This command can be read as: fit a linear model that predicts the value of Y as a function of the predictors X1 and X2 taken from the dataframe D, then store the model's output in the object M.

lm() is based on the least squares algorithm: it solves the normal equations to find the beta values that minimise the residual sum of squares (RSS).
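As a quick illustration (a sketch), the deviance of a Gaussian lm() fit is exactly this minimised RSS; modFull is the full model fitted above:

deviance(modFull)          # the minimised RSS
sum(residuals(modFull)^2)  # the same value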

Formula syntax

(Y ~ 1) indicates a null model (Doctor D's model); it contains only an intercept:

mod0=heart%>%lm(data=.,clength~1)
mod0%>%summary()
## 
## Call:
## lm(formula = clength ~ 1, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.1667  -3.1667   0.3333   3.0833  13.8333 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   36.167      2.361   15.32 9.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.178 on 11 degrees of freedom
psych::describe(heart$clength)
##    vars  n  mean   sd median trimmed  mad min max range  skew kurtosis
## X1    1 12 36.17 8.18   36.5    36.4 6.67  20  50    30 -0.12    -0.64
##      se
## X1 2.36

We can see that the intercept and its standard error in the null model are, respectively, the mean and the standard error of the mean of clength in the 12 patients. So there is effectively no model: we simply use the mean of clength as a fixed predicted value.
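A one-line check (sketch):

mean(heart$clength)                  # 36.167, the intercept estimate
sd(heart$clength)/sqrt(nrow(heart))  # 2.361, its standard error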

Y ~ X indicates a simple (univariate) model, i.e. a model that includes only one predictor:

modB=heart%>%lm(data=.,clength~BMI)
modB%>%summary()
## 
## Call:
## lm(formula = clength ~ BMI, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2075  -2.9561   0.8476   3.5984  10.5672 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   9.6216    10.1831   0.945   0.3670  
## BMI           0.8294     0.3126   2.653   0.0242 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.571 on 10 degrees of freedom
## Multiple R-squared:  0.4131, Adjusted R-squared:  0.3544 
## F-statistic:  7.04 on 1 and 10 DF,  p-value: 0.02418
modH=heart%>%lm(data=.,clength~height)
modH%>%summary()
## 
## Call:
## lm(formula = clength ~ height, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.0299 -0.6908 -0.2175  1.1908  6.3345 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.47898    4.09405   2.804   0.0187 *  
## height       0.61171    0.09761   6.267  9.3e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.864 on 10 degrees of freedom
## Multiple R-squared:  0.7971, Adjusted R-squared:  0.7768 
## F-statistic: 39.28 on 1 and 10 DF,  p-value: 9.295e-05
modW=heart%>%lm(data=.,clength~weight)
modW%>%summary()
## 
## Call:
## lm(formula = clength ~ weight, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7570 -1.5376 -0.2054  2.0043  6.6946 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 25.34409    1.92834  13.143 1.24e-07 ***
## weight       0.28387    0.04232   6.707 5.32e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.658 on 10 degrees of freedom
## Multiple R-squared:  0.8181, Adjusted R-squared:    0.8 
## F-statistic: 44.99 on 1 and 10 DF,  p-value: 5.317e-05

(Y ~ X1 * X2) indicates a model that includes the main effects of X1 and X2 together with their interaction; it expands to X1 + X2 + X1:X2, as the sketch after the output below shows:

modInt=heart%>%lm(data=.,clength~height*weight)
modInt%>%summary()
## 
## Call:
## lm(formula = clength ~ height * weight, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4645 -1.8159 -0.4652  1.3364  6.7443 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   18.567582   8.410597   2.208   0.0583 .
## height         0.147821   0.344757   0.429   0.6794  
## weight         0.523401   0.332306   1.575   0.1539  
## height:weight -0.004591   0.004054  -1.132   0.2903  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.72 on 8 degrees of freedom
## Multiple R-squared:  0.8495, Adjusted R-squared:  0.793 
## F-statistic: 15.05 on 3 and 8 DF,  p-value: 0.001185
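A quick way to see this expansion is to inspect the design matrix (sketch):

colnames(model.matrix(~height*weight, data = heart))
# "(Intercept)" "height" "weight" "height:weight"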

Finally, a model without an intercept can be built with the syntax:

(Y ~ X1 + X2 - 1)

For comparison, we first refit the full model with its intercept, then fit the same model without one:

modFull=heart%>%lm(data=.,clength~height+weight)
modFull%>%summary()
## 
## Call:
## lm(formula = clength ~ height + weight, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7419 -1.2034 -0.2595  1.8892  6.6566 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  20.3758     8.3859   2.430    0.038 *
## height        0.2107     0.3455   0.610    0.557  
## weight        0.1911     0.1583   1.207    0.258  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.778 on 9 degrees of freedom
## Multiple R-squared:  0.8254, Adjusted R-squared:  0.7865 
## F-statistic: 21.27 on 2 and 9 DF,  p-value: 0.0003888
modH2=heart%>%lm(data=.,clength~height+weight-1)
modH2%>%summary()
## 
## Call:
## lm(formula = clength ~ height + weight - 1, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0162 -1.4512 -0.5302  1.6387  7.2770 
## 
## Coefficients:
##        Estimate Std. Error t value Pr(>|t|)    
## height  1.02629    0.10020  10.243 1.28e-06 ***
## weight -0.14681    0.09224  -1.592    0.143    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.613 on 10 degrees of freedom
## Multiple R-squared:  0.9871, Adjusted R-squared:  0.9845 
## F-statistic: 381.2 on 2 and 10 DF,  p-value: 3.639e-10

This is the end of this chapter. We have built every possible model corresponding to the hypotheses of the four doctors. Only one of them can be considered the best answer to our study question. How do we make that decision? The answer will be provided in the next chapter. By the end of our journey, you will have discovered everything about the linear model.

See you next time, and thank you for joining us!