BIO2POS Lecture Topic 5B

class: middle
background-image: url(data:image/png;base64,#LTU_logo_clear.jpg)
background-position: top left
background-size: 25%

# BIO2POS 
# Multiple Linear Regression and Regression Assumptions
## Data Analysis Topic 5B
### La Trobe University

---

# Welcome!

### In this lecture we will extend our coverage of linear regression modelling to multiple linear regression, and discuss the assumptions we make when using linear regression.

Over the following slides, we will cover:

* .orangered_style[Multiple Linear Regression]
    
--

* Construction of the model
    
--

* Interpretation of the model
    
--

* .orangered_style[Linear Regression Assumptions]
  
--

* Linearity
    
--

* Constant Variance of Residuals
    
--

* Normality of Residuals

---

# Intended Learning Objectives

### By the end of this lecture you will:

* understand the concepts involved in fitting a .orangered_style[Multiple Linear Regression]
  
--

* be able to correctly .seagreen_style[interpret] and .seagreen_style[summarise] the results of a Multiple Linear Regression
  
--

* understand and be able to check the .orangered_style[model assumptions] for Simple and Multiple Linear Regressions
  
--

The content you learn in Topics 5A and 5B will be beneficial for your analyses of numeric data.

We will practise content from this topic in this week's DA computer lab, which also includes some additional material if you would like to extend your knowledge.

---

# Associations between Numeric Variables

In [Topic 5A](https://rpubs.com/LTU_BIO2POS/DA5A) we introduced .orangered_style[Simple Linear Regression], which we can use to model the .seagreen_style[linear relationship] between two numeric variables.

While this may be sufficient in some scenarios, often the phenomenon we are trying to model may be complex, with multiple variables contributing to the results we observe.

* As part of our study, we may have collected data on more than two variables

We can treat our Simple Linear Regression model as a starting point, and add additional independent variables to develop a .orangered_style[Multiple Linear Regression].

---

# Multiple Linear Regression

.orangered_style[Multiple Linear Regression] is a generalisation of Simple Linear Regression.

* We are simply adding additional independent variables to our model, in the expectation that this will improve the fit of the model
  
--

Suppose we have observations of a dependent variable and `$p$` independent (aka predictor) variables `$(p \geq 2)$`, for `$n$` individuals.

* E.g. for our .seagreen_style[Chinstrap penguin example], we have `$n = 68$` observations of the dependent variable `body mass`, and the `$p=3$` independent variables 
  
    * `flipper length`, 
    
    * `bill depth` and 
    
    * `bill length`

---

# Checking Correlation

Before we construct our .orangered_style[Multiple Linear Regression], we may like to check the .orangered_style[Pearson's Correlation] between the dependent variable and each of the independent variables being considered.

In our .seagreen_style[Chinstrap penguin example] data set, the independent variables all appear to be statistically significantly positively correlated with body mass, with `$p < .001$` (which is not really surprising).
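
As a rough illustration of this check, the sketch below computes Pearson's correlation coefficient between a dependent variable and each candidate independent variable. The measurements are hypothetical values made up for this example (they are not the Chinstrap data), and the helper `pearson_r` is our own. It returns only the coefficient, not the `$p$`-value that jamovi also reports.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical measurements for six penguins (NOT the real data set).
body_mass = [3500, 3650, 3800, 3700, 4050, 4100]
predictors = {
    "flipper length": [190, 193, 196, 195, 201, 203],
    "bill length": [46.5, 47.0, 49.2, 48.1, 50.3, 51.0],
    "bill depth": [17.9, 18.0, 18.6, 18.2, 19.0, 19.4],
}

# Correlation of each candidate independent variable with body mass.
for name, values in predictors.items():
    print(f"{name}: r = {pearson_r(values, body_mass):.3f}")
```
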

---

# Multiple Linear Regression

The Multiple Linear Regression Model is:

`$$\quad Y = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \ldots + \beta_p x_{p} + \epsilon$$`

When fitting a multiple linear regression, there will be a trade-off between accuracy and complexity.

* As we add independent variables, the model becomes more complex
  
--

* If this complexity is not matched by an improvement in the fit of the model, should we include the additional independent variable(s)?
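
To make the model above more concrete, here is a minimal sketch of how the `$\beta$` estimates could be computed, assuming ordinary least squares via the normal equations (the slides rely on jamovi and do not show this computation). The function names and the tiny data set are hypothetical. The data are generated exactly from `$y = 1 + 2x_1 + 3x_2$` with no error term, so the fit should recover those coefficients.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], so row operations apply to both sides.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_mlr(x_rows, y):
    """Least-squares estimates [b0, b1, ..., bp] from the normal equations."""
    design = [[1.0] + list(row) for row in x_rows]  # intercept column first
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data: two independent variables, six individuals.
x_rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y_values = [1 + 2 * x1 + 3 * x2 for x1, x2 in x_rows]

coefficients = fit_mlr(x_rows, y_values)
print([round(c, 6) for c in coefficients])
```

With real data the error term means the estimates will not match the true coefficients exactly, and statistical software uses more numerically stable methods than this sketch.
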

---

# `$R^2$` and Adjusted `$R^2$`

Recall that we introduced the `$R^2$` value in [Topic 5A](https://rpubs.com/LTU_BIO2POS/DA5A). `$R^2$` measures the proportion of variation in the dependent variable that is explained by the independent variables (often expressed as a percentage).

As we add more independent variables to our model, the `$R^2$` value will never decrease, and will almost always increase.

* Sometimes this increase is artificial rather than meaningful

For .orangered_style[Multiple Linear Regression] models, we should refer to the **adjusted** `$R^2$` rather than to `$R^2$`.

* We can use the same guidelines for interpreting `$R^2$` and adjusted `$R^2$` values
  
--

* The adjusted `$R^2$` equation includes a penalty for the number of independent variables, so its value only increases when a new variable improves the model fit by more than we would expect by chance
  
--

* Often the `$R^2$` and adjusted `$R^2$` values are similar, but sometimes they can be quite different!
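
The standard adjustment is `$1 - (1 - R^2)\frac{n-1}{n-p-1}$`, where `$n$` is the number of observations and `$p$` the number of independent variables. The sketch below applies it to the simple linear regression from Topic 5A (`$R^2 = 0.412$`, `$n = 68$`, `$p = 1$`). The second call uses a hypothetical scenario in which a second variable is added but `$R^2$` does not change at all, to show the penalty at work. The function name is our own.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R^2 for a model with p independent variables and n observations."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than estimated coefficients")
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)


# Simple linear regression from Topic 5A: R^2 = 0.412, n = 68, p = 1.
slr_adjusted = adjusted_r_squared(0.412, n=68, p=1)

# Hypothetical: a second variable is added but R^2 stays at 0.412.
useless_variable_adjusted = adjusted_r_squared(0.412, n=68, p=2)

print(f"p = 1: adjusted R^2 = {slr_adjusted:.3f}")
print(f"p = 2, same R^2: adjusted R^2 = {useless_variable_adjusted:.3f}")
```

When `$R^2$` does not improve, the adjusted value drops as `$p$` grows, which is why it is the safer summary for comparing models with different numbers of variables.
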

---

# Multiple Linear Regression - Penguin Example

In [Topic 5A](https://rpubs.com/LTU_BIO2POS/DA5A) we fitted a simple linear regression model to .seagreen_style[Chinstrap penguin] data from Horst et al. (2020).

* We modelled `body mass` `$(y)$` against `flipper length` `$(x)$`
  
--

Our model had a moderate fit, with `$R^2 = 0.412$`, and showed that flipper length was a .orangered_style[significant predictor] of body mass.

Now, using the .orangered_style[Multiple Linear Regression] framework, we would like to extend this model, and also include the numeric variables:

* `bill length` (mm) and
  
* `bill depth` (mm)

We will add these one at a time, to highlight changes in the model.
  
---

# Multiple Linear Regression - Penguin Example

As a refresher, this was our original simple linear regression model output:

Our estimated model was

`$$\widehat{\text{body mass}} =  -3037.196 + 34.573 \times \text{flipper length}$$`

---

# Multiple Linear Regression - Penguin Example

If we add the independent variable `bill length`, we obtain:

Our estimated model becomes

`$$\widehat{\text{body mass}} =  -3211.947 + 27.675 \times \text{flipper length} + 31.243 \times \text{bill length}$$`

---

# Multiple Linear Regression - Penguin Example

If we add both `bill length` and `bill depth`, we obtain:

Our estimated model becomes

.longequation_style[
`$$\widehat{\text{body mass}} =  -3157.530 + 22.580 \times \text{flipper length} + 16.039 \times \text{bill length} + 91.513 \times \text{bill depth}$$`
]

---

# Multiple Linear Regression - Penguin Example

Let us discuss these results.

The fit of the model appears to have improved:

* `$R^2 = 0.412$` (flipper length only), adjusted `$R^2 = 0.453$` (adding bill length), adjusted `$R^2 = 0.481$` (adding bill length and bill depth)

The estimates `$\hat{\beta}_0, \ldots , \hat{\beta}_p$` have changed with each model.

* This is normal - as we add information to our model, the contribution of each independent variable to the value of the dependent variable will be considered in the context of the other independent variables included

* Remember, the `$\hat{\beta}$` terms are estimates, based on the data provided

---

# Multiple Linear Regression - Penguin Example

You may have noticed that initially, the inclusion of `bill length` seemed helpful:

* The `bill length` coefficient `$\hat{\beta}_2 = 31.243$`, with `$p = 0.010 < 0.05$` 
 
 * This suggests that bill length is a significant predictor of body mass in this model
 
--

However, when we also added `bill depth`, something strange happened - the `bill length` coefficient halved `$(\hat{\beta}_2 = 16.039)$` and became statistically non-significant `$(p = 0.241 > 0.05)$`!

When we add multiple independent variables to our model, and assess them together, they can .seagreen_style[influence each other]. Independent variables may be highly correlated, which raises the issue of .orangered_style[Multicollinearity] (linear dependence).
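
One common way to quantify multicollinearity is the variance inflation factor (VIF). It is not covered in these slides, so treat the sketch below as optional background. When a model has exactly two independent variables with correlation `$r$`, the VIF of each is `$1 / (1 - r^2)$`. With more than two, each variable would need to be regressed on all of the others. The correlation values used here are hypothetical.

```python
def vif_two_predictors(r):
    """Variance inflation factor when the model has exactly two predictors.

    r is the Pearson correlation between the two independent variables.
    """
    if abs(r) >= 1:
        raise ValueError("perfectly correlated predictors give an infinite VIF")
    return 1 / (1 - r ** 2)


# Hypothetical correlations between two independent variables.
for r in (0.0, 0.5, 0.9):
    print(f"r = {r}: VIF = {vif_two_predictors(r):.2f}")
```

A VIF of 1 means the variables are uncorrelated. Larger values mean the standard errors of the `$\hat{\beta}$` estimates are inflated, which is one reason a coefficient can lose significance when a correlated variable is added.
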

---

# Deciding which variables to keep

There are numerous model selection techniques (e.g. *Stepwise Regression*) and statistical criteria (e.g. .orangered_style[AIC - Akaike's Information Criterion]) which we could use to help determine which variables to include in our model.

Each technique and criterion has pros and cons, and an in-depth coverage of the options available is beyond the scope of .seagreen_style[BIO2POS].

* We could easily spend an entire semester on model selection

As a general guide, I suggest the following:

* If in doubt, use the full model, with all independent variables included
  
    * Report the `$p$`-values for all `$\beta$` estimates, and make a note of which are statistically significant/non-significant
  
--

* Removing variables from your model can have unintended consequences
 
 * E.g. invalidating subsequent statistical inference regarding `$\beta$` estimates

---

# Model Selection with AIC

We can perform limited model selection in jamovi, using the .navy_style[Model Builder] tool.

* Smaller .orangered_style[AIC] values indicate better model performance - we want a trade-off between model accuracy and model complexity
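
To illustrate the trade-off, the sketch below uses a simplified least-squares form of AIC, `$n \ln(\text{RSS}/n) + 2k$`, where RSS is the residual sum of squares and `$k$` the number of estimated coefficients. This is an assumption made for illustration: the values jamovi reports are computed from the full likelihood, so they will differ from these numbers, and only differences between models fitted to the same data are meaningful. The RSS values are hypothetical.

```python
import math


def aic_least_squares(rss, n, k):
    """Simplified AIC for a least-squares fit (additive constants dropped)."""
    return n * math.log(rss / n) + 2 * k


n = 68  # number of observations

# Hypothetical residual sums of squares for three competing models.
aic_a = aic_least_squares(rss=6_000_000, n=n, k=2)  # one independent variable
aic_b = aic_least_squares(rss=5_900_000, n=n, k=3)  # small improvement in fit
aic_c = aic_least_squares(rss=5_000_000, n=n, k=3)  # large improvement in fit

print(f"A: {aic_a:.2f}  B: {aic_b:.2f}  C: {aic_c:.2f}")
```

Model B fits slightly better than model A but has a larger AIC, because the small gain in accuracy does not offset the penalty for the extra coefficient. Model C improves the fit enough to earn a smaller AIC.
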
  
---

# Interpreting output - Penguin Example

When interpreting the `$\beta$` value for an independent variable in our fitted model, we treat the other independent variables as being .orangered_style[fixed].

* Fixing a variable is often referred to as *controlling for* or *adjusting for* that variable

For example, from our estimated model:

.longequation_style[
`$$\widehat{\text{body mass}} =  -3157.530 + 22.580 \times \text{flipper length} + 16.039 \times \text{bill length} + 91.513 \times \text{bill depth}$$`
]

we estimate that, on average, each one unit (mm) increase in bill depth is associated with a 91.513 gram increase in the body mass of .seagreen_style[Chinstrap penguins], *controlling for* flipper length and bill length.

* *Note that if we were concerned about strong correlations between independent variables, we could e.g. create a weighted group effect term, based on the relevant independent variables, but this is beyond the scope of .seagreen_style[BIO2POS]*
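
The idea of holding the other variables fixed can be checked numerically. The sketch below uses the fitted coefficients from the estimated model above and compares two predictions that differ only by 1 mm of bill depth. The penguin measurements are hypothetical, and the function name is our own.

```python
# Coefficients from the fitted model on this slide.
COEFFICIENTS = {
    "intercept": -3157.530,
    "flipper_length": 22.580,
    "bill_length": 16.039,
    "bill_depth": 91.513,
}


def predict_body_mass(flipper_length, bill_length, bill_depth):
    """Estimated body mass (grams) from measurements in mm."""
    return (
        COEFFICIENTS["intercept"]
        + COEFFICIENTS["flipper_length"] * flipper_length
        + COEFFICIENTS["bill_length"] * bill_length
        + COEFFICIENTS["bill_depth"] * bill_depth
    )


# Hypothetical penguin; only bill depth changes between the two predictions.
baseline = predict_body_mass(flipper_length=196, bill_length=48.8, bill_depth=18.4)
deeper_bill = predict_body_mass(flipper_length=196, bill_length=48.8, bill_depth=19.4)

print(f"difference = {deeper_bill - baseline:.3f} grams")
```

Whatever values are chosen for flipper length and bill length, the difference between the two predictions equals the bill depth coefficient, provided those values are the same in both predictions.
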
  
---

# MLR Summary - Penguin Example

.smidgesmaller_style[
A .orangered_style[multiple linear regression] model was fitted to data on `$n=68$` .seagreen_style[Chinstrap penguins] from Dream Island in the Palmer Archipelago in Antarctica.

Penguin body mass (grams) was regressed against flipper length (mm), bill length (mm) and bill depth (mm).
]
--

.smidgesmaller_style[
The fitted MLR model was:
]

.longequation_style[
`$$\widehat{\text{body mass}} =  -3157.530 + 22.580 \times \text{flipper length} + 16.039 \times \text{bill length} + 91.513 \times \text{bill depth}$$`
]

.smidgesmaller_style[
Flipper length was a .orangered_style[significant predictor] of body mass `$(\hat{\beta}_1 = 22.580, p<.001)$`. 
We estimate that on average, a 1 mm increase in flipper length is associated with a 22.580 gram increase in Chinstrap penguin body mass, *controlling for* bill length and depth.]

.smidgesmaller_style[
Bill depth was also a .orangered_style[significant predictor] of body mass `$(\hat{\beta}_3 = 91.513, p = 0.038)$`. We estimate that on average, a 1 mm increase in bill depth is associated with a 91.513 gram increase in Chinstrap penguin body mass, *controlling for* flipper and bill length.]

.smidgesmaller_style[Bill length was not a significant predictor of body mass `$(\hat{\beta}_2 = 16.039, p = 0.241)$`.
]

---

# MLR Summary - Penguin Example

We could also consider reporting the confidence intervals calculated for the different `$\beta$` estimates. E.g.:

For each 1 mm increase in flipper length, we estimate that on average the body mass of a .seagreen_style[Chinstrap penguin] increases by between 10.809 grams and 34.351 grams, *controlling for* bill length and depth.

* Note that the range of this confidence interval covers only positive values
  
--

For the bill length variable, which was found to not be a significant predictor of body mass, we note that the `$95\%$` confidence interval for `$\hat{\beta}_2$` is `$(-11.015, 43.093)$`.

* Note that this confidence interval contains 0, suggesting that changes in bill length may not be associated with changes in body mass
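
As a small sanity check on the two intervals quoted on this slide, the sketch below tests whether each interval excludes 0, and confirms that each interval is centred on its `$\hat{\beta}$` estimate (as we would expect for a symmetric interval of the form estimate ± margin). The helper name is our own.

```python
def excludes_zero(lower, upper):
    """True when a confidence interval lies entirely above or below 0."""
    return lower > 0 or upper < 0


# (estimate, lower limit, upper limit) as reported on these slides.
intervals = {
    "flipper length": (22.580, 10.809, 34.351),
    "bill length": (16.039, -11.015, 43.093),
}

for name, (estimate, lower, upper) in intervals.items():
    midpoint = (lower + upper) / 2
    margin = (upper - lower) / 2
    print(
        f"{name}: midpoint = {midpoint:.3f}, margin = {margin:.3f}, "
        f"excludes 0: {excludes_zero(lower, upper)}"
    )
```

An interval that excludes 0 is consistent with a statistically significant coefficient, which matches the `$p$`-values reported for these two variables.
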
---

# Linear Regression Assumptions

When we construct a linear regression model, we make several assumptions, just like the other statistical techniques we have covered in previous topics.

So far, we have focused on the model fitting and interpretation aspects of linear regression.

Over the following slides, we will cover the key linear regression assumptions.

### Key Assumptions

* .orangered_style[Linearity]
  
--

* .orangered_style[Constant Variance of Residuals]
  
--

* .orangered_style[Normality of Residuals]

For simple and multiple linear regression, we are assuming that `$\epsilon_i \sim N(0, \sigma^2)$`, for `$i = 1, 2, \ldots, n$`.

---

# Linearity Assumption

In our linear regression modelling, we assume that there exists a linear relationship between our .navy_style[dependent variable], and each .orangered_style[independent variable] included in the model.

This is simple to check - we can assess scatter plots of the variables, and look for patterns in the residuals plots for further evidence.

* If we observe curvature in the scatter plot, we can technically account for this by transforming the independent variable, but this is beyond the scope of .seagreen_style[BIO2POS]

---

# Residuals

Recall that a residual `$r_i$` is an estimate of the true error `$\epsilon_i$`, for individual `$i$`.

For a simple linear regression, we have:

`$$r_i = y_i - \hat{y}_i$$`

We can visualise this as the vertical distance between the observed value for individual `$i$`, and the corresponding estimated value on the fitted regression line.

For a multiple linear regression model, we have:

`$$r_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + \ldots + \hat{\beta}_p x_{pi})$$`

which we cannot visualise as easily.
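
Although residuals are harder to visualise for a multiple linear regression, the arithmetic is the same in both cases: observed value minus fitted value. The sketch below uses the simple linear regression from Topic 5A (body mass against flipper length) with three hypothetical observations.

```python
def fitted_body_mass(flipper_length):
    """Fitted value from the Topic 5A simple linear regression (grams)."""
    return -3037.196 + 34.573 * flipper_length


# Hypothetical observations: (flipper length in mm, observed body mass in grams).
observations = [(190, 3600), (195, 3650), (200, 3900)]

# Residual = observed value minus fitted value.
residuals = [mass - fitted_body_mass(flipper) for flipper, mass in observations]

for (flipper, mass), residual in zip(observations, residuals):
    print(f"flipper = {flipper} mm, observed = {mass} g, residual = {residual:.3f} g")
```

A positive residual means the observation sits above the fitted line, and a negative residual means it sits below.
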

---

This plot shows the process of computing residuals for a simple linear regression.

---

# Constant Variance of Residuals

We assume that the residuals of our linear regression model display .orangered_style[constant variance] (aka homoskedasticity).

* I.e. we expect the spread of the residuals to be similar across all of the fitted `$\hat{y}$` values
  
--

* If we plot the residuals (on the y-axis) against the fitted `$\hat{y}$` values (on the x-axis), we expect to observe a random scatter of points above and below the horizontal line `$y=0$`
 
 
<img src="data:image/png;base64,#chinstrap_mlr_residuals_vs_fitted.png" width="425px" style="display: block; margin: auto;" />

---

# Homoskedasticity Violations

If we observe any patterns in our residuals vs fitted values plot, this is evidence of a violation of our constant variance of residuals assumption.

Patterns could include:

* Fanning
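
Fanning means the spread of the residuals grows (or shrinks) as the fitted values increase. As an informal numerical companion to the plot, the sketch below compares the standard deviation of the residuals in the upper half of the fitted values with that in the lower half. This is a rough heuristic of our own for illustration, not a formal test and not a replacement for inspecting the plot. All values are hypothetical.

```python
import statistics


def spread_ratio(fitted, residuals):
    """Residual SD for the larger fitted values divided by that for the smaller."""
    pairs = sorted(zip(fitted, residuals))  # order individuals by fitted value
    half = len(pairs) // 2
    lower = [r for _, r in pairs[:half]]
    upper = [r for _, r in pairs[half:]]
    return statistics.stdev(upper) / statistics.stdev(lower)


fitted = [1, 2, 3, 4, 5, 6, 7, 8]

# Hypothetical residuals with a similar spread everywhere.
even_residuals = [1, -1, 1, -1, 1, -1, 1, -1]

# Hypothetical residuals whose size grows with the fitted value (fanning).
fanning_residuals = [f * r for f, r in zip(fitted, even_residuals)]

print(f"even spread: ratio = {spread_ratio(fitted, even_residuals):.2f}")
print(f"fanning:     ratio = {spread_ratio(fitted, fanning_residuals):.2f}")
```

A ratio close to 1 is consistent with constant variance, while a ratio well above or below 1 points towards fanning.
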

---

# Homoskedasticity Violations

If we observe any patterns in our residuals vs fitted values plot, this is evidence of a violation of our constant variance of residuals assumption.

Patterns could include:

* Curvature (this also suggests that the linearity assumption may be violated)
 
<img src="data:image/png;base64,#crab_curve_test.png" width="450px" style="display: block; margin: auto;" />

---

# Homoskedasticity Violations

If we observe any patterns in our residuals vs fitted values plot, this is evidence of a violation of our constant variance of residuals assumption.

Patterns could include:

* Both Fanning and Curvature! 
 
<img src="data:image/png;base64,#crab_terrible_resvsfits.png" width="450px" style="display: block; margin: auto;" />

---

# Checking Normality

We also assume that the residuals from our linear regression model follow a normal distribution.

Recall that we learnt several methods for checking the normality of residuals in [Topic 3A](https://rpubs.com/LTU_BIO2POS/DA3A).

The good news is that we can use those same methods here, for linear regression:

### How to check

* .orangered_style[Histogram of Residuals] with normal/density curve overlaid
  
--

* .orangered_style[Normal Q-Q Plot of Residuals]
  
--

* Formal statistical test: .orangered_style[Shapiro-Wilk test]

Remember, we are assessing the residuals from our model, not the original data.
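
To show what a Normal Q-Q plot is built from, the sketch below pairs each standardised residual with the quantile we would expect from a standard normal distribution. It assumes the plotting positions `$(i - 0.5)/n$`, which is one common convention (software packages may use slightly different ones). The residuals are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """(theoretical quantile, standardised residual) pairs for a Normal Q-Q plot."""
    n = len(residuals)
    ordered = sorted(residuals)
    centre, scale = mean(ordered), stdev(ordered)
    standard_normal = NormalDist()
    points = []
    for i, value in enumerate(ordered, start=1):
        theoretical = standard_normal.inv_cdf((i - 0.5) / n)
        points.append((theoretical, (value - centre) / scale))
    return points


# Hypothetical residuals (grams) from a fitted model.
example_residuals = [-210, -95, -40, -5, 12, 48, 101, 189]

for theoretical, observed in qq_points(example_residuals):
    print(f"theoretical = {theoretical:6.3f}, observed = {observed:6.3f}")
```

If the residuals are approximately normal, the observed values track the theoretical quantiles closely, so the points fall near a straight line when plotted.
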

---

# Residuals Normality Check - Penguin Example

.left-column[

* The histogram of residuals looks quite good
]

.right-column[
<img src="data:image/png;base64,#chinstrap_mlr_residuals_hist.png" width="475px" style="display: block; margin: auto;" />

]

---

# Residuals Normality Check - Penguin Example

.left-column[

* The Q-Q plot also appears reasonable
]

.right-column[
<img src="data:image/png;base64,#chinstrap_mlr_residuals_qq.png" width="475px" style="display: block; margin: auto;" />

]

---

# Residuals Normality Check Note

As a counterexample, a Normal Q-Q plot like this would raise concerns:

---

# Residuals Normality Check - Penguin Example

.pull-left[
<img src="data:image/png;base64,#chinstrap_sw.jpg" width="350px" style="display: block; margin: auto;" />

]

.pull-right[
  * Recall that the Shapiro-Wilk test is a test for normality; here we apply it to the residuals
  {{content}}
]

* A `$p$`-value `$< 0.05$` suggests the residuals are not normally distributed
{{content}}

* We cannot reject `$H_0$`, since `$p = 0.989 > 0.05$`, so we have no evidence that the residuals deviate from a normal distribution

.center[
It would appear overall that our regression model assumptions have been met for our .seagreen_style[Chinstrap penguins example]!
]

---

# Summary

We can use .orangered_style[Multiple Linear Regression (MLR)] to assess and model the strength and direction of the linear associations between a dependent variable and two or more independent variables.

* The MLR model is `$Y = \beta_0 + \beta_1x_1 + \beta_2x_2  + \ldots + \beta_px_p  + \epsilon$`
  
--

* The adjusted `$R^2$` value should be used instead of the `$R^2$` value when checking the quality of the fit of the MLR
  
--

* MLR model interpretations are 'on average'
  
--

* When interpreting the impact of an independent variable on our dependent variable, we assume the values of other independent variables remain fixed

* SLR and MLR assumptions include:
  
    * Linearity
    
    * Constant Variance of Residuals
    
    * Normality of Residuals

---

# End

That concludes our lecture on multiple linear regression.

### What to do next:

* .seagreen_style[Quick Kahoot revision quiz]: Please go to [kahoot.it](https://kahoot.it) and type in the code shown

* Make sure to attend this week's DA computer lab

* If you have any questions, check the LMS, email us or ask in the computer labs

### Optional Further Reading

* Parts of Kokoska (2020), Chapter 12
  
---

# References

* Green, P. T. (1997). Red crabs in rain forest on Christmas Island, Indian Ocean: activity patterns, density and biomass. *Journal of Tropical Ecology*, 13(1), 17-38.

* Horst, A. M., Hill, A. P. and Gorman, K. B. (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. [https://allisonhorst.github.io/palmerpenguins/](https://allisonhorst.github.io/palmerpenguins/). doi: 10.5281/zenodo.3960218.

* Kokoska, S. (2020). *Introductory statistics: a problem-solving approach* (3rd ed.). W. H. Freeman.

* The jamovi project. (2022). *Jamovi [Computer Software]*. [https://www.jamovi.org](https://www.jamovi.org).

---
class: middle

These notes have been prepared by Rupert Kuveke, Amanda Shaker, and other members of the Department of Mathematical and Physical Sciences. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with the Department of Environment and Genetics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License 
<a href = "https://creativecommons.org/licenses/by-nc-nd/4.0/" target="_blank"> BY-NC-ND. </a>