BIO2POS Lecture Topic 5A

class: middle
background-image: url(data:image/png;base64,#LTU_logo_clear.jpg)
background-position: top left
background-size: 25%

# BIO2POS 
# Correlation and Simple Linear Regression
## Data Analysis Topic 5A
### La Trobe University

---

# Welcome!

### In this lecture we will discuss how to test for linear association between numeric variables, using correlation and simple linear regression analyses.

Over the following slides, we will cover:

* .orangered_style[Correlation]
    
--

* Pearson's Correlation
--

* Spearman's Correlation
    
--

* .orangered_style[Simple Linear Regression]
  
--

* The Simple Linear Regression equation
    
--

* Interpreting regression output

---

# Intended Learning Objectives

### By the end of this lecture you will:

* understand the differences between Pearson and Spearman Correlation, and know when each is appropriate to use

* understand the concepts involved in fitting a .orangered_style[Simple Linear Regression]
  
--

* be able to correctly .seagreen_style[interpret] and .seagreen_style[summarise] results produced by the above techniques
  
--

The content you learn in Topics 5A and 5B will be beneficial for your analyses of numeric data.

We will practise content from this topic in this week's DA computer lab, which also has some additional extension material if you would like to extend your knowledge.

---

# Associations between Numeric Variables

When we have data for two or more .seagreen_style[numeric variables], it is natural to ask:

`$$\text{Is there some sort of relationship or association between these variables?}$$`

As an initial check, we can use the statistical measure .orangered_style[correlation] to measure the level or degree of association between two numeric variables.

* Correlation is a standardised measure, and can take values from `$-1$` to `$1$`

We will cover two correlation measures:

* .orangered_style[Pearson's Correlation]

* .orangered_style[Spearman's Correlation]

---

# Pearson's Correlation

When we first assess our numeric data, we will often produce scatter plots as part of a descriptive statistics analysis.

If our data appears to have a linear relationship, then we can assess the strength of this relationship using .orangered_style[Pearson's Correlation] (aka Pearson's Correlation Coefficient).

* We denote the population Pearson's Correlation using `$\rho$`
  
--

* We denote the sample Pearson's Correlation using `$r$`
  
--

Both `$\rho$` and `$r$` can take values from `$-1$` to `$1$` inclusive, with the size of the correlation denoting the strength of the linear association:

* `$0$` denotes no linear correlation (note that the variables may still be related in a non-linear way)
  
--

* `$-1$` denotes a perfect negative linear correlation
  
--

* `$1$` denotes a perfect positive linear correlation

Let us take a look at some examples, to help clarify Pearson's Correlation.
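
For reference, the sample correlation `$r$` can be computed directly from its standard definition: the sum of cross-products of deviations from the means, divided by the square root of the product of the two sums of squared deviations. The sketch below is illustrative only; the function name `pearson_r` and the data values are hypothetical and are not taken from the crab or penguin data sets.

```python
from math import sqrt

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical illustrative values
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # perfect positive linear relationship
y_down = [10, 8, 6, 4, 2]  # perfect negative linear relationship

print(pearson_r(x, y_up))    # r for a perfect positive linear relationship
print(pearson_r(x, y_down))  # r for a perfect negative linear relationship
```

In practice we will obtain `$r$` from jamovi rather than computing it ourselves.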

---

# Pearson's Correlation Examples

---

# Correlation Examples - Crab Data

Recall the .seagreen_style[Christmas Island Red Crab] data (Green, 1997) we have analysed in previous DA computer labs.

.left-column[

* If we compare the numeric variables `LEG` and `CLAW`, it looks like they have a clear linear relationship

* Using Pearson's Correlation here is appropriate

]

.right-column[
<img src="data:image/png;base64,#crab_leg_claw_scatter.jpg" width="500px" style="display: block; margin: auto;" />
]

---

# Correlation Examples - Crab Data

.left-column[

* If we compare the numeric variables `CW` and `WEIGHT` however, they have a clear **non-linear** relationship

* Using Pearson's Correlation here is **not** appropriate

]

.right-column[
<img src="data:image/png;base64,#crab_cw_weight_scatter.jpg" width="500px" style="display: block; margin: auto;" />
]

---

# Correlation Examples - Crab Data

.left-column[

* Note that even accounting for `SEX` differences, straight lines do not fit the data well

]

.right-column[
<img src="data:image/png;base64,#crab_cw_weight_scatter_lines.jpg" width="500px" style="display: block; margin: auto;" />
]

---

# Spearman's Correlation

If our data for our two numeric variables exhibit a non-linear relationship, then we **should not** use the .orangered_style[Pearson's Correlation] measure.

Instead, we can use the .orangered_style[Spearman's Correlation] (aka Spearman's Rank Correlation Coefficient). This is a good option for data which is:

* Non-linear
  
--

* Skewed
  
--

* Monotonic (i.e. consistently increasing or consistently decreasing, though not necessarily at a constant rate)

To differentiate between the Pearson's and Spearman's correlations, we can use:

* `$\rho_S$` to denote the population Spearman's Correlation
  
--

* `$r_S$` to denote the sample Spearman's Correlation
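
One way to see why Spearman's Correlation copes with non-linear but monotonic data is that it can be computed as Pearson's Correlation applied to the ranks of the observations. The sketch below assumes this rank-based definition (with tied values sharing their average rank); the function names and the data are hypothetical.

```python
from math import sqrt

def ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)

def spearman_r(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical data: y always increases with x, but not in a straight line
x = [1, 2, 3, 4, 5, 6]
y = [xi ** 3 for xi in x]

print(spearman_r(x, y))  # perfect monotonic association
print(pearson_r(x, y))   # below 1, since the relationship is not linear
```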

---

# Correlation Hypotheses and jamovi output

Rather than relying solely on visual inspection, we can conduct a formal test to determine if the population correlation is non-zero.

Our null and alternative hypotheses will be:

`$$H_0: \rho \text{ (or } \rho_S \text{)} = 0 \text{ vs } H_1: \rho \text{ (or } \rho_S\text{)} \neq 0$$`
.pull-left[
<img src="data:image/png;base64,#crab_claw_leg_correlation.jpg" width="350px" style="display: block; margin: auto;" />
]

.pull-right[
<img src="data:image/png;base64,#crab_weight_cw_correlation.jpg" width="600px" style="display: block; margin: auto;" />
]
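
As a rough guide to what sits behind a `$p$`-value of this kind, the standard textbook test of `$H_0: \rho = 0$` for Pearson's Correlation uses the statistic below, which follows a `$t$` distribution with `$n - 2$` degrees of freedom when `$H_0$` is true. This form is assumed here rather than taken from the jamovi documentation; tests for `$\rho_S$` are often based on a similar approximation.

`$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$`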

---

# Correlation Notes

You may have heard the expression

.center[
**Correlation** does not imply **Causation**
]

Even if two numeric variables are highly correlated, this does not necessarily mean they are related, or that changes in one are driving changes in the other.

* Sometimes, this could be a random .seagreen_style[coincidental occurrence]
  
--

* We should be wary of .orangered_style[Spurious Correlations]
  
--

* Often, the correlation may be due to a third, .seagreen_style[confounding variable]
  
    * E.g. a high (but spurious) correlation between mineral exports and tourism numbers in Australia could be due to exchange rate fluctuations

---

# Spurious Correlation Example - Kat Burglar?

.caption_style[
[Spurious correlations](https://www.tylervigen.com/spurious-correlations) by [Tyler Vigen](https://tylervigen.com/about-spurious-correlations), licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
]

---

# Correlation and Linear Regression

Correlation analysis is often used as a precursor to a .orangered_style[Simple Linear Regression] (SLR) analysis.

In a regression analysis, there is an assumed (or hypothesised) functional dependence of one variable on the other variable(s) used in the model.

We can use regression models to mathematically explain relationships between variables, and to predict values of variables, given some inputs.

We will introduce the notation used in a .orangered_style[Simple Linear Regression] via an example, and then expand upon this notation in .seagreen_style[Topic 5B].

---

# Linear Regression - Penguin Example

.pull-left[

**Scenario:** .seagreen_style[Chinstrap penguins] (*Pygoscelis antarcticus*) live on several islands in the Antarctic Oceans.

* Horst et al. (2020) collected data on various physical characteristics of 68 Chinstrap penguins living on Dream Island

* Suppose we would like to model the penguins' `body mass` (grams) against their `flipper length` (mm)
]

.pull-right[

.center[
.caption_style[
Note. From File:Pygoscelis antarctica DT -AQ Barrientos- (5) (20683924250).jpg, by [Diego Tirira](https://www.flickr.com/people/120935793@N02), 2010, Wikimedia Commons ([https://commons.wikimedia.org/](https://commons.wikimedia.org/)). [CC BY-SA 2.0 DEED](https://creativecommons.org/licenses/by-sa/2.0/deed.en)
]
]
]

---

# Linear Regression - Penguin Example

Initial analysis suggests that a .orangered_style[Simple Linear Regression] is suitable:

.pull-left[

<img src="data:image/png;base64,#chinstrap_scatter.jpg" width="500px" style="display: block; margin: auto;" />
]

.pull-right[

<img src="data:image/png;base64,#chinstrap_correlation.jpg" width="500px" style="display: block; margin: auto;" />
]

Here we have plotted `flipper length` observations on the x-axis, and `body mass` observations on the y-axis.

---

# Linear Regression Notation

When we construct a .orangered_style[Simple Linear Regression], we make a clear distinction between the variables `$x$` and `$y$`.

We treat the `$x$` variable (plotted on the .navy_style[x-axis]) as the .navy_style[independent variable].

* This is also known as the .navy_style[predictor variable]
  
--

We treat the `$y$` variable (plotted on the .orangered_style[y-axis]) as the .orangered_style[dependent variable].

* This is also known as the .orangered_style[response variable]

Generally, we assume that the values of the dependent variable `$y$` are determined (at least in part) by the values of the independent variable `$x$`.

In our .seagreen_style[Chinstrap penguins example], we are assuming that `body mass` `$(y)$` is (partly) determined by `flipper length` `$(x)$`.

* *Note: We can use linear regression even if causation is not present, but we need to be careful with our conclusions: in such circumstances the fitted model describes an association only, and any predictions from it should be treated with caution*

---

# SLR Equation

The .orangered_style[Simple Linear Regression] model is defined as:

`$$y = \beta_0 + \beta_1 x + \epsilon$$`

Here:

* `$y$` is the dependent variable

* `$x$` is the independent variable

* `$\beta_0$` is the intercept coefficient

* `$\beta_1$` is the slope coefficient

* `$\epsilon$` is an error term, with `$E(\epsilon) = 0$` (details in Topic 5B)

It may help to view this as just a slightly more sophisticated way of writing `$y = mx + c$`, where `$m$` is now `$\beta_1$`, and `$c$` is now `$\beta_0$`.

---

# Fitting a Line to our Data

When we fit a line to our data, we essentially are estimating appropriate values for `$\beta_0$` and `$\beta_1$`, using our sample data points `$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$`.

For any fitted line, there will be some vertical distance between each observed `$y$` value, and the corresponding estimated value `$\hat{y}$`.

* If our fitted line is poor, these vertical distances will typically be large (estimates are far from the observed values)

* If our fitted line is good, these vertical distances will typically be small (estimates are close to the observed values)

---

# The Line of Best Fit

There are various methods for calculating the line of best fit for a linear regression.

* The most common method is .orangered_style[Ordinary Least Squares (OLS)]
  
--

.orangered_style[OLS] obtains estimates for `$\beta_0$` and `$\beta_1$` which together minimise the sum of the squared values of the distances between all `$y_i$` and `$\hat{y}_i$` values.

* You are not expected to perform these calculations by hand - they can all be done inside jamovi
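
For those curious about what jamovi is doing here, the OLS estimates for a simple linear regression have a well-known closed form: the slope is the sum of cross-products of deviations divided by the sum of squared `$x$` deviations, and the intercept makes the line pass through the point of means. The sketch below assumes these standard formulas; the function names and data values are hypothetical.

```python
def fit_slr(x, y):
    """Return OLS estimates (intercept, slope) for a simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x  # line passes through (mean_x, mean_y)
    return intercept, slope

def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared vertical distances between observed and fitted y values."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical illustrative data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_slr(x, y)
best = sum_squared_residuals(x, y, b0, b1)
# Moving the intercept away from its OLS estimate makes the fit worse
worse = sum_squared_residuals(x, y, b0 + 0.5, b1)

print(b0, b1)
print(best < worse)
```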

---

# Fitted Simple Linear Regression Model

Using our data and the OLS method, we obtain a fitted simple linear regression model, which we can use to estimate the **average** `$y$` value for a given `$x$`:

`$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x$$`

* Note that the `$\hat{}$` denotes an estimate
  
--

Here:

* `$\hat{y}$` is our estimate of `$y$`
  
--

* `$x$` is not an estimate - we are not trying to predict this
  
--

* `$\hat{\beta}_0$` is our estimate of `$\beta_0$`
  
--

* `$\hat{\beta}_1$` is our estimate of `$\beta_1$`
  
--

* The `$\epsilon$` term has disappeared, as we interpret our results as being 'on average', and `$E(\epsilon)$` is assumed to equal `$0$`
  
---

# Fitted SLR Model in jamovi - Penguin Example

Let us take a look at the SLR results for our .seagreen_style[Chinstrap penguin example].

---

Here our estimated model is

`$$\hat{y} = -3037.196 + 34.573x$$`
--

It might help if we replace `$x$` and `$y$` with the actual variables they represent:

`$$\widehat{\text{body mass}} =  -3037.196 + 34.573 \times \text{flipper length}$$`
---

# SLR Coefficients - Penguin Example

When we fit a SLR model, we are really trying to determine if there is statistical evidence of a .orangered_style[linear association] between the dependent and independent variables.

To check this, we test the hypotheses:

`$$H_0: \beta_1 = 0 \text{ vs } H_1: \beta_1 \neq 0$$`

* If we cannot reject `$H_0$` (i.e. we have no evidence that `$\beta_1 \neq 0$`), then we should not use our SLR model - there is no statistically significant linear relationship between the variables `$x$` and `$y$`
  
--

The SLR model output will include a `$p$`-value for each `$\beta$` coefficient. In our .seagreen_style[Chinstrap penguin example], we have `$p < .001$` for `$\beta_1$`.

* **Note 1**: *For theoretical reasons, we always include `$\hat{\beta}_0$` in our model, even if it has a `$p$`-value `$> \alpha$`*
  
* **Note 2**: *Sometimes the `$\hat{\beta}_0$` value may not make sense at face value. Keep in mind the range of reasonable `$x$` values*
---

## SLR Confidence Intervals - Penguin Example

In our .seagreen_style[Chinstrap penguin example], you may have noticed that `$\widehat{\beta}_0$` and  `$\widehat{\beta}_1$` also had associated confidence intervals:

These provide us with an interval of likely values for each of these coefficients, and offer an additional way to check if our `$\beta$` coefficients can be considered non-zero.

For example, we could say that we are 95% confident that the true value of `$\beta_1$` is between `$24.414$` and `$44.732$`.

* It follows that flipper length is a .orangered_style[significant predictor] of body mass, since the 95% CI for `$\beta_1$` does not contain 0.
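
The check described above can be written out in a few lines. The sketch below uses the interval limits and slope estimate quoted on these slides; the helper name `excludes_zero` is hypothetical. It also confirms that the interval is centred on the estimate `$\hat{\beta}_1$`, as we would expect for an interval of this type.

```python
# Values quoted on the slides for the Chinstrap penguin example
lower, upper = 24.414, 44.732
slope_estimate = 34.573

def excludes_zero(lower, upper):
    """True if the interval lies entirely above or entirely below zero."""
    return lower > 0 or upper < 0

midpoint = (lower + upper) / 2    # centre of the interval
half_width = (upper - lower) / 2  # distance from the centre to either limit

print(excludes_zero(lower, upper))
print(round(midpoint, 3), round(half_width, 3))
```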

---

# Interpreting `$R^2$`

Part of our SLR model output are values for `$R$` and `$R^2$`.

`$R$` denotes the sample correlation between the two variables (i.e. it is our Pearson's Correlation coefficient, see slide 16).

`$R^2$` measures the proportion of variation in the dependent variable that is explained by the independent variable (often reported as a percentage).

* The higher the `$R^2$`, the better 
  
--

* We can think of `$R^2$` as being a little like an effect size for linear regressions
 
<img src="data:image/png;base64,#chinstrap_rsquared.jpg" width="300px" style="display: block; margin: auto;" />
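
To connect `$R$` and `$R^2$`, the sketch below computes `$R^2$` as one minus the ratio of the residual sum of squares to the total sum of squares, and compares it with the square of the Pearson correlation. For a simple linear regression fitted by OLS these two quantities agree. The function name and data values are hypothetical, and the formulas are the standard textbook ones rather than anything specific to jamovi.

```python
from math import sqrt

def r_squared_and_r(x, y):
    """Return (R squared from the fitted line, Pearson r) for paired data."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)  # total sum of squares
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares: variation left unexplained by the fitted line
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - ss_res / s_yy
    r = s_xy / sqrt(s_xx * s_yy)
    return r_squared, r

# Hypothetical illustrative data
x = [1, 2, 3, 4, 5, 6]
y = [2.0, 2.9, 4.5, 4.1, 6.3, 6.0]

r2, r = r_squared_and_r(x, y)
print(round(r2, 3), round(r ** 2, 3))  # the two values agree
```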

---

# Interpreting `$R^2$`

What constitutes a 'good' `$R^2$` is subjective and can depend on the subject matter, but the following specifications can be used as a guide:

.shadedbox[ .center[
`$0.8 \leq R^2 \leq 1$`: `$\qquad$` "Excellent fit"

`$0.5 \leq R^2 < 0.8$`: `$\qquad$` "Good fit"

`$0.25 \leq R^2 < 0.5$`: `$\qquad$` "Moderate fit"

`$0 \leq R^2 < 0.25$`: `$\qquad$` "Weak fit"
]
]
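
The guide above can be expressed as a small lookup. This is only a sketch of the thresholds listed on this slide; the function name `describe_fit` is hypothetical, and as noted the cut-offs themselves are subjective.

```python
def describe_fit(r_squared):
    """Map an R squared value to the descriptive label used in the guide."""
    if not 0 <= r_squared <= 1:
        raise ValueError("R squared must lie between 0 and 1")
    if r_squared >= 0.8:
        return "Excellent fit"
    if r_squared >= 0.5:
        return "Good fit"
    if r_squared >= 0.25:
        return "Moderate fit"
    return "Weak fit"

# Hypothetical R squared values, one from each band
for value in (0.9, 0.6, 0.3, 0.1):
    print(value, describe_fit(value))
```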

---

# SLR Summary - Penguin Example

A .orangered_style[simple linear regression] model was fitted to data on `$n=68$` .seagreen_style[Chinstrap penguins] from Dream Island in the Palmer Archipelago in Antarctica. Penguin body mass (grams) was regressed against flipper length (mm).

The Pearson Correlation between these two numeric variables was large, positive, and statistically significant at the `$0.05$` level of significance, with `$r = 0.642$`, `$p < .001$`.

The fitted SLR model was:

`$$\widehat{\text{body mass}} =  -3037.196 + 34.573 \times \text{flipper length}$$`
--

We estimate that on average, a 1 mm increase in flipper length is associated with a 34.573 gram increase in body mass in Chinstrap penguins.

The model has a moderate fit, with `$R^2 = 0.412$`. The linear association is statistically significant, with flipper length being a .orangered_style[significant predictor] of body mass `$(\hat{\beta}_1 = 34.573, p < .001)$`.

---

# Using Regression Models for Predictive Purposes

We can use our fitted SLR model to predict `$\hat{y}$` values given new `$x$` values.

.pull-left[

.caption_style[
[Extrapolating](https://xkcd.com/605/) by [xkcd](https://xkcd.com), licensed under [CC BY-NC 2.5](https://creativecommons.org/licenses/by-nc/2.5/)
]
]

.pull-right[

**Interpolation:** Predicting  `$\hat{y}$` values using `$x$` values within the range of known `$x$` values used to fit the regression model.
{{content}}
]
--

**Extrapolation:** Predicting  `$\hat{y}$` values using `$x$` values outside of the range of known `$x$` values used to fit the regression model.
{{content}}

**Note:** It is often dangerous and/or unreasonable to predict `$\hat{y}$` values too far outside the range of your known `$x$` values.

---

# Using Regression Models for Predictive Purposes

As an example, suppose that we observe a penguin with a flipper length of 130 mm, and would like to estimate the penguin's body mass.

Using our fitted SLR model, we can substitute this value of 130 in for flipper length, such that:

`$$\widehat{\text{body mass}} =  -3037.196 + 34.573 \times 130 \approx 1457.294 \text{ grams}$$`
--

Interpreting this result, we estimate that, on average, a penguin with a flipper length of 130 mm will have a body mass of 1457.294 grams.

* Remember, this interpretation is **on average**, and does not tell us the body mass of the specific penguin measured.

* Suppose we extrapolate using our fitted SLR model, and try to estimate the average body mass of a penguin with a flipper length of 50 mm - can you think of any issues that may arise?
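
The substitution above is easy to reproduce. The sketch below hard-codes the fitted coefficients quoted on these slides; the function name `predict_body_mass` is hypothetical.

```python
# Fitted coefficients from the Chinstrap penguin SLR model on these slides
INTERCEPT = -3037.196  # estimate of beta_0 (grams)
SLOPE = 34.573         # estimate of beta_1 (grams per mm of flipper length)

def predict_body_mass(flipper_length_mm):
    """Estimated average body mass (grams) for a given flipper length (mm)."""
    return INTERCEPT + SLOPE * flipper_length_mm

print(round(predict_body_mass(130), 3))  # 1457.294
```

Calling the same function with a flipper length of 50 is a quick way to explore the extrapolation question posed above.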

---

# Summary

We can use .orangered_style[Correlation] and .orangered_style[Simple Linear Regression (SLR)] to assess and model the strength and direction of the linear association between two .seagreen_style[numeric variables].

* If our variables appear to have a linear association, we can use .orangered_style[Pearson's Correlation] to assess the strength of the linear association
  
--

* If our variables have a non-linear association, we can instead use .orangered_style[Spearman's Correlation]
  
--

* The SLR model is `$y = \beta_0 + \beta_1x + \epsilon$`
  
--

* The `$R^2$` value denotes the quality of the fit of the SLR `$(0 \leq R^2 \leq 1)$`, with higher values being better
  
--

* SLR model interpretations are 'on average'

---

# End

That concludes our lecture on correlation and simple linear regression.

### What to do next:

* Make sure to attend the Topic 5B lecture, in which we will continue and extend our introduction to regression modelling

* If you have any questions, check the LMS, email us or ask in the computer labs

### Optional Further Reading

* Parts of Kokoska (2020), Chapter 12
  
---

# References

* Green, P. T. (1997). Red crabs in rain forest on Christmas Island, Indian Ocean: activity patterns, density and biomass. *Journal of Tropical Ecology*, 13(1), 17-38.

* Horst, A. M., Hill, A. P., & Gorman, K. B. (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. [https://allisonhorst.github.io/palmerpenguins/](https://allisonhorst.github.io/palmerpenguins/). doi: 10.5281/zenodo.3960218.

* Kokoska, S. (2020). *Introductory statistics: a problem-solving approach* (3rd ed.). W. H. Freeman.

* The jamovi project. (2022). *jamovi* [Computer software]. [https://www.jamovi.org](https://www.jamovi.org).

---
class: middle

These notes have been prepared by Rupert Kuveke, Amanda Shaker, and other members of the Department of Mathematical and Physical Sciences. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with the Department of Environment and Genetics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License 
<a href = "https://creativecommons.org/licenses/by-nc-nd/4.0/" target="_blank"> BY-NC-ND. </a>