BIO2POS Lecture Topic 2B

class: middle
background-image: url(data:image/png;base64,#LTU_logo_clear.jpg)
background-position: top left
background-size: 25%

# BIO2POS 
# Paired and two sample `$t$`-tests & non-parametric alternatives
## Data Analysis Topic 2B
### La Trobe University

---

# Welcome!

### In this lecture we will introduce additional types of `$t$`-tests, and their non-parametric equivalents.

Over the following slides, we will cover:

* .orangered_style[Paired *t*-test]
  
--

* .orangered_style[Two sample *t*-test]

    * Student's `$t$`
    
    * Welch's `$t$`
    
--

* .orangered_style[*t*-test assumptions]
  
--

* .orangered_style[Wilcoxon Signed-Rank test]

* .orangered_style[Mann-Whitney U test]

---

# Intended Learning Objectives

### By the end of this lecture you will:

* understand when to use paired `$t$`-tests and two sample `$t$`-tests

* be able to assess whether `$t$`-test assumptions have been met

* know which test is appropriate to use when `$t$`-test assumptions fail
  
--

* be able to correctly interpret and summarise the results of the above tests
  
--

The content you learn in Topics 2A and 2B will provide you with a solid foundation for conducting a variety of statistical tests, and we will extend these skills in future DA topics.

We will practise content from this topic in this week's DA computer lab, which also includes some optional extension material.

---

# Types of `$t$`-tests

Recall that we introduced the concept of `$t$`-tests in the [DA Topic 2A lecture](https://rpubs.com/LTU_BIO2POS/DA2A).

A .orangered_style[one sample *t*-test] uses the sample mean `$\overline{X}$` to test whether the population mean `$\mu$` differs from, is greater than, or is less than a fixed reference value `$\mu_0$`.

* In some contexts, the one sample `$t$`-test will not be appropriate to use

A .orangered_style[paired *t*-test] examines the .bold_style[mean difference] between .orangered_style[two dependent groups] (e.g. *before* and *after*)
  
--

A .orangered_style[two sample *t*-test] (aka independent samples *t*-test) compares the .bold_style[sample means] of .orangered_style[two independent groups] (e.g. *cats* and *dogs*) to determine if the population means are different
  
---

# Paired `$t$`-Tests

We can use a .orangered_style[paired *t*-test] when we have .seagreen_style[paired or repeated measures data] for two dependent groups of individuals.

We begin by computing a sample set of .orangered_style[differences] between the two measurements for each individual (e.g. *after measurement* minus *before measurement*).

* Let `$\mu_d$` denote the .orangered_style[population mean difference].
  
--

Paired `$t$`-test hypotheses will generally take the form:

`$$H_0: \mu_d = 0 \text{ versus } H_1: \mu_d \neq 0\text{ or }$$`

`$$H_0: \mu_d = 0 \text{ versus } H_1: \mu_d < 0\text{ or }$$`

`$$H_0: \mu_d = 0 \text{ versus } H_1: \mu_d > 0$$`

*Note that once we have our set of sample differences, the .orangered_style[paired *t*-test] becomes equivalent to a .orangered_style[one sample *t*-test]*
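
The equivalence noted above can be sketched in Python using only the standard library. The `before` and `after` values are made-up numbers for illustration (not data from this lecture), and `paired_t` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(first, second):
    """Paired t statistic, computed as a one sample t-test on the differences."""
    diffs = [b - a for a, b in zip(first, second)]  # second minus first
    n = len(diffs)
    se = stdev(diffs) / sqrt(n)                     # standard error of the mean difference
    return mean(diffs) / se, n - 1                  # t statistic (H0: mu_d = 0) and df

# Hypothetical before/after measurements for four individuals
before = [10, 12, 9, 11]
after = [12, 15, 9, 14]
t_stat, df = paired_t(before, after)
print(f"t({df}) = {t_stat:.3f}")
```

Only the differences enter the calculation, which is why the original two columns never need to be analysed separately.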
---

# Paired `$t$`-test in practice

Arenales Arauz et al. (2023) assessed if .seagreen_style[whole-body vibration (WBV)] was beneficial for .seagreen_style[cognitive functions].

Individuals sat on a WBV device for 2 minutes, then completed a .seagreen_style[neuropsychological Stroop Colour Word Interference Test (CWIT)]: 
  
  * 52 cards with colour names in different ink colours presented (e.g. .red_style[green])
  
    * The colour must be named correctly
    
    * Completion time recorded (smaller times better)

Control results were also recorded, with the same individuals completing the test after sitting on a normal chair for 2 minutes
  
--

Let `$\mu_d = \mu_{vib} - \mu_{con}$` denote the true population mean difference between WBV (vib) and control (con) results.

We have `$$H_0: \mu_d = 0 \text{ vs } H_1: \mu_d \neq 0$$`

---

# Paired `$t$`-test jamovi output

---

# Paired `$t$`-test Summary - WBV Example

A .orangered_style[paired *t*-test] was conducted to determine if there was a difference in the mean CWIT completion time (seconds) of `$n=60$` individuals who had sat for 2 minutes in a .seagreen_style[WBV device] compared to in a normal chair.

WBV use led on average to a faster completion time `$(M_{vib} = 33.994 \text{ seconds, } SD_{vib} = 6.226 \text{ seconds})$` compared to normal chair use `$(M_{con} = 34.854 \text{ seconds, } SD_{con} = 6.175 \text{ seconds})$`.

Results of a paired `$t$`-test suggested the mean difference was .orangered_style[statistically significantly non-zero] at the `$\alpha = 0.05$` level of significance, with:

* `$t(59) = -2.210$`

* `$p = 0.031 < 0.05$` (two-tailed)

* `$95\%$` CI of `$(-1.639, -0.081)$`

Note that the .orangered_style[Cohen's *d*] `$= -0.285$` was small, suggesting that the result may not be clinically important.
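
As a rough consistency check, the reported values can be tied together in a few lines of Python. This sketch assumes Cohen's *d* for paired data is the mean difference divided by the standard deviation of the differences (equivalently `$t/\sqrt{n}$`); the variable names are illustrative only.

```python
from math import sqrt

# Values reported on this slide for the WBV example
n = 60
mean_vib, mean_con = 33.994, 34.854
t_stat = -2.210
ci_low, ci_high = -1.639, -0.081

mean_diff = mean_vib - mean_con        # vibration minus control
se_diff = mean_diff / t_stat           # since t = mean_diff / SE
sd_diff = se_diff * sqrt(n)            # SD of the paired differences
cohens_d = mean_diff / sd_diff         # same as t_stat / sqrt(n)
ci_midpoint = (ci_low + ci_high) / 2   # a symmetric CI is centred on the mean difference

print(round(mean_diff, 3), round(ci_midpoint, 3), round(cohens_d, 3))
```

Under that assumption, the computed effect size agrees with the reported `$d = -0.285$`, and the confidence interval is centred on the mean difference.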

---

# Two sample `$t$`-Tests Overview

We can use a .orangered_style[two sample *t*-test] when we have data for .orangered_style[two independent groups] of individuals.

* *Note that two sample `$t$`-tests and independent `$t$`-tests mean the same thing.*

First, we compute a separate sample mean for each group, and then we test if these means are different.

* Let `$\mu_1$` and `$\mu_2$` denote the .orangered_style[population means] for groups 1 and 2 respectively.
  
    * You can use other subscript notation as desired, as long as it is a rational choice (e.g. `$\mu_{cat}$` and `$\mu_{dog}$` could work if you are comparing cats and dogs)
  
---

# Two sample `$t$`-Tests Hypotheses Options

.pull-left[
Two sample `$t$`-test hypotheses will generally take the form:

`$$H_0: \mu_1 = \mu_2 \text{ vs } H_1: \mu_1 \neq \mu_2,$$`
{{content}}
]

or `$$H_0: \mu_1 = \mu_2 \text{ vs } H_1: \mu_1 < \mu_2,$$`
{{content}}
--

or `$$H_0: \mu_1 = \mu_2 \text{ vs } H_1: \mu_1 > \mu_2$$`

.pull-right[
These are equivalent, respectively, to writing:

`$$H_0: \mu_1 - \mu_2 = 0 \text{ vs } H_1: \mu_1 - \mu_2 \neq 0$$`
{{content}}
]

or `$$H_0: \mu_1 - \mu_2 = 0 \text{ vs } H_1: \mu_1 - \mu_2 < 0$$`
{{content}}

or `$$H_0: \mu_1 - \mu_2 = 0 \text{ vs } H_1: \mu_1 - \mu_2 > 0$$`
--

Let us look at an example now, to discuss how to apply a two sample `$t$`-test in practice.

---

# Two sample `$t$`-test in practice

.bold_style[Scenario]: .seagreen_style[South Georgian Diving Petrels] (*Pelecanoides georgicus*) live in various locations in the Southern Hemisphere.

A study by Fischer et al. (2018) used statistical analyses of phenotypic differentiations of the different petrel populations to identify a .seagreen_style[new, endangered species] living in .seagreen_style[New Zealand].

In this example analysis of some of their data, we compare the .orangered_style[mean wing length (mm)] of petrels split into .orangered_style[two independent groups] - the NZ group, and a group containing petrels from all other recorded locations.

We have:

`$$H_0: \mu_{NZ} = \mu_{other} \text{ vs } H_1:  \mu_{NZ} \neq \mu_{other}$$`
--

where:

* `$\mu_{NZ}$` denotes the true population mean wing length (mm) of petrels in NZ
  
* `$\mu_{other}$` denotes the true population mean wing length (mm) of petrels not in NZ
---

class: middle

<img src="data:image/png;base64,#pelecanoides_georgicus.jpg" width="600px" style="display: block; margin: auto;" />
<center>
.caption_style[
Note. From File:Pelecanoides_georgicus,_South_Georgian_diving_petrel.jpg, by TheyLookLikeUs, 2015, 
Wikimedia Commons ([https://commons.wikimedia.org/](https://commons.wikimedia.org/)). [CC BY SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/deed.en)
]
</center>

---

# Two sample `$t$`-test jamovi output

* Here Group 1 refers to the NZ petrels.

---

# Two sample `$t$`-test versions

.bold_style[Important Note:] There are two versions of the two sample `$t$`-test we can use.

* .orangered_style[Student's *t*-test] (default jamovi option)

* .orangered_style[Welch's *t*-test]

The option we select will depend on whether we treat the variances for the two groups as equal, or unequal. By default, we assume the variances are equal.
  
--

* If the variances are equal, we can use the Student's *t*-test

* However, if the variances are unequal, the specified level of significance we have picked for our Student's `$t$`-test (e.g. `$\alpha = 0.05$`) might become inaccurate (e.g. the Type I error rate may rise above the nominal level)

* If the variances are unequal, we should use the more robust .orangered_style[Welch's *t*-test]
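
The difference between the two versions can be sketched from summary statistics alone. The function below is a minimal illustration, not jamovi's implementation; `two_sample_t` and the example numbers are hypothetical.

```python
from math import sqrt

def two_sample_t(m1, sd1, n1, m2, sd2, n2, equal_var=True):
    """Return (t, df) for a two sample t-test from summary statistics."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    if equal_var:
        # Student: pool the two variances, df = n1 + n2 - 2
        pooled = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
        se = sqrt(pooled * (1 / n1 + 1 / n2))
        df = n1 + n2 - 2
    else:
        # Welch: keep the variances separate, Welch-Satterthwaite df
        se = sqrt(v1 + v2)
        df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return (m1 - m2) / se, df

# Hypothetical groups: the smaller group is far more variable
student = two_sample_t(10.0, 6.0, 8, 12.0, 2.0, 40, equal_var=True)
welch = two_sample_t(10.0, 6.0, 8, 12.0, 2.0, 40, equal_var=False)
print(student, welch)
```

With these made-up numbers, the Student version returns a larger `$|t|$` and more degrees of freedom than Welch, which illustrates how pooling can overstate the evidence when the smaller group has the larger variance. When the group sizes and standard deviations are equal, the two versions coincide.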

---

# Levene's Test

We can use a formal statistical test called .orangered_style[Levene's Test] (aka Homogeneity of Variances Test) to check our assumption of equal variance.

* Levene's Test uses the hypotheses:
  
  `$$H_0: \text{ Group Variances are equal vs } H_1: \text{ Group Variances are unequal}$$`
--

If the Levene's Test `$p$`-value is less than `$0.05$`, we reject `$H_0$` and use the .orangered_style[Welch's *t*-test] version of the two sample `$t$`-test; otherwise we use the default option.

--

.bold_style[Suggestion]: If in doubt, always use Welch's `$t$`-test.
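
For reference, the Levene statistic itself can be sketched in Python. This is a minimal version that centres each group on its mean; some software centres on the median instead, so values may differ slightly. The data and the name `levene_statistic` are hypothetical, and the `$p$`-value step (comparing `$W$` to an `$F$` distribution with `$k-1$` and `$N-k$` degrees of freedom) is left out because the standard library has no `$F$` distribution.

```python
from statistics import mean

def levene_statistic(*groups):
    """Mean-centred Levene statistic W for two or more groups."""
    # Absolute deviation of each observation from its own group mean
    z = [[abs(x - mean(g)) for x in g] for g in groups]
    z_means = [mean(zi) for zi in z]
    n_total = sum(len(zi) for zi in z)
    k = len(z)
    grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zm - grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within

# Hypothetical data: group_b is twice as spread out as group_a
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
w_stat = levene_statistic(group_a, group_b)
print(round(w_stat, 3))
```

Larger values of `$W$` indicate that the typical deviation from the group centre differs more between groups than within them.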

---

# Two sample `$t$`-test (with unequal variance) jamovi output

* *Note that the main outcome remains the same (reject `$H_0$`), although some values change slightly*
  
---

# Two sample `$t$`-test Effect Size

To compute .orangered_style[Cohen's *d*] for a two sample `$t$`-test, we have to take into account information from both groups, so our effect size formula changes slightly.

* To complicate matters slightly, there will be different `$d$` equations for the Student and Welch versions
  
--
  
  * You don't need to memorise these (you can check the jamovi output), but formulae are included below for reference

`$$d_{Student} = \dfrac{\overline{X}_1 - \overline{X}_2}{\sqrt{\dfrac{(n_1-1)SD_1^2 + (n_2-1)SD_2^2}{n_1 + n_2 -2}}}$$`

`$$d_{Welch} = \dfrac{\overline{X}_1 - \overline{X}_2}{\sqrt{\dfrac{SD_1^2 + SD_2^2}{2}}}$$`
---

# Two sample `$t$`-test Summary - Petrel Example

A .orangered_style[two sample *t*-test] was conducted to determine if there was a difference in the mean wing length (mm) of .seagreen_style[South Georgian Diving Petrel] populations in New Zealand (*NZ petrels*) compared to other locations (*other petrels*).

Other petrels' wing lengths `$(M_1 = 117.232 \text{ mm}, SD_1 = 4.267 \text{ mm}, n_1 = 69)$` were on average smaller than NZ petrels' wing lengths `$(M_2 = 119.75 \text{ mm}, SD_2 = 2.599 \text{ mm}, n_2 = 128)$`.

The assumption of equal group variances was rejected (.orangered_style[Levene's Test] `$p$`-value `$< .001$`), so a .orangered_style[Welch's *t*-test] was conducted.

Results of the Welch's two sample `$t$`-test suggested the difference in average wing lengths between NZ and other petrels was medium-to-large (.orangered_style[Cohen's *d*] `$= -0.713$`) and .orangered_style[statistically significant] at the `$\alpha = 0.05$` level of significance, with `$t(95.865) = -4.475$`, `$p < .001$` (two-tailed), and a `$95\%$` CI of `$(-3.635, -1.401)$`.

* We are `$95\%$` confident that the true population mean wing length for other petrels is between 1.401 mm and 3.635 mm smaller than the true population mean wing length for NZ petrels.

As a result, we can .orangered_style[reject] `$H_0:\mu_{NZ} = \mu_{other}$` and conclude there is a difference in the mean wing lengths across petrel populations.
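
The reported Welch results can be approximately reproduced from the summary statistics on this slide. The sketch below takes the difference as group 1 (other) minus group 2 (NZ), the direction that matches the negative values reported; variable names are illustrative.

```python
from math import sqrt

# Summary statistics reported on this slide (group 1 = other, group 2 = NZ)
m1, sd1, n1 = 117.232, 4.267, 69
m2, sd2, n2 = 119.75, 2.599, 128

v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2   # squared standard errors of each mean
t_welch = (m1 - m2) / sqrt(v1 + v2)
df_welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
d_welch = (m1 - m2) / sqrt((sd1 ** 2 + sd2 ** 2) / 2)

print(f"t({df_welch:.3f}) = {t_welch:.3f}, d = {d_welch:.3f}")
```

The computed `$t$`, degrees of freedom and Cohen's `$d$` agree with the reported values to three decimal places. The confidence interval is not reproduced here, as it needs a `$t$` critical value that the standard library does not provide.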

---

# `$t$`-test assumptions

When we conduct a `$t$`-test, we make several assumptions.

To ensure our analysis procedure and conclusion are valid, we should always check these assumptions!

### Key Assumptions

* The data are .orangered_style[numeric]
  
--

* Observations are .orangered_style[independent of one another] (i.e. we have a simple random sample, with each individual in the population equally likely to be selected)
  
--

* The sampling distribution of the sample mean(s) is (approximately) normal
  
--

* .bold_style[Two Sample *t*-tests]: Variances of groups are equal

---

# `$t$`-test assumptions

Checking normality is typically the most important check for `$t$`-tests. The values we check depend on the test:

* .bold_style[One Sample *t*-test]: Check the dependent variable

* .bold_style[Paired *t*-test]: Check paired differences (not the original observations)
  
--

* .bold_style[Two Sample *t*-test]: Check dependent variable by group (both sample sets)
--

### How to check

* .orangered_style[Histogram] with normal/density curve overlaid
  
--

* .orangered_style[Normal Q-Q Plot]
  
--

* Formal statistical test: .orangered_style[Shapiro-Wilk test] and/or .orangered_style[Kolmogorov-Smirnov test]
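
To show what a Normal Q-Q plot compares, the sketch below pairs each standardised observation with its theoretical normal quantile. It assumes the plotting positions `$(i - 0.5)/n$`, which is one common convention (software may use another); the sample and the name `qq_points` are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def qq_points(values):
    """Pair each standardised observation with its theoretical normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    centre, spread = mean(ordered), stdev(ordered)
    points = []
    for i, x in enumerate(ordered, start=1):
        theoretical = NormalDist().inv_cdf((i - 0.5) / n)  # plotting position (i - 0.5) / n
        points.append((theoretical, (x - centre) / spread))
    return points

# Hypothetical right-skewed sample: the largest value sits well above the line y = x
sample = [2.1, 2.4, 2.5, 2.8, 3.0, 3.3, 3.9, 7.5]
for theoretical, observed in qq_points(sample):
    print(f"{theoretical:6.2f} {observed:6.2f}")
```

If the data were close to normal, the two columns would be similar in every row; systematic departures at the ends suggest skewness or heavy tails.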

---

# Assumptions check: Petrel Example

.left-column[
* The other petrels' wing lengths are multi-modal (perhaps because this group combines several populations)

* NZ petrels' wing lengths appear potentially normally distributed
]

.right-column[

<img src="data:image/png;base64,#histograms_petrels.jpg" width="500px" style="display: block; margin: auto;" />
]

---

# Assumptions check: Petrel Example

.left-column[
  * The Q-Q plots suggest some non-normality for both groups (note non-linear characteristics, deviation from theoretical line)
]

.right-column[

<img src="data:image/png;base64,#qqplots_petrels.jpg" width="500px" style="display: block; margin: auto;" />
]

---

# Assumptions check: Petrel Example

.pull-left[

<img src="data:image/png;base64,#shapiro_petrels.jpg" width="500px" style="display: block; margin: auto;" />
]

.pull-right[
  * The Shapiro-Wilk test is testing for normality of the data
  {{content}}
]

* A `$p$`-value `$< 0.05$` suggests the data is non-normal
{{content}}

* Here the NZ petrels group fails the normality test `$(p = 0.022)$`, while the other group barely passes `$(p = 0.055)$`

.center[
It would appear our `$t$`-test assumptions have not been met!
]
---

# Normality Considerations

Even if our sample data is non-normal, as long as we have a large sample size `$(n \geq 30)$` we can treat the distribution of the sample mean as approximately normal, via the .orangered_style[Central Limit Theorem (CLT)].

* So for our .seagreen_style[petrel example], technically we can still use the `$t$`-distribution for our sample mean, and conduct the `$t$`-test
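
The CLT statement above can be illustrated with a small simulation. This sketch draws repeated samples from a right-skewed (exponential) population, which is an arbitrary choice for illustration, and compares the skewness of the resulting sample means for a small and a large sample size. All names are hypothetical.

```python
import random
from statistics import mean

def skewness(values):
    """Simple moment-based sample skewness."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def simulated_sample_means(sample_size, repeats, rng):
    """Means of repeated samples from a right-skewed (exponential) population."""
    return [mean([rng.expovariate(1.0) for _ in range(sample_size)])
            for _ in range(repeats)]

rng = random.Random(2024)  # fixed seed so the sketch is repeatable
skew_small = skewness(simulated_sample_means(2, 2000, rng))
skew_large = skewness(simulated_sample_means(30, 2000, rng))
print(round(skew_small, 2), round(skew_large, 2))
```

The sample means for `$n = 30$` should be noticeably less skewed than those for `$n = 2$`, although some skew remains, which is why the shape of the data still matters.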
  
--

**Concern:** Even if the `$t$`-test is still technically valid, is the mean the best measure of location to use here?

---

# Normality Assumption: Making a Decision

Our decision for the `$t$`-test normality assumption depends on several factors:

### Use the `$t$`-test

* If the distribution of the data is .seagreen_style[normal], and `$n \geq 20$`, the assumption is ok
  
--

* If `$n \geq 30$` and the distribution of the data is .seagreen_style[symmetric], everything is fine
  
--

### Do not use the `$t$`-test

* .orangered_style[If *n* < 20], as normality is hard to establish
 
--

* If `$20 \leq n < 30$` and .orangered_style[normality tests fail] and/or the distribution of the data is .orangered_style[asymmetric], as the normality assumption is violated

* If `$n \geq 30$` and the distribution of data is .orangered_style[asymmetric], as the mean is not the best measure to assess
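
The rules of thumb on this slide can be written as a small decision function. This is only a rough encoding: `looks_normal` and `looks_symmetric` stand for judgements you make from plots and normality tests, `$n = 20$` is assumed to fall in the `$20 \leq n < 30$` band, and the function name is hypothetical.

```python
def t_test_normality_ok(n, looks_normal, looks_symmetric):
    """Rough encoding of the slide's rules of thumb; True means the t-test is reasonable."""
    if n < 20:
        return False                              # too few observations to judge normality
    if n < 30:
        return looks_normal and looks_symmetric   # the data themselves need to look normal
    return looks_symmetric                        # CLT covers the mean, but skew still matters

print(t_test_normality_ok(15, True, True))
print(t_test_normality_ok(25, True, True))
print(t_test_normality_ok(60, False, False))
```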
  
---

# Next Steps when Assumptions Fail

Fortunately, if one or more of the `$t$`-test assumptions have failed, we have several options available.

* Transform the data (e.g. take the log of right-skewed data) to make it more normal
  
--

* Use .orangered_style[non-parametric tests] (which don't rely on parameters like `$\sigma$`, `$df$`). 
  
  These tests:
  
--

* Do not assume an underlying normal distribution
    
--

* Typically use ranks and medians rather than means as measures
    
--

* Are robust to skewed distributions, but not always as powerful as the `$t$`-tests
    
--

The two tests we will look at here, to conclude this topic, are the .orangered_style[Wilcoxon Signed Rank Test] and the .orangered_style[Mann-Whitney U Test].

---

# Introduction to Ranks

An alternative way to consider our data is via ranks.

Ranking data consists of sorting our observations from smallest to largest, to create an ordered list for our subsequent test calculations.

* For one-sample tests, we first subtract the null hypothesis value from each observation
  
* For paired data, we compute the differences for each pair
  
--

* We then rank the values in terms of absolute value, and then reapply any negative signs

* Our test statistic is the sum of the ranked signed values `$^{\dagger}$`
    
    *The sums of the positive and negative ranks should be similar if the null hypothesis is accurate*

`$^{\dagger}$` *In some software, the test statistic is the lesser of the absolute values of the sums of the ranked signed values - but results should be the same (reject/don't reject `$H_0$`)*
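
The ranking steps above can be sketched in Python. This version assumes two common conventions that software may handle differently: zero differences are dropped, and tied absolute values share their average rank. The data and function name are hypothetical.

```python
def signed_ranks(differences):
    """Signed ranks of non-zero differences, averaging ranks across ties."""
    nonzero = [d for d in differences if d != 0]   # zero differences are dropped
    ordered = sorted(abs(d) for d in nonzero)
    # Average rank for each distinct absolute value (ranks start at 1)
    rank_of = {}
    for value in set(ordered):
        positions = [i + 1 for i, v in enumerate(ordered) if v == value]
        rank_of[value] = sum(positions) / len(positions)
    return [rank_of[abs(d)] if d > 0 else -rank_of[abs(d)] for d in nonzero]

# Hypothetical paired differences
diffs = [-3, 1, -2, 4, -5]
ranks = signed_ranks(diffs)
w_plus = sum(r for r in ranks if r > 0)     # sum of the positive ranks
w_minus = -sum(r for r in ranks if r < 0)   # sum of the negative ranks, as a positive number
print(ranks, w_plus, w_minus)
```

Here the negative ranks sum to twice the positive ranks, which hints that the differences tend to be negative, though five observations are far too few to conclude anything.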

---

# Signed Ranks for WBV Example

The jamovi screenshot below highlights the steps involved in working out the signed ranks for our .seagreen_style[WBV example].

<img src="data:image/png;base64,#signed_rank_wiggle.jpg" width="800px" style="display: block; margin: auto;" />
 
* *Sit_Diff* has the difference values between the Vibration and Control paired measurements

* *Rank* shows the ordered ranks of the difference values in absolute terms

* *Sign* notes if a difference is negative

* *Signed Rank* shows the ordered, signed rank of the difference values

---

# Wilcoxon Signed-Rank Test - WBV Example

The .orangered_style[Wilcoxon Signed-Rank Test] is a non-parametric test for one sample and paired data.

* It compares the ranks of the positive and negative differences
  
--

`$$H_0: \text{Median difference is zero vs } H_1: \text{ Median difference is non-zero}$$`
<img src="data:image/png;base64,#wilcoxon_wiggle.jpg" width="800px" style="display: block; margin: auto;" />

For our .seagreen_style[WBV example], the .orangered_style[Wilcoxon Signed-Rank Test] results suggest the median difference in Stroop CWIT scores obtained was .orangered_style[statistically significantly non-zero] for individuals sitting in a WBV device compared to a normal chair, with `$W = 624$`, `$p = 0.032$` (two-tailed), `$n=60$`.
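
As a rough check on the reported `$p$`-value, a large-sample normal approximation can be applied to `$W$`. This sketch assumes no zero differences, no tie correction and no continuity correction, so it is an approximation and not necessarily the method jamovi uses.

```python
from math import sqrt
from statistics import NormalDist

n, w = 60, 624                               # values reported for the WBV example
expected = n * (n + 1) / 4                   # mean of W under H0
sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)    # SD of W under H0
z = (w - expected) / sd
p_two_tailed = 2 * NormalDist().cdf(-abs(z))
print(round(z, 3), round(p_two_tailed, 3))
```

Under these assumptions the approximate two-tailed `$p$`-value is close to the reported `$p = 0.032$`.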

---

# Ranks for independent groups

The ranking process is similar for independent groups.

We combine the data, rank it, separate it back into the groups, and then compute the sums of the ranks in each group.

* *After allowing for the group sizes, the group rank sums should be similar if the null hypothesis is accurate*
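
The combine, rank, separate and sum steps can be sketched as follows. Tied values share their average rank here, and the `$U$` value for each group is its rank sum minus the smallest rank sum that group could have. The data and function name are hypothetical.

```python
def rank_sums_and_u(group_1, group_2):
    """Rank the pooled data (average ranks for ties), then return rank sums and U values."""
    pooled = sorted(group_1 + group_2)
    rank_of = {}
    for value in set(pooled):
        positions = [i + 1 for i, v in enumerate(pooled) if v == value]
        rank_of[value] = sum(positions) / len(positions)
    r1 = sum(rank_of[v] for v in group_1)
    r2 = sum(rank_of[v] for v in group_2)
    n1, n2 = len(group_1), len(group_2)
    u1 = r1 - n1 * (n1 + 1) / 2   # subtract the minimum possible rank sum for group 1
    u2 = r2 - n2 * (n2 + 1) / 2   # subtract the minimum possible rank sum for group 2
    return r1, r2, u1, u2

# Hypothetical measurements for two independent groups
r1, r2, u1, u2 = rank_sums_and_u([1, 3, 5], [2, 4, 6, 8])
print(r1, r2, u1, u2)
```

The two `$U$` values always add up to `$n_1 \times n_2$`, so reporting either one carries the same information.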

---

# Mann-Whitney U Test

The .orangered_style[Mann-Whitney U Test] is a non-parametric test for two independent groups.

* It ranks the combined observations and tests whether the ranks in one group tend to be higher or lower than those in the other
  
--

`$$H_0: \text{ Pop. distributions are equal vs } H_1: \text{ Pop. distributions are not equal}$$`
<img src="data:image/png;base64,#mann-whitney_petrels.jpg" width="800px" style="display: block; margin: auto;" />

For our .seagreen_style[petrel example], the .orangered_style[Mann-Whitney U Test] results suggest the distribution of wing length measurements for NZ petrels is .orangered_style[statistically significantly different] to that of other petrels, with `$U = 2863.5$`, `$p < .001$` (two-tailed), `$n_1 = 69$`, `$n_2 = 128$`.
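
A large-sample normal approximation gives a rough check on the reported result. This sketch assumes no tie correction and no continuity correction, so it is an approximation and not necessarily the method jamovi uses.

```python
from math import sqrt
from statistics import NormalDist

n1, n2, u = 69, 128, 2863.5                  # values reported for the petrel example
expected = n1 * n2 / 2                       # mean of U under H0
sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)      # SD of U under H0
z = (u - expected) / sd
p_two_tailed = 2 * NormalDist().cdf(-abs(z))
print(round(z, 2), p_two_tailed < 0.001)
```

The observed `$U$` lies roughly four standard deviations below its expected value under `$H_0$`, consistent with the reported `$p < .001$`.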
---

# Summary

The type of `$t$`-test we select depends on our data.

* We use paired `$t$`-tests for paired data

* We use two sample (aka independent) `$t$`-tests for data with two independent groups

* We need to check the assumptions of our `$t$`-tests before we conduct our tests

* If the test assumptions fail, we have alternatives (Welch's `$t$`-test, Wilcoxon Signed-Rank Test, Mann-Whitney U Test)

---

# End

That concludes our lecture on `$t$`-tests.

### What to do next:

* .seagreen_style[Quick Kahoot revision quiz]: Please go to [kahoot.it](kahoot.it) and type in the code shown

* Make sure to attend this week's DA computer lab

* If you have any questions, check the LMS, email us or ask in the computer labs

### Optional Further Reading

* Parts from Kokoska (2020) Chapters 8, 9, 10, 14
---

# References

.reference_style[
* Arenales Arauz, Y. L., van der Zee, E. A., Kamsma, Y. P. T., & van Heuvelen, M. J. G. (2023). Short-term effects of side-alternating Whole-Body Vibration on cognitive function of young adults. *PloS One*, 18(1), e0280063–e0280063. [https://doi.org/10.1371/journal.pone.0280063](https://doi.org/10.1371/journal.pone.0280063)

* Cohen, J. (1988). *Statistical power analysis for the behavioral sciences* (2nd ed.). Academic Press.

* Cohen, J. (1992). A power primer. *Psychological Bulletin*, 112(1), 155.

* Fischer, J. H., Debski, I., Miskelly, C. M., Bost, C. A., Fromant, A., Tennyson, A. J. D., et al. (2018). Analyses of phenotypic differentiations among South Georgian Diving Petrel (*Pelecanoides georgicus*) populations reveal an undescribed and highly endangered species from New Zealand. *PLoS ONE*, 13(6), e0197766. [https://doi.org/10.1371/journal.pone.0197766](https://doi.org/10.1371/journal.pone.0197766)

* Kokoska, S. (2020). *Introductory statistics: A problem-solving approach* (3rd ed.). W. H. Freeman.

* The jamovi project. (2022). *jamovi* [Computer software]. [https://www.jamovi.org](https://www.jamovi.org)

* Zimmerman, D. W. (2004). A note on preliminary tests of equality of variances. *British Journal of Mathematical and Statistical Psychology*, 57, 173-181. [https://doi.org/10.1348/000711004849222](https://doi.org/10.1348/000711004849222)
]

---
class: middle

These notes have been prepared by Rupert Kuveke, Amanda Shaker, and other members of the Department of Mathematical and Physical Sciences. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with the Department of Environment and Genetics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License 
<a href = "https://creativecommons.org/licenses/by-nc-nd/4.0/" target="_blank"> BY-NC-ND. </a>