STM1001 Topic 6 Lecture

class: middle
background-image: url(data:image/png;base64,#LTU_logo.jpg)
background-position: top left
background-size: 30%

# STM1001 [Topic 6](https://amandashaker-stm1001-topic-6.share.connect.posit.cloud/) Lecture
## `$t$`-tests for two-sample hypothesis testing
### La Trobe University
This lecture complements the [Topic 6 readings](https://amandashaker-stm1001-topic-6.share.connect.posit.cloud/)

---

# Topic 6: Related Links

## Readings

[Topic 6 readings](https://amandashaker-stm1001-topic-6.share.connect.posit.cloud/)

## Notation

[Topics 5 and 6: Hypothesis testing and `$t$`-tests](https://amandashaker-stm1001-topic-0.share.connect.posit.cloud/notation-summary.html#topics-5-and-6-hypothesis-testing-and-t-tests)

---

# Topic 6: `$t$`-tests for two-sample hypothesis testing

**Overview**

---

# Today's Lecture

Having learnt about the one-sample `$t$`-test in the previous topic, today we will be learning about two more types of `$t$`-tests:

* The ***independent samples `$t$`-test*** (or two-sample `$t$`-test)
    
--

* The ***paired `$t$`-test*** (or dependent samples `$t$`-test)

We will learn about these tests via examples, including the following steps:

* Visualising the data
    
--

* Checking the assumptions
    
--

* Carrying out the test
    
--

To conclude, we will discuss ***effect sizes***, which help to determine the relative ***size*** of any differences found (as distinguished from ***statistical significance***).

Remember, if you need to refresh your understanding of any Maths concepts, or notation introduced recently, you can check the [Maths and Notation Summary Guide](https://amandashaker-stm1001-topic-0.share.connect.posit.cloud/index.html).

---

name: stat
class: middle
background-image: url(data:image/png;base64,#slide_1.png)
background-size: 110%

---

name: stat
class: middle
background-image: url(data:image/png;base64,#slide_9.png)
background-size: 100%

---

# Three types of `$t$`-tests

In the previous topic, we learnt about the ***one-sample `$t$`-test***. Today, we will cover two more types of `$t$`-tests.

The three versions of the `$t$`-test are:

1. The ***one-sample `$t$`-test*** is used when we have one group, and assess one measurement from each individual in the group. We compare results from the group (i.e. the sample mean) to a fixed reference value

2. The ***independent samples `$t$`-test*** (or two-sample `$t$`-test) is used when we have two independent groups, and assess one measurement from each individual in each group. We compare the means of the two groups to check whether they differ
    
--

3. The ***paired `$t$`-test*** (or dependent samples `$t$`-test) is used when we have taken two measurements of the same characteristic from each individual in a group, typically at two time points. We compare the two sets of observations
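
As a rough illustration of how the three versions differ in practice, here is a minimal sketch using R's `t.test()` function. The vectors `group_a` and `group_b` and the reference value of 480 are hypothetical, simulated purely for illustration. They are not data from this lecture.

```r
set.seed(1)
# Hypothetical simulated measurements (NOT data from the lecture)
group_a <- rnorm(30, mean = 450, sd = 100)  # one group, or a "before" measurement
group_b <- rnorm(30, mean = 470, sd = 100)  # a second group, or an "after" measurement

# 1. One-sample: compare one group's mean to a fixed reference value (here 480)
one_sample <- t.test(group_a, mu = 480)

# 2. Independent samples: compare the means of two separate groups
#    (var.equal = TRUE assumes the two groups have equal variances)
independent <- t.test(group_a, group_b, var.equal = TRUE)

# 3. Paired: compare two measurements taken on the same individuals
paired <- t.test(group_a, group_b, paired = TRUE)

# Degrees of freedom for each test
c(one_sample$parameter, independent$parameter, paired$parameter)
```

Note that the same two vectors give different degrees of freedom depending on whether they are treated as independent groups (30 + 30 - 2 = 58) or as paired measurements (30 - 1 = 29), which is one reason choosing the right version matters.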

---

# Students' Eye Colour vs Sleep Time Example

Suppose we made the following claim:

*We believe that on average, STM1001 students with brown eyes spend either more or less time sleeping than STM1001 students who do not have brown eyes*

Further suppose that a sample of STM1001 students have been asked:

* *In the past 24 hours, how many minutes did you spend sleeping?*, and

* *What is your eye colour?*

We can test this claim using a hypothesis test, just like we did with the sleep example in the previous lecture.

However, since we have two independent groups of individuals here (students with brown eyes and students who don't have brown eyes), we will need to use the ***independent samples `$t$`-test*** (aka the two-sample `$t$`-test).

---
# Eye Colour/Sleep Example Hypotheses

* First, we need to set up our hypotheses:

`$$H_0:\mu_1 = \mu_2\;\;\text{versus}\;\;H_1:\mu_1 \neq \mu_2,$$`
where:

* `$\mu_1$` denotes the population mean number of minutes STM1001 students with brown eyes spend sleeping per day

* `$\mu_2$` denotes the population mean number of minutes STM1001 students who do not have brown eyes spend sleeping per day

Note that if `$\mu_1 = \mu_2$`, this means that the difference between `$\mu_1$` and `$\mu_2$` is zero. So the above hypotheses could equivalently be written as:

`$$H_0:\mu_1 - \mu_2 = 0\;\;\text{versus}\;\;H_1:\mu_1 - \mu_2 \neq 0.$$`

---
# Independent samples `$t$`-test

What does it mean to have ***two independent groups***, which is what we need in order to carry out an independent samples `$t$`-test?

* One way of thinking of it would be that individuals can only be in one group or the other: not both.

* E.g. for this example, we assume a person belongs to the 'brown eyes' or 'other' group, but not both.

* This means the two groups are ***independent***, and appropriate for the independent samples `$t$`-test.

---
# Independent samples `$t$`-test

What type of variables are required for the independent samples `$t$`-test?

.content-box-blue[
.center[
An independent samples *t*-test will always involve two variables:
]
1. The ***dependent*** variable, sometimes also called the *response* variable. This should be a numeric, continuous variable.

2. The ***independent*** variable. This should be a categorical variable with only ***two categories***.
]

* So our ***dependent*** variable is minutes of sleep

* Our ***independent*** variable is eye colour
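
To make the two variable types concrete, here is a minimal sketch of how such data might be laid out in R. The column names `Minutes` and `Eye_colour` follow the R output shown later in these slides. The data frame name `sleep_df` and the six values in it are hypothetical, made up for illustration.

```r
set.seed(1)
# Simulated stand-in for the survey data (NOT the real class responses)
sleep_df <- data.frame(
  # Dependent variable: numeric
  Minutes = round(rnorm(6, mean = 450, sd = 120)),
  # Independent variable: categorical with two categories
  Eye_colour = factor(c("Brown", "Other", "Brown", "Brown", "Other", "Other"))
)

str(sleep_df)                  # one row per student, one column per variable
is.numeric(sleep_df$Minutes)   # TRUE: the dependent variable is numeric
nlevels(sleep_df$Eye_colour)   # 2: the independent variable has two categories
```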

---
# Assumptions: Independent samples `$t$`-test

The assumptions for the independent samples `$t$`-test are similar to those we saw for the one-sample `$t$`-test, with one addition: ***equal variances*** between groups, also known as ***homogeneity of variance***. In summary:

.content-box-blue[
.center[
**Independent samples *t*-test Assumptions:**
]
1. The data are numeric
2. Observations are independent of one another (that is, the sample is a simple random sample and each individual within the population has an equal chance of being selected)
3. The sample mean, `$\overline{X}$`, of each group is normally distributed
4. Equal variances between groups (aka homogeneity of variances).
]

* As a rule of thumb, if the standard deviation of one group is more than twice that of the other group, we consider the equal variances assumption to have been violated
* There is also a statistical test we can use to help us check the equal variances assumption. We will cover this in more detail shortly

---
# Visualising the data and checking assumptions

* Before carrying out a hypothesis test, it is always a good idea to look at some descriptive statistics and plots

* This gives us an idea of what to expect when we carry out the test, and also helps us check the assumptions

|Eye Colour |Sample size |Mean (minutes) |SD (minutes) |
|:----------|:-----------|:--------------|:------------|
|Brown      |51          |459.53         |154.37       |
|Other      |40          |441.88         |113.76       |
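
A table like this one could be produced along the following lines. This is only a sketch: the data frame `sleep_df` is simulated with group sizes matching the table, so it is not the real survey data and its means and SDs will not match the values above.

```r
set.seed(1)
# Simulated stand-in for the survey data (NOT the real class responses)
sleep_df <- data.frame(
  Minutes    = c(rnorm(51, mean = 460, sd = 154), rnorm(40, mean = 442, sd = 114)),
  Eye_colour = factor(rep(c("Brown", "Other"), times = c(51, 40)))
)

# Sample size, mean and SD of Minutes within each eye colour group
summary_table <- data.frame(
  Eye_colour  = levels(sleep_df$Eye_colour),
  Sample_size = as.vector(tapply(sleep_df$Minutes, sleep_df$Eye_colour, length)),
  Mean        = as.vector(tapply(sleep_df$Minutes, sleep_df$Eye_colour, mean)),
  SD          = as.vector(tapply(sleep_df$Minutes, sleep_df$Eye_colour, sd))
)
summary_table
```

For the visual comparison, side-by-side boxplots of the same data frame can be drawn with `boxplot(Minutes ~ Eye_colour, data = sleep_df)`.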
---
# Visualising data & checking assumptions

From the descriptive statistics and plots, we can observe the following:

1. The boxplots and sample means indicate that the average sleep looks similar between groups. When we carry out the `$t$`-test, we will see whether or not there is a ***statistically significant*** difference

2. From the boxplots, the data appear to be similarly spread out, with slightly more variation in the Brown group. The SDs are also fairly similar to each other (neither one is more than double the other). This indicates the equal variances assumption has (probably) not been violated.

3. The sample sizes in the Brown and Other groups are 51 and 40 respectively. This will be useful knowledge later when checking for normality.
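
The rule-of-thumb check in point 2 is a simple calculation on the two SDs reported in the summary table. A minimal sketch (the variable names are chosen here for illustration):

```r
# Standard deviations as reported in the summary table
sd_brown <- 154.37
sd_other <- 113.76

# Ratio of the larger SD to the smaller SD
sd_ratio <- max(sd_brown, sd_other) / min(sd_brown, sd_other)
round(sd_ratio, 2)  # 1.36

# Rule of thumb: a ratio above 2 would suggest unequal variances
sd_ratio > 2        # FALSE
```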

---
# Levene's test for equality of variances

To assess equality of variance between groups more formally, we can use ***Levene's test for equality of variances***.

Consider the following null and alternative hypotheses:

* `$H_0 : \text{The groups have equal variances}$`
* `$H_1 : \text{The groups do not have equal variances}.$`

Since we start out by assuming the groups have equal variances, the test tells us to only reject this assumption if we get a small `$p$`-value. That is, a small `$p$`-value indicates the groups do not have equal variances. To summarise:

.content-box-blue[
.center[
**Levene's test for equality of variances:**
]
* If `$p$` < 0.05, equal variances cannot be assumed
* If `$p$` > 0.05, equal variances can be assumed
]

---
# Levene's test for equality of variances

Let's carry out Levene's test for the sleep / eye colour data:

```
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  1  0.6788 0.4122
      89               
```

As we can see, we have `$p = 0.4122$`. Since `$p > 0.05$`, equal variances can be assumed. Given our observations from the boxplots and standard deviations, this is not a surprising result.
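
For those curious about what the test is doing, the median-centred version of Levene's test can be sketched in base R: take each observation's absolute deviation from its own group's median, then run a one-way ANOVA on those deviations. The output format above matches `leveneTest()` from the `car` package, but that is an inference from the output rather than something stated in these slides. The data frame `sleep_df` below is simulated, so its results will differ from the output shown.

```r
set.seed(1)
# Simulated stand-in for the survey data (NOT the real class responses)
sleep_df <- data.frame(
  Minutes    = c(rnorm(51, mean = 460, sd = 154), rnorm(40, mean = 442, sd = 114)),
  Eye_colour = factor(rep(c("Brown", "Other"), times = c(51, 40)))
)

# Step 1: absolute deviation of each observation from its own group's median
group_median <- ave(sleep_df$Minutes, sleep_df$Eye_colour, FUN = median)
sleep_df$abs_dev <- abs(sleep_df$Minutes - group_median)

# Step 2: one-way ANOVA on those deviations; its F test is Levene's test
levene_fit <- anova(lm(abs_dev ~ Eye_colour, data = sleep_df))
levene_fit

# Decision rule: p < 0.05 means equal variances cannot be assumed
levene_p <- levene_fit[["Pr(>F)"]][1]
levene_p < 0.05
```

The degrees of freedom are 1 and 89, as in the output above, because they depend only on the number of groups (2) and the total sample size (91).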

#### What happens if the equal variances assumption is violated?

* There are two versions of the independent samples `$t$`-test: one that assumes equal variances (the pooled-variance `$t$`-test), and one that does not (Welch's `$t$`-test)
    * If equal variances can be assumed, we use the version of the `$t$`-test that assumes equal variances
    * If equal variances cannot be assumed, we use the version of the `$t$`-test that does NOT assume equal variances.

We will have a chance to practise this in the computer lab.
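
As a preview, here is a minimal sketch of how the two versions can be requested in R through the `var.equal` argument of `t.test()`. The data frame `sleep_df` is simulated for illustration, so the results are not those of the real survey data.

```r
set.seed(1)
# Simulated stand-in for the survey data (NOT the real class responses)
sleep_df <- data.frame(
  Minutes    = c(rnorm(51, mean = 460, sd = 154), rnorm(40, mean = 442, sd = 114)),
  Eye_colour = factor(rep(c("Brown", "Other"), times = c(51, 40)))
)

# Equal variances assumed
pooled <- t.test(Minutes ~ Eye_colour, data = sleep_df, var.equal = TRUE)

# Equal variances NOT assumed (Welch's t-test; this is the default in t.test())
welch <- t.test(Minutes ~ Eye_colour, data = sleep_df, var.equal = FALSE)

pooled$parameter  # df = 51 + 40 - 2 = 89
welch$parameter   # df is adjusted downwards and is usually not a whole number
```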

---
# Checking for normality

* Recall that we can consider histograms, Normal Q-Q plots, and the Shapiro-Wilk test to check for normality

* For the independent samples `$t$`-test, this needs to be done for both groups individually
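
One way to run the checks for each group separately is to split the sleep times by eye colour first. The sketch below uses a simulated data frame, `sleep_df`, so the results are not those of the real survey data.

```r
set.seed(1)
# Simulated stand-in for the survey data (NOT the real class responses)
sleep_df <- data.frame(
  Minutes    = c(rnorm(51, mean = 460, sd = 154), rnorm(40, mean = 442, sd = 114)),
  Eye_colour = factor(rep(c("Brown", "Other"), times = c(51, 40)))
)

# Split the sleep times into one vector per eye colour group
minutes_by_group <- split(sleep_df$Minutes, sleep_df$Eye_colour)

# Run the Shapiro-Wilk test separately on each group
shapiro_results <- lapply(minutes_by_group, shapiro.test)

# Extract the p-value for each group
shapiro_p <- sapply(shapiro_results, function(res) res$p.value)
shapiro_p
```

The same per-group vectors can be passed to `hist()`, or to `qqnorm()` followed by `qqline()`, to draw the histogram and Normal Q-Q plot for each group.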

---

Based on the histograms and Normal Q-Q plots, do you have any concerns regarding normality for either group?

---
# Checking for normality

Given there is some doubt following inspection of the histograms and Normal Q-Q plots, the Shapiro-Wilk test results can provide further guidance:

**Shapiro-Wilk test for Brown eyes group:**

```

Shapiro-Wilk normality test

data:  sleep$Minutes[sleep$Eye_colour == "Brown"]
W = 0.89414, p-value = 0.0002696
```

**Shapiro-Wilk test for the Other (not brown eyes) group:**

```

Shapiro-Wilk normality test

data:  sleep$Minutes[sleep$Eye_colour == "Other"]
W = 0.9426, p-value = 0.04234
```

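Given we have `$p < 0.001$` and `$p = 0.0423$` for the Brown and Other groups respectively, both below 0.05, the test suggests the sleep times in each group are not normally distributed. So it appears the normality assumption has been violated.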

However...

---
# Checking for normality

* Recall that the sample sizes for the Brown and Other groups are 51 and 40 respectively.

* Also recall that the underlying distribution does not have to be normally distributed to satisfy the assumption - it is the sample mean that should be normally distributed

* Given `$n > 30$` for both groups, we can therefore apply the Central Limit Theorem: the sample mean of each group will be approximately normally distributed, so we conclude that the **normality assumption has been met**
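
To see why this reasoning works, here is a small simulation sketch. It repeatedly draws samples of size 40 from a strongly right-skewed population and looks at how the sample means behave. The exponential population and the number of repeats are arbitrary choices made for illustration. They are not related to the sleep data.

```r
set.seed(1)
n_per_sample <- 40    # same size as the smaller group in our example
n_repeats    <- 5000  # number of simulated samples

# Draw repeated samples from a right-skewed (exponential) population
# with mean 1 and SD 1, and record the mean of each sample
sample_means <- replicate(n_repeats, mean(rexp(n_per_sample, rate = 1)))

mean(sample_means)  # close to the population mean, 1
sd(sample_means)    # close to 1 / sqrt(40), which is about 0.158

# Proportion of sample means within 1.96 standard errors of the population mean.
# For a normal distribution this would be 0.95.
coverage <- mean(abs(sample_means - 1) < 1.96 / sqrt(n_per_sample))
coverage
```

Even though the individual observations are far from normal, the proportion comes out close to 0.95, and a call to `hist(sample_means)` shows a roughly bell-shaped distribution.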

We are now ready to carry out the independent samples `$t$`-test.