---
title: "Non-Parametrics"
author: "J Sigma"
editor: source
format:
  html:
    css: styles.css
    toc: true
    toc-depth: 3
    number-sections: false
    theme: cosmo
    code-fold: true
    code-tools: true
    smooth-scroll: true
    embed-resources: true
    page-navigation: true
execute:
  engine: knitr
  echo: true
  warning: false
  message: false
---
# 1. Introduction
## 1.1. Parametric Techniques
The statistical techniques we have used thus far are **parametric techniques**.
::: {.callout-important title="Parametric Techniques"}
**Parametric Techniques** are statistical techniques which
- Assume that data follows a particular distribution
- Often justify an assumption of approximate normality for large sample sizes via the **central limit theorem**
- Typically rely on quantitative data
- Via the assumption of distribution, sampling distributions of the test statistics are derived, and inferences are made about the unknown population parameters of the particular distribution
- Heavy focus on parameters
:::
::: {.callout-warning title="Example" icon="false"}
We may attempt to find the mean height of first-year students at *UCT*, where the standard deviation of the heights is also unknown.
By collecting a random sample, we then find the sample mean and sample standard deviation, and assume that the resulting test statistic follows a $t$-distribution. We can then calculate a test statistic for the mean height and use inference to decide whether it is significant at some $\alpha$ level.
This heavily depends on an assumption of normality.
:::
## 1.2. Non-Parametric Techniques
::: {.callout-important title="Non-Parametric Techniques"}
**Non-parametric techniques** are statistical techniques which are valid for a wide variety of underlying distributions, because they
- Only make weak assumptions about the distribution of the data
- Do not depend on parameters specific to a particular distribution
:::
As such, we use non-parametric techniques when
- we have non-normal quantitative data, or the distribution of the data is uncertain
- we have small samples
- we have qualitative data
::: callout-note
Non-parametric techniques can also be used when data is normally distributed; they are not limited to non-normal data.
However, if we know the underlying distribution of the data, it is better to use parametric techniques, since this gives us more **power**, i.e., a higher probability of correctly rejecting a false null hypothesis.
So, non-parametric techniques are always valid, but they are sometimes not the optimal choice for power.
:::
## 1.3. Data Types
We differentiate between **qualitative** and **quantitative data**
### 1.3.1. Qualitative (or Categorical) Data
::: {.callout-important title="Definition (Qualitative Data)"}
**Qualitative data** refers to data that represents categories, labels, or levels of a factor. If the levels are numbered, the numbers have no arithmetic meaning.
:::
Examples of categorical data may include gender, nationality, blood type, colours, and many more. The values of the categories here describe what something is, and not how much of it there is.
We further divide qualitative data into **nominal** and **ordinal** **data**
::: {.callout-warning title="Nominal vs Ordinal Data" icon="false"}
**Nominal data** refers to data which has categories that can be listed without any particular order, and this doesn't change the meaning of the data. For example, we may measure the number of people belonging to blood groups A, B, AB, and O. Here, there is no notion of any blood group having a higher value than any other.
On the other hand, **ordinal data** refers to categorical data that has a clear order structure. For example, we may consider looking at the year level of undergraduate students in University. We will have categorical groups, 1st year, 2nd year, 3rd year, and so on, but there is a clear order to these groups.
:::
### 1.3.2. Quantitative Data
::: {.callout-important title="Definition (Quantitative Data)"}
**Quantitative data** represents measurable quantities, where numerical values have meaningful arithmetic interpretation.
:::
Examples of quantitative data include height, income, time, and the number of lectures attended in a course. These are not just labels; they carry a meaningful magnitude.
Similar to qualitative data, we further differentiate between **interval** and **ratio-scaled quantitative data**.
::: {.callout-warning title="Inverval vs Ratio-Scaled Data" icon="false"}
**Ratio-scaled quantitative data** refers to quantitative data where the $0$ value has true meaning, in that it refers to an absence of the quantity being measured. Examples include height, weight, duration, and temperature when measured in Kelvin ($0$ K indicates an absence of thermal energy).
Ratios between values are also meaningful for this type of data. We may say things like: $20$ kilograms is twice as heavy as $10$ kilograms.
**Interval quantitative data** refers to quantitative data where the zero value has no physical or real meaning; the $0$ value does not indicate an absence of the quantity being measured. Examples include IQ score (where a score of $0$ does not indicate an absence of intelligence), time of day, and temperature when measured in degrees Celsius or Fahrenheit ($0^{\circ}$C or $0^{\circ}$F do not indicate an absence of temperature).
Ratios between values are not meaningful here. For example, $20^{\circ}$C is not twice as hot as $10^{\circ}$C. To show this, we can convert both to Kelvin (since it has a meaningful zero) and compute the relative change:
$10^{\circ}$C $=283$ K and $20^{\circ}$C $=293$ K. So, the relative change is
$$
\frac{293-283}{283}=0.0353\ldots\approx0.035
$$
This shows that $20^{\circ}$C is only about $3.5\%$ hotter than $10^{\circ}$C.
:::
## 1.4. Overview of Non-Parametric Tests
### 1.4.1. Single Population Tests
+-------------------------------------+---------------+--------------------------+
| Test | Data Type | Data |
+=====================================+===============+==========================+
| **Tests for Randomness of Order** | Nominal | Independent Observations |
+-------------------------------------+---------------+--------------------------+
| **Chi-Square Goodness of Fit Test** | Nominal | Independent Observations |
+-------------------------------------+---------------+--------------------------+
### 1.4.2. Two Population Tests
+---------------------------------------------+-----------------------------------------+------------------------+----------------------------+
| Tests for Equality of Medians | Data Type | Data | Parametric Test Equivalent |
+=============================================+=========================================+========================+============================+
| **Wilcoxon Rank Sum (Mann-Whitney U) Test** | Ordinal or non-normal Quantitative Data | Independent samples | $t$-test |
+---------------------------------------------+-----------------------------------------+------------------------+----------------------------+
| **Wilcoxon Signed Rank Sum Test** | Non-normal quantitative data | Matched/Paired samples | matched pairs $t$-test |
+---------------------------------------------+-----------------------------------------+------------------------+----------------------------+
| **Sign Test** | Ordinal data | Matched/Paired samples | matched pairs $t$-test |
+---------------------------------------------+-----------------------------------------+------------------------+----------------------------+
### 1.4.3. Three or More Population Tests
+-------------------------+-----------------------------------------+-------------------------+------------------------------------+
| Test | Data Type | Data | Parametric Test Equivalent |
+=========================+=========================================+=========================+====================================+
| **Kruskal-Wallis Test** | Ordinal or non-normal quantitative data | Independent samples | One-Way ANOVA |
+-------------------------+-----------------------------------------+-------------------------+------------------------------------+
| **Friedman Test** | Ordinal or non-normal quantitative data | Matched/Blocked samples | Two-Way ANOVA without interactions |
+-------------------------+-----------------------------------------+-------------------------+------------------------------------+
### 1.4.4. For Relationship Between Two Variables
+--------------------------------------+-----------------------+---------------------+----------------------------+
| Test | Data Type | Data | Parametric Test Equivalent |
+======================================+=======================+=====================+============================+
| **Spearman's Rank Correlation Test** | Ordinal or non-normal | Paired observations | Pearson's correlation |
+--------------------------------------+-----------------------+---------------------+----------------------------+
## 1.5. Ranking Data
Since non-parametric techniques rely on the ranks of observations instead of their actual numerical values, we need to understand how to rank data. We take the following steps:
1. We begin by sorting the data in some order (usually ascending)
2. We assign ranks by identifying the relative position of each value in the ordered data
3. We look out for **ties**
If there are no ties in data values, we assign the relative position to the data values. However, if there are ties, we assign an average rank to the tied data values
::: {.callout-warning title="Example One" icon="false"}
Suppose we are given the data values $4,9,6,7,5,2,8$. Then, we would assign ranks as follows:
| Data | 4 | 9 | 6 | 7 | 5 | 2 | 8 |
|:---------------------:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| **Ordered** | 2 | 4 | 5 | 6 | 7 | 8 | 9 |
| **Relative Position** | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
There are no ties in this data, and so we rank via the relative positions.
:::
::: {.callout-warning title="Example Two" icon="false"}
| | | | | | | | | | | | | | |
|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| **Data** | **29** | **18** | **29** | **19** | **20** | **21** | **20** | **33** | **30** | **23** | **33** | **33** | **24** |
| **Ordered** | 18 | 19 | 20 | 20 | 21 | 23 | 24 | 29 | 29 | 30 | 33 | 33 | 33 |
| **Relative Position** | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
| **Ranks** | **1** | **2** | **3.5** | **3.5** | **5** | **6** | **7** | **8.5** | **8.5** | **10** | **12** | **12** | **12** |
We have three sets of tied values in this data:
- $20$ and $20$ $\implies$ $\displaystyle{\frac{3+4}{2}}=3.5$
- $29$ and $29$ $\implies$ $\displaystyle{\frac{8+9}{2}}=8.5$
- $33$, $33$, and $33$ $\implies$ $\displaystyle{\frac{11+12+13}{3}}=12$
:::
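In R, the built-in `rank()` function reproduces this scheme: its default `ties.method = "average"` assigns tied values the average of their relative positions. A quick check against Example Two:
```{r}
#########################
# RANKING DATA IN R
#########################
x <- c(29, 18, 29, 19, 20, 21, 20, 33, 30, 23, 33, 33, 24)

# Average ranks are assigned to ties by default;
# ranks are returned in the original data order
rank(x, ties.method = "average")
```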
# 2. Wilcoxon Signed Rank Sum Test
::: {.callout-important title="Key Idea"}
A **Wilcoxon Signed Rank Sum Test** is used for comparing two matched, dependent samples of quantitative data (interval or ratio) with respect to central location.
It tests whether these two samples come from the same population.
:::
Recall that, in the parametric setting, we could assume the data was centred around the mean. So, we would perform a **paired** $t$**-test**, taking the mean of the paired differences and testing whether it differed significantly from zero.
Since we now make only weak assumptions about the distribution of the data, we use the median as the measure of the central location of the differences. This means that we compare the medians of the two samples to see if there is any significant difference between them.
## 2.1. Hypotheses
The null hypothesis, $H_{0}$, is always an assumption of no significant difference between the medians of the two groups. The alternative hypothesis can be one-sided or two-sided. So,
$$
H_{0}: \text{median of differences}=0 \text{ (i.e., no difference between samples)}
$$
$$
\text{and one of}
$$
$$
H_{1}:\text{median of differences} \neq 0 \text{ (i.e., there is a difference between samples)}
$$
$$
H_{1}: \text{median of differences}>0 \text{ (i.e., sample one has higher values than sample two)}
$$
$$
H_{1}: \text{median of differences}<0 \text{ (i.e., sample two has higher values than sample one)}
$$
::: callout-tip
Always state the null and alternative hypotheses in a way that references the information given in the context of the question. Make sure that your hypotheses are one-sided or two-sided based on the context, and that they are not generic. If the question references the finishing times of racers in a race, for example, then this should be evident in your hypotheses.
:::
## 2.2. Data and Assumptions
1. Two paired samples
2. Quantitative data (interval or ratio)
3. Under the assumption of $H_{0}$, the paired differences are symmetric around the median
4. The $n$ paired differences are independent and random
## 2.3. Calculating the Test Statistic
::: {.callout-important title="Test Statistic (Wilcoxon Signed Rank Sum Test)" icon="false"}
We take the following steps:
1. Begin by calculating the difference for each pair
2. Exclude the pairs with a difference of $0$
3. Record $n$, the number of non-zero differences
4. Record the sign of each paired difference
5. Rank the absolute values of the differences
6. The test statistic is then given by
$$
W=\text{Sum of the Signed Ranks}
$$
:::
***Question: Why does this work?***
*Answer: Under the assumption that* $H_{0}$ *is true, the differences are distributed symmetrically around zero. If we rank the absolute differences, then each rank is equally likely to carry a* $+$ or $-$ *sign.*
*Roughly, the positive and negative ranks will cancel out, and so we obtain that*
$$
W\approx0
$$
*This is how we answer the question of whether the positive and negative differences balance symmetrically around* $0$ *(i.e., whether* $\text{median of differences}=0$*).*
***Question: Why, though, are we opposed to taking the numerical differences and seeing if there is a true difference between the two samples?***
*Answer: That is exactly what a paired* $t$*-test does. However, it assumes that the differences come from a normal distribution. We want a test that captures the same notion without assuming that the data follows a normal distribution.*
::: {.callout-warning title="Worked Example Part I: (The Placebo)" icon="false"}
Before studying, a group of $6$ students are told that they are trying a new drink that supposedly improves concentration.
In reality, the drink is just flavoured water.
Each student writes a short test:
- Before drinking it; and
- After drinking it
and their test scores are recorded.\
+--------+--------+-------+
| Person | Before | After |
+:======:+:======:+:=====:+
| **1** | 80.0 | 78.6 |
+--------+--------+-------+
| **2** | 73.5 | 76.0 |
+--------+--------+-------+
| **3** | 85.0 | 81.2 |
+--------+--------+-------+
| **4** | 69.0 | 74.1 |
+--------+--------+-------+
| **5** | 77.8 | 78.5 |
+--------+--------+-------+
| **6** | 90.2 | 86.0 |
+--------+--------+-------+
Is there consistent evidence that the drink improved performance at all?\
\
**Note: For now, we are only focused on how we obtain the test statistic, and not on how we would conduct the whole hypothesis test.**
+--------+--------+--------+---------------+------------------------+---------+------+------+--------------+
| Person | Before | After | $d_{i}$ | $|d_{i}|$ | Ordered | Rank | Sign | Signed Ranks |
| | | | | | | | | |
| | | | (Differences) | (Absolute Differences) | | | | |
+:======:+:======:+:======:+:=============:+:======================:+:=======:+:====:+:====:+:============:+
| **1** | $80.0$ | $78.6$ | $+1.4$ | $1.4$ | $0.7$ | $1$ | $-$ | $-1$ |
+--------+--------+--------+---------------+------------------------+---------+------+------+--------------+
| **2** | $73.5$ | $76.0$ | $-2.5$ | $2.5$ | $1.4$ | $2$ | $+$ | $+2$ |
+--------+--------+--------+---------------+------------------------+---------+------+------+--------------+
| **3** | $85.0$ | $81.2$ | $+3.8$ | $3.8$ | $2.5$ | $3$ | $-$ | $-3$ |
+--------+--------+--------+---------------+------------------------+---------+------+------+--------------+
| **4** | $69.0$ | $74.1$ | $-5.1$ | $5.1$ | $3.8$ | $4$ | $+$ | $+4$ |
+--------+--------+--------+---------------+------------------------+---------+------+------+--------------+
| **5** | $77.8$ | $78.5$ | $-0.7$ | $0.7$ | $4.2$ | $5$ | $+$ | $+5$ |
+--------+--------+--------+---------------+------------------------+---------+------+------+--------------+
| **6** | $90.2$ | $86.0$ | $+4.2$ | $4.2$ | $5.1$ | $6$ | $-$ | $-6$ |
+--------+--------+--------+---------------+------------------------+---------+------+------+--------------+
**Note: The ordered absolute differences are sorted, so they are not necessarily associated with the person in the same row. When finding the signed ranks, you need to match each ordered value back to its original difference so that the signs are correct.**
We find the test statistic as
$$
W=-1+2-3+4+5-6=+1
$$
We see, here, that the $+$ and $-$ signs are scattered across the small and large ranks, and the result is that $W=1$. This is quite close to zero. So, there is no significant evidence against the null hypothesis of no difference in performance. We may conclude that the placebo did not really work.
:::
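As a sanity check, the signed-rank sum above can be reproduced in a few lines of R (the vectors below are just the Part I data re-entered):
```{r}
###################################
# VERIFYING W FOR PART I
###################################
before <- c(80.0, 73.5, 85.0, 69.0, 77.8, 90.2)
after  <- c(78.6, 76.0, 81.2, 74.1, 78.5, 86.0)

d <- before - after                # paired differences
d <- d[d != 0]                     # exclude zero differences
W <- sum(sign(d) * rank(abs(d)))   # sum of the signed ranks
W
```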
::: {.callout-warning title="Worked Example Part II: (The Placebo)" icon="false"}
A second group of students takes the same focus booster drink before writing a similar test. Again, the scores are recorded:
- Before drinking; and
- After drinking
+--------+--------+--------+---------+-----------+-------+------+------+--------------+
| Person | Before | After | $d_{i}$ | $|d_{i}|$ | Order | Rank | Sign | Signed Ranks |
+:======:+:======:+:======:+:=======:+:=========:+:=====:+:====:+:====:+:============:+
| **1** | $79.3$ | $82.5$ | $-3.2$ | $3.2$ | $1.1$ | $1$ | $+$ | $+1$ |
+--------+--------+--------+---------+-----------+-------+------+------+--------------+
| **2** | $69.1$ | $68.0$ | $1.1$ | $1.1$ | $2$ | $2$ | $-$ | $-2$ |
+--------+--------+--------+---------+-----------+-------+------+------+--------------+
| **3** | $85.4$ | $91.2$ | $-5.8$ | $5.8$ | $3.2$ | $3$ | $-$ | $-3$ |
+--------+--------+--------+---------+-----------+-------+------+------+--------------+
| **4** | $73.0$ | $75.0$ | $-2$ | $2$ | $4.5$ | $4$ | $-$ | $-4$ |
+--------+--------+--------+---------+-----------+-------+------+------+--------------+
| **5** | $83.8$ | $88.3$ | $-4.5$ | $4.5$ | $5.8$ | $5$ | $-$ | $-5$ |
+--------+--------+--------+---------+-----------+-------+------+------+--------------+
| **6** | $62.8$ | $70.1$ | $-7.3$ | $7.3$ | $7.3$ | $6$ | $-$ | $-6$ |
+--------+--------+--------+---------+-----------+-------+------+------+--------------+
And we find that our test statistic is
$$
W=1-2-3-4-5-6=-19
$$
This differs vastly from zero, and tells us that the test scores after drinking the focus booster are larger than the scores before drinking it.
:::
## 2.4. So, What is $W$ Really Measuring?
$W$ tries to determine whether the signs are distributed randomly across the ranks, or whether there is a pattern/skew towards one particular side. We have the following:
- $W\approx 0 \implies \text{no evidence of a pattern; signs are random}$
- $W\gg0 \implies \text{first sample (before) tends to be larger}$
- $W\ll0 \implies \text{second sample (after) tends to be larger}$
So, in the test for a significant difference, we ask how extreme $W$ is under $H_{0}$, i.e., how different $W$ is from zero.
## 2.5. Sampling Distribution of $W$
::: callout-note
The **sampling distribution** is the distribution of a statistic (in this case, $W$) over all the possible samples (or outcomes) under a given assumption.
For $W$, the sampling distribution is given by all the possible ways of assigning $+$ or $-$ signs to the ranks since we assume, under the null hypothesis, that the signs of the differences are random, i.e., $\text{median of differences}=0$
:::
Under $H_{0}$, the signs of the ranked differences are random, so each rank is equally likely to be positive or negative. It turns out that, for small sample sizes with no ties, it is possible to deduce the properties of the sampling distribution through simple enumeration of all possibilities.
::: {.callout-warning title="Example"}
Suppose that a data set has $3$ values. Then, we will get $3$ ranks from this data set. Let these ranks be $1,2, \text{ and }3$.
Each rank can take on a $+$ or $-$ sign. So, the total number of combinations of $+$ and $-$ signs is going to be given by
$$
2^{3}=8
$$
The following table shows how we can get this:
| 1 | 2 | 3 | W |
|:---:|:---:|:---:|:----------:|
| $+$ | $+$ | $+$ | $1+2+3=6$ |
| $-$ | $+$ | $+$ | $-1+2+3=4$ |
| $+$ | $-$ | $+$ | $+2$ |
| $+$ | $+$ | $-$ | $0$ |
| $-$ | $-$ | $+$ | $0$ |
| $-$ | $+$ | $-$ | $-2$ |
| $+$ | $-$ | $-$ | $-4$ |
| $-$ | $-$ | $-$ | $-6$ |
We can then take the proportion of each of the values for $W$ across the whole group to get the sampling distribution of $W$. We get the following:
```{r}
#################################
# SAMPLING DISTRIBUTION OF W
#################################
W <- c(-6, -4, -2, 0, 2, 4, 6)
prob <- c(1, 1, 1, 2, 1, 1, 1) / 8

barplot(prob,
        names.arg = W,
        xlab = "W",
        ylab = "Proportion",
        main = "Proportion of Signed Differences")
```
:::
You can see how this can get out of hand for larger sample sizes: $9$ ranks, say, already lead to $2^{9}=512$ sign combinations for $W$. In this case, it is useful to use R. Here is an example of code that performs this enumeration:
```{r}
##############################################
# SAMPLING DISTRIBUTION FOR W WITH 9 RANKS
##############################################
# Function to compute the sampling distribution of W
wilcoxon_W_dist <- function(n) {
  ranks <- 1:n
  # Generate all 2^n combinations of +1/-1 signs
  signs <- expand.grid(rep(list(c(-1, 1)), n))
  # Compute W = sum of the signed ranks
  W <- apply(signs, 1, function(s) sum(ranks * s))
  # Convert to a probability distribution
  dist <- table(W) / length(W)
  return(dist)
}

# Case n = 9
dist9 <- wilcoxon_W_dist(9)

# Plotting the distribution
barplot(dist9,
        xlab = "W",
        ylab = "Proportion",
        main = "Sampling Distribution of W (n = 9)")
```
Notice that as the number of ranks $n$ increases, the sampling distribution becomes more symmetric and smooth. This suggests that for larger values of $n$, the sampling distribution of $W$ resembles a normal distribution.
In fact, for large sample sizes ($n>10$), the sampling distribution of $W$ can be approximated by a normal distribution with
- a mean of $\mu_{W}=0$; and
- a standard deviation of $\sigma_{W}=\displaystyle{\sqrt{\frac{n(n+1)(2n+1)}{6}}}$
For this test, we can have either a
(a) **two-sided test**, and we reject $H_{0}$ if $|z|>z_{\frac{\alpha}{2}}$; or
(b) a **one-sided test**, and we reject $H_{0}$ if $z>z_{\alpha} \text{ (right-tailed)}$ or $z<-z_{\alpha} \text{ (left-tailed)}$
We could also use a $p$-value approach whereby we find the $p$-value corresponding to the calculated test statistic. In this case, we reject $H_{0}$ if $p<\alpha$, for a given significance level $\alpha$.
::: {.callout-warning title="Worked Example (A case of larger values of n)" icon="false"}
In the following, we are trying to answer the question of whether a **"flexi-time"** work schedule helps to reduce the travel time of workers.
***Note: Your brain should immediately be notifying you that this will be a one-sided test since we are looking for a reduction in the variable of interest***
A random sample of $32$ workers was selected, and workers recorded their time in minutes before and after the program was implemented.
Using the **modified** $p$-**value approach**, test at the $5\%$ significance level. The full table of signed ranks is computed below.
```{r}
#########################
# DATA FOR FLEXI-TIME
#########################
data <- data.frame(
  Worker = 1:32,
  normal_arrival = c(34, 35, 43, 46, 16, 26, 68, 38, 61, 52, 68, 13, 69, 18, 53, 18,
                     41, 25, 17, 26, 44, 30, 19, 48, 29, 24, 51, 40, 26, 20, 19, 42),
  Flextime = c(31, 31, 44, 44, 15, 28, 63, 39, 63, 54, 65, 12, 71, 13, 55, 19,
               38, 23, 14, 21, 40, 33, 18, 51, 33, 21, 50, 38, 22, 19, 21, 38)
)

library(dplyr)
library(gt)

data %>%
  mutate(
    difference = normal_arrival - Flextime,
    abs_difference = abs(difference)
  ) %>%
  filter(difference != 0) %>%
  mutate(
    rank = rank(abs_difference, ties.method = "average"),
    signed_rank = rank * sign(difference)
  ) %>%
  gt()
```
We have the following hypotheses:
$H_{0}: \text{There is no difference in travel time between the normal and flexi-time work programs}$
$H_{1}: \text{Workers take longer to travel to work under the normal schedule than under flexi-time}$
and we are given a significance level of $\alpha=0.05$
Here, $n=\text{number of non-zero differences}=32>10$. So, the sampling distribution of $W$ is approximately normal. We calculate the test statistic as
$$
W=\sum_{i=1}^{32} \text{rank}(|d_{i}|)\cdot \text{sgn}(d_{i})=207
$$
We can then calculate the $z$-score associated with this test statistic as
$$
z=\frac{W-\mu_{W}}{\sigma_{W}}=\frac{207-0}{\sqrt{\frac{(32)(33)(65)}{6}}}=1.935
$$
***Note: Under the assumption that*** $H_{0}$ ***is true, we have that*** $\mu_{W}=0$
We can then find the $p$-value of the test statistic. This is a right-tailed test: we are looking for a reduction in travel time under flexi-time, so under $H_{1}$ we expect $d=\text{normal}-\text{flexi}>0$. Using this understanding, we can calculate the $p$-value using R
```{r}
#############################
# FINDING P-VALUE
#############################
p <- pnorm(1.935, lower.tail=F)
p
```
**Conclusion:**
Since the $p$-value is less than $0.05$, we reject the null hypothesis. We then conclude that there is significant evidence that workers take longer to travel in the normal work-hour program than they do with a Flexi-time schedule. The median difference is greater than zero.
:::
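For reference, R's built-in `wilcox.test()` carries out this test directly on the data frame created above. Note that R reports $V$, the sum of the *positive* ranks only, rather than the signed-rank sum $W$ used in these notes, so the statistic looks different even though the conclusion agrees; with ties present, R also falls back to the normal approximation automatically.
```{r}
#######################################
# BUILT-IN WILCOXON SIGNED RANK TEST
#######################################
wilcox.test(data$normal_arrival, data$Flextime,
            paired = TRUE, alternative = "greater")
```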
# 3. Mann-Whitney-U Test
::: {.callout-important title="Key Idea"}
The **Mann-Whitney-U Test** (or **U Test, Wilcoxon Rank Sum Test**, or just **Rank Sum Test**) is used to determine whether two independent samples of ordinal or quantitative data have the same central location (median).
:::
This test is the equivalent of the two-sample $t$-test for independent samples of normal data.
## 3.1. Data and Assumptions
1. We have two random samples of size $n_{1}$ and $n_{2}$
2. The data are either ordinal or quantitative, but not normal
3. Samples and observations within samples are independent
4. The distributions of the two populations differ with respect to location only (if they differ at all)
## 3.2. Hypothesis Testing for Mann-Whitney-U Tests
### 3.2.1. Hypotheses
We differentiate between one-sided and two-sided hypotheses. So:
**For a two-sided test:**
$$
H_{0}:\text{the two population medians are the same}
$$
$$
\text{and}
$$
$$
H_{1}: \text{the two population medians are different}
$$
**For a one-sided test:**
$$
H_{0}:\text{the two population medians are the same}
$$
$$
\text{and either}
$$
$$
H_{1}: \text{the location of the first population is to the right of that of the second population}
$$
$$
\text{or}
$$
$$
H_{1}:\text{the location of the first population is to the left of that of the second population}
$$
### 3.2.2. Calculating the Test Statistic
The test statistic here depends on $n_{1}$ and $n_{2}$. We find it in the following way:
1. Combine the two samples into a single set of values
2. Rank all observations from the smallest to largest, i.e., from $1$ to $n_{1}+n_{2}$
3. Calculate the sum of the ranks, $T_{1}=\text{sum of ranks for } n_{1}$ and $T_{2}=\text{sum of ranks for } n_{2}$
4. We calculate two statistics:
$$
U_{1}=T_{1}-\frac{n_{1}(n_{1}+1)}{2}
$$
$$
\text{and}
$$
$$
U_{2}=T_{2}-\frac{n_{2}(n_{2}+1)}{2}
$$
<!-- -->
5. The final test statistic is given by
$$
U=\text{min}(U_{1}, U_{2})
$$
and we relate it to the specific $T$. So, if $\text{min}(U_{1},U_{2})=U_{1}$, then $T=T_{1}$ will be the test statistic.
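These steps are mechanical, so they are easy to automate. Below is a small helper (the name `rank_sum_stats` is our own) that returns the rank sums and both $U$ values for two samples:
```{r}
##################################
# RANK SUMS AND U STATISTICS
##################################
rank_sum_stats <- function(x, y) {
  n1 <- length(x)
  n2 <- length(y)
  r  <- rank(c(x, y))                # rank the combined sample
  T1 <- sum(r[1:n1])                 # rank sum of sample 1
  T2 <- sum(r[(n1 + 1):(n1 + n2)])   # rank sum of sample 2
  c(T1 = T1,
    T2 = T2,
    U1 = T1 - n1 * (n1 + 1) / 2,
    U2 = T2 - n2 * (n2 + 1) / 2)
}
```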
### 3.2.3 Conclusion: The Logic
If the locations of the two populations are about the same, we would expect the rank sums $T_{1}$ and $T_{2}$ to be close, since the ranks would then be evenly spread between the samples.
If $T_{1}$ is sufficiently small, then most of the smaller observations are in sample $1$. We then conclude that the location of population $1$ is to the left of population $2$, and reject $H_{0}$.
On the other hand, if $T_{1}$ is sufficiently large, then most of the larger observations are in sample $1$. We conclude, therefore, that the location of population $1$ is to the right of population $2$.
::: {.callout-warning title="Worked Example" icon="false"}
Suppose we have the following samples:
$\text{Sample 1}=\{0, 1, 1,0,1,2,1,2,3\}$
$\text{Sample 2}=\{7, 9, 10, 8, 10, 11, 10, 11, 12\}$
We can combine the two samples into one set of values and rank the new set of values.
| | | | | | | | | | | | | | | | | | |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| **0** | **0** | **1** | **1** | **1** | **1** | **2** | **2** | **3** | 7 | 8 | 9 | 10 | 10 | 10 | 11 | 11 | 12 |
| 1.5 | 1.5 | 4.5 | 4.5 | 4.5 | 4.5 | 7.5 | 7.5 | 9 | 10 | 11 | 12 | 14 | 14 | 14 | 16.5 | 16.5 | 18 |
Without even calculating the test statistic, we can see that the ranks of sample 2 are much larger than those of sample 1. We can concretely show this. We get
$$
T_{1}=45 \quad \text{and} \quad T_{2}=126
$$
and so we obtain that
$$
U_{1}=0 \quad U_{2}=81
$$
Clearly, then, the test statistic is going to be
$$
T=45
$$
Since we have small sample sizes ($n_{1}, n_{2}<10$), we use the **Mann-Whitney table** to find the rejection region. We use $\alpha=0.05$ for this case.
{width="636"}
We define $T_{L}$. This value can be obtained from the table with the appropriate $\alpha$ level, as the intersection of the two sample sizes (for small samples). In this case
$$
T_{L}=63
$$
We define $T_{U}=n_{1}(n_{1}+n_{2}+1)-T_{L}$. In our case, we get that
$$
T_{U}=(9)(9+9+1)-63=108
$$
We reject $H_{0}$ if $T \leq T_{L}$ or $T \geq T_{U}$.
In our case, we reject the null hypothesis since $(T=45) \leq (T_{L}=63)$. We then conclude that the location of sample one is to the left of the location of sample two. Most of the observations in sample one are smaller than the observations in sample two.
:::
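Running the helper from Section 3.2.2 on this example confirms the hand calculation. R's `wilcox.test()` can also be used here; the `W` it reports is in fact $U_{1}$, and because of the ties it will use a normal approximation for the p-value:
```{r}
##################################
# CHECKING THE WORKED EXAMPLE
##################################
s1 <- c(0, 1, 1, 0, 1, 2, 1, 2, 3)
s2 <- c(7, 9, 10, 8, 10, 11, 10, 11, 12)

rank_sum_stats(s1, s2)   # T1 = 45, T2 = 126, U1 = 0, U2 = 81
wilcox.test(s1, s2)      # R's "W" is U1 = 0
```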
For large sample sizes, where $n_{1}$ or $n_{2}$ is larger than $10$, the sampling distribution of the test statistic can be approximated by a normal distribution.
::: {.callout-note title="Mann-Whitney-U Test for Large Sample Sizes"}
Since $T$ is approximated by a normal distribution for large sample sizes, we can standardise $T$ to obtain a $z$ score:
$$
z=\frac{T-\mu_{T}}{\sigma_{T}}
$$
where
$$
\mu_{T}=\frac{n_{1}(n_{1}+n_{2}+1)}{2} \quad \text{and} \quad \sigma_{T}=\sqrt{\frac{n_{1}n_{2}(n_{1}+n_{2}+1)}{12}}
$$
Then, we reject the null hypothesis if:
- $|z| \geq z_{\alpha/2}$ for a two-sided test
- $z>z_{\alpha}$ for a right-tailed test
- $z<-z_{\alpha}$ for a left-tailed test
- $p \leq \alpha$ when using the $p$-value approach
:::
Sometimes the sample sizes $n_{1}$ and $n_{2}$ will not be equal. Given the way in which we calculate the test statistic for this test, this is not a problem; we proceed exactly as established.
::: {.callout-warning title="Worked Example Two"}
The *ABC Company* has sent $13$ of its employees to a privately-run programme providing word-processing skills training. Six of the employees were from the data-processing (DP) department, and the rest were from the typing (T) pool.
At the end of the programme, the company received a report indicating the score received by each of the employees out of a total possible score of $100$.
We have the following:
| DP | T |
|:---:|:---:|
| 70 | 59 |
| 52 | 70 |
| 46 | 75 |
| 65 | 85 |
| 60 | 50 |
| 40 | 82 |
| | 64 |
**Is there a difference in the performance of the two groups in the word-processing programme? Test at a** $5\%$ **significance level.**
We state the null and alternative hypotheses as
$$
H_{0}: \text{There is no difference in the performance between the two groups}
$$
$$
\text{and}
$$
$$
H_{1}:\text{There is a difference in performance between the two groups}
$$
We are given that $\alpha=0.05$. We then combine the two samples for ranking. This gives us the following:
| | | | | | | | | | | | | | |
|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| **Data** | **70** | **52** | **46** | **65** | **60** | **40** | 59 | 70 | 75 | 85 | 50 | 82 | 64 |
| **Ordered** | **40** | **46** | 50 | **52** | 59 | **60** | 64 | **65** | 70 | **70** | 75 | 82 | 85 |
| Rank | **1** | **2** | 3 | **4** | 5 | **6** | 7 | **8** | 9.5 | **9.5** | 11 | 12 | 13 |
We find, from this, that $T_{1}=30.5$ and $T_{2}=60.5$. Then,
$$
U_{1}=30.5-\frac{6(7)}{2}=9.5 \quad \text{and} \quad U_{2}=60.5-\frac{7(8)}{2}=32.5
$$
This gives the test statistic as
$$
T=30.5
$$
To find the critical value, we note that $n_{1}=6$ and $n_{2}=7$. Since these are both less than $10$, we can obtain $T_{L}$ using the **Mann-Whitney table**. We get that
$$
T_{L}=28
$$
Then,
$$
T_{U}=6(6+7+1)-28=56
$$
**Conclusion: We find that** $T_{L} \leq T \leq T_{U}$**. So, we fail to reject the null hypothesis, and conclude that there is no evidence of a significant difference in performance between the two groups in the word-processing skills training programme.**
:::
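Once again, the built-in test agrees with the manual approach: the `W` reported by `wilcox.test()` corresponds to $U_{1}=9.5$, and the two-sided p-value (a normal approximation, because of the tied scores) exceeds $0.05$:
```{r}
##################################
# DP VS TYPING POOL IN R
##################################
DP     <- c(70, 52, 46, 65, 60, 40)
Typing <- c(59, 70, 75, 85, 50, 82, 64)

wilcox.test(DP, Typing)
```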
::: {.callout-warning title="Worked Example Three" icon="false"}
A pharmaceutical company is planning to introduce a new painkiller. To determine the effectiveness of the drug in comparison to aspirin, $30$ people were randomly selected.
- $15$ people were given the new drug (Sample $1$)
- $15$ people were given aspirin (Sample $2$)
Each participant was asked to indicate which one of the five statements best represented the effectiveness of the drug they took. The statements are as follows:
The drug taken was...
- \(5\) Extremely effective
- \(4\) Quite effective
- \(3\) Somewhat effective
- \(2\) Slightly effective
- \(1\) Not effective
***Note: This is ordinal data***
The ratings were recorded as follows
| New Drug | Aspirin |
|:--------:|:------:|
| 3 | 4 |
| 5 | 1 |
| 4 | 3 |
| 3 | 2 |
| 2 | 4 |
| 5 | 1 |
| 1 | 3 |
| 4 | 4 |
| 5 | 2 |
| 3 | 2 |
| 3 | 2 |
| 5 | 4 |
| 5 | 3 |
| 5 | 4 |
| 4 | 5 |
**At the** $5\%$ **significance level, is the new drug perceived to be more effective than aspirin?**
As usual, we start with the null and alternative hypotheses:
$$
H_{0}:\text{there is no difference in the perceived effectiveness of the two painkillers}
$$
$$
H_{1}: \text{the new drug is perceived to be more effective than aspirin}
$$
We are given a $5\%$ significance level. We notice that $n_{1},n_{2}>10$, and so the sampling distribution of the test statistic is approximately normal. We first find the test statistic $T$ (in the Data column below, bold values belong to the new-drug sample):
| | | |
|:--------:|:-----------:|:--------:|
| **Data** | **Ordered** | **Rank** |
| **3** | **1** | **2** |
| **5** | 1 | 2 |
| **4** | 1 | 2 |
| **3** | **2** | **6** |
| **2** | 2 | 6 |
| **5** | 2 | 6 |
| **1** | 2 | 6 |
| **4** | 2 | 6 |
| **5** | **3** | **12** |
| **3** | **3** | **12** |
| **3** | **3** | **12** |
| **5** | **3** | **12** |
| **5** | 3 | 12 |
| **5** | 3 | 12 |
| **4** | 3 | 12 |
| 4 | **4** | **19.5** |
| 1 | **4** | **19.5** |
| 3 | **4** | **19.5** |
| 2 | 4 | 19.5 |
| 4 | 4 | 19.5 |
| 1 | 4 | 19.5 |
| 3 | 4 | 19.5 |
| 4 | 4 | 19.5 |
| 2 | **5** | **27** |
| 2 | **5** | **27** |
| 2 | **5** | **27** |
| 4 | **5** | **27** |
| 3 | **5** | **27** |
| 4 | **5** | **27** |
| 5 | 5 | 27 |
and so we obtain that $T_{1}=276.5$ and $T_{2}=188.5$. This gives us our test statistic as
$$
T=276.5
$$
Before finding the $z$-score, we calculate
$$
\mu_{T}=\frac{(15)(15+15+1)}{2}=232.5
$$
and
$$
\sigma_{T}=\sqrt{\frac{(15)(15)(15+15+1)}{12}}\approx24.11
$$
and so we obtain that
$$
z=\frac{276.5-232.5}{24.11}=1.82
$$
This is a one-sided (right-tailed) test, and so we will reject $H_{0}$ if the $z$-score exceeds the critical value
```{r}
####################
# CRITICAL VALUE
###################
zcrit <- qnorm(0.05, lower.tail=FALSE)
zcrit
```
which it clearly is. We would also reject if the $p$-value is less than $0.05$
```{r}
#############
# P-VALUE
############
pval <- pnorm(1.82, lower.tail=FALSE)
pval
```
which, again, it clearly is.
**Conclusion: We reject** $H_{0}$ **and conclude that there is significant evidence that the new drug is perceived to be more effective than aspirin.**
:::
# 4. Kruskal-Wallis Test
::: {.callout-important title="Key Idea"}
A **Kruskal-Wallis test** is used when we want to compare three or more independent groups/samples of ordinal or quantitative data with respect to their medians.
:::
It is the equivalent of a **single factor ANOVA**.
## 4.1. Data and Assumptions
1. The data is either ordinal or quantitative, but not necessarily normal
2. The treatment levels and observations within each treatment level are independent
3. There are, at least, three observations per group/sample
4. The distributions of the groups differ with respect to their location (median) only, if they differ at all
## 4.2. Hypothesis Testing for the Kruskal-Wallis Test
### 4.2.1. Hypotheses
We have the following:
$$
H_{0}: \text{the locations of the $k$ populations (groups) are the same}
$$
$$
H_{1}: \text{at least two populations differ}
$$
### 4.2.2. Calculating the Test Statistic
1. We combine the observations from all the $k$ groups to form one sample. This sample will have $n_{T}=\sum_{i=1}^{k}n_{i}$ observations.
2. Then, we rank the observations, averaging ranks for all tied observations
3. We calculate the sum of ranks, $T_{1}, T_{2},\dots,T_{k}$, for all the $k$ groups
::: callout-note
As a consequence of this, we have that
$$
\sum_{i=1}^{k}T_{i}=\frac{n_{T}(n_{T}+1)}{2}
$$
:::
The test statistic is then given by
$$
H=\left[\frac{12}{n_{T}(n_{T}+1)}\sum_{i=1}^{k}\left(\frac{T^{2}_{i}}{n_{i}}\right)\right]-3(n_{T}+1)
$$
::: callout-note
If all the populations have the same location, i.e. $H_{0}$ is true, then the ranks should be evenly distributed among the $k$ samples and the $H$ statistic will be small.
Here, "small" means "sufficiently close to zero"
:::
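Given the rank sums and the group sizes, computing $H$ is one line of arithmetic. A minimal sketch (the function name `kw_H` is our own):
```{r}
####################################
# KRUSKAL-WALLIS H FROM RANK SUMS
####################################
kw_H <- function(T, n) {
  # T: vector of rank sums; n: vector of group sizes
  nT <- sum(n)
  12 / (nT * (nT + 1)) * sum(T^2 / n) - 3 * (nT + 1)
}
```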
### 4.2.3. Critical Region
When the sample size in each of the $k$ groups is at least three, the sampling distribution of $H$ is approximately a **chi-squared distribution** with $k-1$ degrees of freedom. Thus, the test is one-sided, and we reject $H_{0}$ if $H$ is too large ($H \geq c$ for some critical value $c$), or if $p \leq \alpha$ for some defined significance level $\alpha$.
::: callout-note
If you are wondering how we calculate the critical region when $n_{i}<3$: **we don't**. The Kruskal-Wallis test is defined for $n_{i} \geq 3$ in each of the $k$ groups, and it is under this condition that the test statistic follows a chi-squared distribution.
:::
::: {.callout-warning title="Worked Example" icon="false"}
A $24$-hour restaurant wanted to determine how customers rate its three shifts with respect to speed of service. Three samples of $10$ customer response cards were randomly selected, one from each shift, and the customer ratings (from $1$ for "very slow" to $5$ for "very quick") were recorded. The **ranked data** (ranks in parentheses) is given in the following table
| 4:00 - midnight | midnight - 8:00 | 8:00 - 4:00 |
|:-----------:|:-----------:|:-----------:|
| 4 (27) | 3 (16.5) | 3 (16.5) |
| 4 (27) | 4 (27) | 1 (2) |
| 3 (16.5) | 2 (6.5) | 3 (16.5) |
| 4 (27) | 2 (6.5) | 2 (6.5) |
| 3 (16.5) | 3 (16.5) | 1 (2) |
| 3 (16.5) | 4 (27) | 3 (16.5) |
| 3 (16.5) | 3 (16.5) | 4 (27) |
| 3 (16.5) | 3 (16.5) | 2 (6.5) |
| 2 (6.5) | 2 (6.5) | 4 (27) |
| 3 (16.5) | 3 (16.5) | 1 (2) |
**Can we conclude that customers perceive the speed of service to be different among the three shifts at a 5 percent significance level?**
We have our hypotheses:
$$
H_{0}: \text{there is no difference in perception of the speed of service}
$$
$$
H_{1}: \text{there is a difference in the perception of the speed of service}
$$
From the table, we find that
$$
T_{1}=186.5 \quad T_{2}=156 \quad T_{3}=122.5
$$
and we can calculate the test statistic as
$$
H=\frac{12}{30(30+1)}\left(\frac{(186.5)^{2}}{10}+\frac{(156)^{2}}{10}+\frac{(122.5)^{2}}{10}\right)-3(30+1)=2.645
$$
We can calculate the critical value
```{r}
##########
# CRIT
#########
k <- 3
chi_crit <- qchisq(0.05, df=k-1, lower.tail=F)
chi_crit
```
and the $p$-value
```{r}
##############
# p-value
##############
p <- pchisq(2.645, k-1, lower.tail=F)
p
```
**Conclusion: In this case, we fail to reject the null hypothesis since our test statistic is not more extreme than the critical value, and** $p>0.05$**.** **We then conclude that there is no evidence of a difference in the perception of speed of service between the different shifts.**
:::
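R's `kruskal.test()` runs this test directly. Note that R applies a correction for ties, so its statistic comes out somewhat larger than the uncorrected $H=2.645$ computed by hand (reproduced below with `kw_H()` from Section 4.2.2), though the conclusion is the same here:
```{r}
##################################
# KRUSKAL-WALLIS TEST IN R
##################################
shift1 <- c(4, 4, 3, 4, 3, 3, 3, 3, 2, 3)
shift2 <- c(3, 4, 2, 2, 3, 4, 3, 3, 2, 3)
shift3 <- c(3, 1, 3, 2, 1, 3, 4, 2, 4, 1)

kruskal.test(list(shift1, shift2, shift3))

# Uncorrected H from the rank sums
kw_H(T = c(186.5, 156, 122.5), n = c(10, 10, 10))
```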
# 5. Friedman Test
::: {.callout-important title="Key Idea"}
A **Friedman test** is used when comparing more than two groups or samples of ordinal or quantitative data, using matched or blocked samples, with respect to their (median) locations.
:::
A Friedman test is the equivalent of a **randomised block design two-way ANOVA without interactions**.
## 5.1. Data and Assumptions
1. Data is either ordinal or quantitative, but not normal
2. The data comes from a blocked experiment with **b** blocks
3. The measurements **within a block** are **dependent**
4. The measurements **between blocks** are **independent**
5. No interaction between blocks and treatments
## 5.2. Hypothesis Testing for the Friedman Test
Before going deep into how we perform a hypothesis test for the Friedman test, it is worth looking at the structure of the experiments for which the test is used to investigate.
<div>
Recall that **blocking** is introduced into an experiment to improve the comparison of treatments by grouping the experimental units into blocks that are alike with respect to some characteristic. Each block contains the same number of experimental units, with each treatment occurring once per block. So,
$$
\text{number of units in each block}=\text{number of treatments}
$$
Here is an example of this:
| Treatment | **Block 1** | Block 2 | Block 3 | Block 4 |
|:---------:|:-----------:|:--------:|:--------:|:--------:|
| 1 | $y_{11}$ | $y_{12}$ | $y_{13}$ | $y_{14}$ |
| 2 | $y_{21}$ | $y_{22}$ | $y_{23}$ | $y_{24}$ |
| 3 | $y_{31}$ | $y_{32}$ | $y_{33}$ | $y_{34}$ |
| 4 | $y_{41}$ | $y_{42}$ | $y_{43}$ | $y_{44}$ |
| 5 | $y_{51}$ | $y_{52}$ | $y_{53}$ | $y_{54}$ |
</div>
So, we will end up measuring whether the $k$ treatment groups differ in their median.
### 5.2.1. Hypotheses
We have the following:
$$
H_{0}: \text{the locations of the $k$ populations are the same}
$$
$$
\text{and}
$$
$$
H_{1}: \text{at least two population locations differ}
$$
::: callout-tip
Remember to interpret your hypotheses based on the context of the question which you are trying to answer
:::
### 5.2.2. Calculating the Test Statistic
1. Rank the observations from smallest to largest within each block
2. Average ranks of tied observations within the same block
3. Calculate the rank sums $T_{1}, T_{2}, \dots, T_{k}$ for all the $k$ treatments
The test statistic is then given by
$$
F_{r}=\left[\frac{12}{b(k)(k+1)}\sum_{j=1}^{k}T_{j}^{2}\right]-3b(k+1)
$$
where
- $b$ is the number of blocks
- $k$ is the number of treatments; and
- $F_{r}$ is the test statistic, which approximately follows a **chi-squared distribution** with $k-1$ degrees of freedom, provided that $k \geq 5$ or $b \geq 5$
We then reject the null hypothesis if $F_{r}$ is too large under the assumption of the null hypothesis
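As with the Kruskal-Wallis statistic, $F_{r}$ is quick to compute once the rank sums are known. A minimal sketch (the name `friedman_Fr` is our own):
```{r}
##################################
# FRIEDMAN Fr FROM RANK SUMS
##################################
friedman_Fr <- function(T, b, k) {
  # T: vector of treatment rank sums; b: number of blocks; k: treatments
  12 / (b * k * (k + 1)) * sum(T^2) - 3 * b * (k + 1)
}
```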
:::: {.callout-warning title="Worked Example" icon="false"}
Four managers evaluate applicants for a job in an accounting firm on several dimensions including academic credentials, previous work experience and personal suitability. Each manager then summarises the results and produces an evaluation of the candidates. There are $5$ possibilities:
1. The candidate is in the top $5\%$ of applicants
2. The candidate is in the top $10\%$ of applicants, but not in the top $5\%$
3. The candidate is in the top $25\%$ of applicants, but not in the top $10\%$
4. The candidate is in the top $50\%$ of applicants, but not in the top $25\%$
5. The candidate is in the bottom $50\%$ of applicants
Eight applicants were randomly selected, and their evaluations by the four managers were recorded.
+-------------+-------------+-------------+-------------+-------------+
| Applicant | Manager 1 | Manager 2 | Manager 3 | Manager 4 |
+:===========:+:===========:+:===========:+:===========:+:===========:+
| **1** | 2 | 1 | 2 | 2 |
+-------------+-------------+-------------+-------------+-------------+
| **2** | 4 | 2 | 3 | 2 |
+-------------+-------------+-------------+-------------+-------------+
| **3** | 2 | 2 | 2 | 3 |
+-------------+-------------+-------------+-------------+-------------+
| **4** | 3 | 1 | 3 | 2 |
+-------------+-------------+-------------+-------------+-------------+
| **5** | 3 | 2 | 3 | 5 |
+-------------+-------------+-------------+-------------+-------------+
| **6** | 2 | 2 | 3 | 4 |
+-------------+-------------+-------------+-------------+-------------+
| **7** | 4 | 1 | 5 | 5 |
+-------------+-------------+-------------+-------------+-------------+
| **8** | 3 | 2 | 5 | 3 |
+-------------+-------------+-------------+-------------+-------------+
**Can we say that there are differences in the way the managers evaluate candidates?**
Here, we are trying to determine how being evaluated by a particular manager affects where the applicants are placed in the candidacy groups. So, the treatments are the managers. The blocking factor is the applicants themselves, since every treatment is applied to each applicant.
::: callout-tip
***To find the treatments, always ask yourself, "What effect are we trying to measure?" Since we are trying to measure the effect that each manager has on the scoring, that is our treatment – the managers.***
***Usually, then, the blocks will follow from this. However, you can ask yourself "What is being measured repeatedly for each treatment?"***
:::
Notice, also, that the observations within each block are dependent, since they are measurements of the same applicant. This makes sense: a stronger applicant is likely to receive a good evaluation from all four managers.
For the hypotheses, we have
$$ H_{0}: \text{there is no difference in the way that managers evaluate candidates} $$
$$ H_{1}: \text{there is a difference in the way that managers evaluate candidates} $$
To calculate the test statistic, we first rank within the blocks to obtain the sum of ranks. We have the following:
+-------------+-------------+-------------+-------------+-------------+
| Applicant | Manager 1 | Manager 2 | Manager 3 | Manager 4 |
+:===========:+:===========:+:===========:+:===========:+:===========:+
| **1** | 2 (3) | 1 (1) | 2 (3) | 2 (3) |
+-------------+-------------+-------------+-------------+-------------+
| **2** | 4 (4) | 2 (1.5) | 3 (3) | 2 (1.5) |
+-------------+-------------+-------------+-------------+-------------+
| **3** | 2 (2) | 2 (2) | 2 (2) | 3 (4) |
+-------------+-------------+-------------+-------------+-------------+
| **4** | 3 (3.5) | 1 (1) | 3 (3.5) | 2 (2) |
+-------------+-------------+-------------+-------------+-------------+
| **5** | 3 (2.5) | 2 (1) | 3 (2.5) | 5 (4) |
+-------------+-------------+-------------+-------------+-------------+
| **6** | 2 (1.5) | 2 (1.5) | 3 (3) | 4 (4) |
+-------------+-------------+-------------+-------------+-------------+
| **7** | 4 (2) | 1 (1) | 5 (3.5) | 5 (3.5) |
+-------------+-------------+-------------+-------------+-------------+
| **8** | 3 (2.5) | 2 (1) | 5 (4) | 3 (2.5) |
+-------------+-------------+-------------+-------------+-------------+
and we get the sum of ranks as $T_{1}=21$, $T_{2}=10$, $T_{3}=24.5$, and $T_{4}=24.5$. We can then calculate the test statistic. We obtain that
$$ F_{r}=\left[\frac{12}{(8)(4)(4+1)}\left((21)^{2}+(10)^{2}+(24.5)^{2}+(24.5)^{2}\right)\right]-3(8)(4+1)=10.61 $$
We can find the critical value (and therefore the critical region)
```{r}
####################
# CRITICAL VALUE
####################
k <- 4
crit <- qchisq(0.05, df=k-1, lower.tail=F)
crit
```
and the $p$-value associated with the test statistic.
```{r}
############
# P VALUE
############
p <- pchisq(10.61, df=k-1, lower.tail=F)
p
```
**Conclusion: Based on the test statistic being more extreme than the critical value, and having a** $p$-**value less than** $0.05$**, we reject the null hypothesis and conclude that there is evidence of a difference in the way that the different managers evaluate the candidates.**
::::
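The built-in `friedman.test()` accepts a matrix with one row per block and one column per treatment. R applies a tie correction here as well, so its statistic differs somewhat from the hand-computed $F_{r}=10.61$ (reproduced below with `friedman_Fr()`), but the decision is unchanged:
```{r}
##################################
# FRIEDMAN TEST IN R
##################################
ratings <- matrix(c(2, 1, 2, 2,
                    4, 2, 3, 2,
                    2, 2, 2, 3,
                    3, 1, 3, 2,
                    3, 2, 3, 5,
                    2, 2, 3, 4,
                    4, 1, 5, 5,
                    3, 2, 5, 3),
                  nrow = 8, byrow = TRUE)   # rows = applicants, cols = managers

friedman.test(ratings)

# Uncorrected statistic from the rank sums
friedman_Fr(T = c(21, 10, 24.5, 24.5), b = 8, k = 4)
```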
# 6. Spearman Rank Correlation Coefficient Test
::: {.callout-important title="Key Idea"}
The **Spearman Rank Correlation Coefficient Test** is used to measure the association between two samples/variables of ordinal or quantitative data
:::
This test is equivalent to the **Pearson's Correlation Coefficient Test**
## 6.1 Data and Assumptions
1. Both variables are, at least, ordinal (though, they may be quantitative), and at least one variable is not normal
2. There are a total of $n$ randomly selected paired observations
::: callout-note
Spearman's rank correlation coefficient is interpreted in the same way as Pearson's correlation. That is,
$$
-1 \leq r_{s} \leq 1
$$
and
- $-1 \implies$ perfect negative relationship
- $-0.5 \implies$ moderate negative relationship
- $0 \implies$ no relationship
- $0.5 \implies$ moderate positive relationship
- $+1 \implies$ perfect positive relationship
:::
## 6.2. Hypothesis Testing for Spearman Rank Correlation Test
### 6.2.1. Hypotheses
The null hypothesis is given by
$$
H_{0}: \rho_{s}=0 \text{ (no association between the two variables in the underlying population)}
$$
and the alternative hypotheses can either be one-sided or two-sided. For a two-sided alternative hypothesis, we have
$$
H_{1}: \rho_{s} \neq 0 \text{ (there is an association between the two variables in the underlying population)}
$$
and, for the one-sided alternative hypotheses, we have
$$
H_{1}: \rho_{s}>0 \text{ (positive correlation)}
$$
$$
\text{and}
$$
$$
H_{1}: \rho_{s}<0 \text{ (negative correlation)}
$$
### 6.2.2. Calculating the Test Statistic
To calculate the test statistic, we
1. Rank the two variables separately
2. Calculate the difference, $d$, within each pair of ranks. So,
$$
d_{i}=\text{rank}(x_{i})-\text{rank}(y_{i})
$$
3. The test statistic is then given by
$$
r_{s}=1-\frac{6\sum_{i=1}^{n}d^{2}_{i}}{n(n^{2}-1)}
$$
where $n$ is the **number of pairs** of data
For large samples ($n \geq 10$), the sampling distribution of the test statistic, $r_{s}$ is approximately normal, and the test $z$-score is given by
$$
z=\frac{r_{s}-\mu_{r_{s}}}{\sigma_{r_{s}}}
$$
where $\mu_{r_{s}} = 0$ under the assumption that $H_{0}$ is true and $\sigma_{r_{s}} = \sqrt{\frac{1}{n-1}}=\frac{1}{\sqrt{n-1}}$. From this, we can simplify the $z$ calculation by observing that
$$
z=r_{s}\sqrt{n-1}
$$
under the assumption that $H_{0}$ is true.
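Putting these pieces together, a short helper (the name `spearman_z` is our own) computes $r_{s}$ and the large-sample $z$ directly from raw data:
```{r}
####################################
# SPEARMAN r_s AND z FROM RAW DATA
####################################
spearman_z <- function(x, y) {
  n  <- length(x)
  d  <- rank(x) - rank(y)                    # differences in ranks
  rs <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))   # Spearman's coefficient
  c(rs = rs, z = rs * sqrt(n - 1))
}
```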
### 6.2.3. Conclusion
We then reject the null hypothesis if
- $|z| \geq z_{\alpha/2}$ for a two-sided test
- $z>z_{\alpha}$ for a right-tailed test; and
- $z<-z_{\alpha}$ for a left-tailed test; **OR**
- if the $p$-value is less than the defined $\alpha$
::: {.callout-warning title="Worked Example" icon="false"}
After several semesters without much success, Pat Statstud (a student in the lowest quarter of a statistics course) decided to try and improve his performance. Pat needed to know the secret of success for university students.
After many hours of discussion with other more successful students, Pat postulated a rather radical theory: **the longer one studied, the better one’s grade**.
To test the theory, Pat took a random sample of $35$ students in an economics course and asked each to report the average amount of time he or she studied economics, and the final mark (out of $100$) obtained (the results are tabulated below).
**Test to determine whether grade and study time are positively related.**
The ranked data is as follows.
```{r}
###############################
# STUDY TIME VS MARK DATA
###############################
library(dplyr)
library(gt)

# Left block
left <- tibble(
  Time      = c(30, 5, 36, 37, 32, 23, 34, 2, 34, 43, 34, 32, 30, 36, 40, 24, 0, 25),
  Rank_Time = c(17, 4, 30.5, 32, 22.5, 7, 28, 2.5, 28, 35, 28, 22.5, 17, 30.5, 34, 8.5, 1, 10.5),
  Mark      = c(71, 30, 82, 98, 78, 73, 82, 25, 94, 99, 85, 74, 79, 82, 88, 55, 7, 62),
  Rank_Mark = c(9, 4, 17.5, 34, 14, 10.5, 17.5, 3, 32, 35, 22, 12, 15, 17.5, 26, 5, 1, 6)
)

# Right block
right <- tibble(
  Time      = c(29, 21, 31, 30, 33, 30, 33, 22, 29, 24, 30, 2, 31, 33, 25, 38, 26),
  Rank_Time = c(13.5, 5, 20.5, 17, 25, 17, 25, 6, 13.5, 8.5, 17, 2.5, 20.5, 25, 10.5, 33, 12),
  Mark      = c(91, 66, 66, 73, 90, 88, 91, 64, 83, 87, 96, 16, 84, 92, 82, 88, 75),
  Rank_Mark = c(29.5, 8, 23, 10.5, 28, 26, 29.5, 7, 20, 24, 33, 2, 21, 31, 17.5, 26, 13)
)

# Combine and format
data <- bind_rows(left, right)

data %>%
  gt() %>%
  tab_header(
    title = "Study Time vs Marks Dataset"
  ) %>%
  fmt_number(
    columns = everything(),
    decimals = 1
  ) %>%
  tab_options(
    table.font.size = "small"
  )
```
We start with the null and alternative hypotheses. The null hypothesis is given by
$$
H_{0}: \text{more time spent studying doesn't improve one's grade } (\rho_{s}=0)
$$
and the alternative hypothesis is
$$
H_{1}: \text{more time spent studying improves one's grade } (\rho_{s}>0)
$$
We will test at the $5\%$ significance level. To calculate the test statistic, we will need the differences. These are given in the table below.
```{r}
#######################################
# STUDY TIME VS MARK DATA (WITH d_i)
#######################################
# `data` was built in the previous chunk; here we only add the rank differences
data %>%
  mutate(
    d_i = Rank_Time - Rank_Mark
  ) %>%
  gt() %>%
  tab_header(
    title = "Study Time vs Marks Dataset (with Differences)"
  ) %>%
  fmt_number(
    columns = everything(),
    decimals = 1
  ) %>%
  tab_options(
    table.font.size = "small"
  )
```
Now, we are ready to calculate the test statistic.
$$
r_{s}=1-6\left[\frac{(8)^{2}+(0)^{2}+(13)^{2}+\dots+(7)^{2}+(-1)^{2}}{35((35)^{2}-1)}\right]\approx0.7251
$$
and the associated $z$-score will be
$$
z=0.7251\sqrt{35-1}=4.228
$$
The critical value (note that this is a one-sided test) is given by
```{r}
####################
# CRITICAL POINT
####################
zcrit <- qnorm(0.05, lower.tail=F)
zcrit
```
$z \geq 1.645$, and the $p$-value for the test statistic is given by
```{r}
##################
# P-VALUE
#################
pv <- pnorm(4.228, lower.tail=F)
pv
```
**Conclusion: We reject the null hypothesis since the test statistic falls into the rejection region, and the** $p$**-value is less than the significance level defined (**$\alpha=0.05$**). We then conclude that there is significant evidence of a positive relationship between the amount of time spent studying and the grade of a student.**
:::
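For a cross-check, the $z$ statistic can be reproduced from the tabulated ranks, and R's `cor.test()` with `method = "spearman"` can be run on the raw columns. Note that `cor.test()` recomputes the ranks itself and calculates the coefficient as the Pearson correlation of those ranks, so with ties its estimate can differ slightly from the hand calculation:
```{r}
##################################
# SPEARMAN TEST IN R
##################################
# Reproduce r_s and z from the tabulated ranks
n  <- nrow(data)
d  <- data$Rank_Time - data$Rank_Mark
rs <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))
c(rs = rs, z = rs * sqrt(n - 1))

# Built-in Spearman correlation test on the raw data
cor.test(data$Time, data$Mark,
         method = "spearman", alternative = "greater")
```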
# 7. Advantages and Disadvantages of Non-Parametric Statistical Techniques
::: {.callout-important title="Advantages of Non-Parameteric Tests"}
- Can be used when parametric techniques are not suited for the data samples given, and the validity of their assumptions is uncertain
- Useful for small sample sizes
- The assumptions are usually few and easily met
- They are not just restricted to quantitative data
:::
::: {.callout-important title="Disadvantages of Non-Parametric Tests"}
- Information is lost by ranking or taking signed ranks. As a result, these tests have less **power** (the probability of rejecting the null hypothesis when it is, in fact, false) than the equivalent parametric tests (when one is appropriate for the data)
:::