---
title: "Non-Parametrics"
author: "J Sigma"
editor: source
format:
  html:
    css: styles.css
    toc: true
    toc-depth: 3
    number-sections: false
    theme: cosmo
    code-fold: true
    code-tools: true
    smooth-scroll: true
    embed-resources: true
    page-navigation: true
execute:
  engine: knitr
  echo: true
  warning: false
  message: false
---
# 1. Introduction
## 1.1. Parametric Techniques
The statistical techniques we have used thus far are **parametric techniques**.
::: {.callout-important title="Parametric Techniques"}
**Parametric Techniques** are statistical techniques which
- Assume that data follows a particular distribution
- Often justify an assumption of approximate normality for large sample sizes via the **central limit theorem**
- Typically rely on quantitative data
- Via the assumption of distribution, sampling distributions of the test statistics are derived, and inferences are made about the unknown population parameters of the particular distribution
- Heavy focus on parameters
:::
::: {.callout-warning title="Example" icon="false"}
We may attempt to find the mean height of first-year students at *UCT*, where the standard deviation of the heights is also unknown.
By collecting a random sample, we then find the sample mean and sample standard deviation, and assume that the resulting test statistic follows a $t$-distribution. We can then calculate a test statistic for the mean height and use inference to decide whether it is significant at some $\alpha$ level.
This heavily depends on an assumption of normality.
:::
## 1.2. Non-Parametric Techniques
::: {.callout-important title="Non-Parametric Techniques"}
**Non-parametric techniques** are statistical techniques which are valid for a wide variety of underlying distributions, because they
- Only make weak assumptions about the distribution of the data
- Do not depend on parameters specific to a particular distribution
:::
As such, we use non-parametric techniques when
- we have non-normal quantitative data, or the distribution of the data is uncertain
- we have small samples
- we have qualitative data
::: callout-note
Non-parametric techniques can also be used when data is normally distributed; they are not limited to non-normal data.
However, if we know the underlying distribution of the data, it is better to use parametric techniques, since this gives us more **power**, i.e., a higher probability of correctly rejecting a false null hypothesis.
So, non-parametric techniques are always valid, but they are sometimes not the optimal choice for power.
:::
## 1.3. Data Types
We differentiate between **qualitative** and **quantitative data**
### 1.3.1. Qualitative (or Categorical) Data
::: {.callout-important title="Definition (Qualitative Data)"}
**Qualitative data** refers to data that represents categories, labels, or levels of a factor. If the levels are numbered, the numbers have no arithmetic meaning.
:::
Examples of categorical data may include gender, nationality, blood type, colours, and many more. The values of the categories here describe what something is, and not how much of it there is.
We further divide qualitative data into **nominal** and **ordinal** **data**
::: {.callout-warning title="Nominal vs Ordinal Data" icon="false"}
**Nominal data** refers to data which has categories that can be listed without any particular order, and this doesn't change the meaning of the data. For example, we may measure the number of people belonging to blood groups A, B, AB, and O. Here, there is no notion of any blood group having a higher value than any other.
On the other hand, **ordinal data** refers to categorical data that has a clear order structure. For example, we may consider looking at the year level of undergraduate students in University. We will have categorical groups, 1st year, 2nd year, 3rd year, and so on, but there is a clear order to these groups.
:::
### 1.3.2. Quantitative Data
::: {.callout-important title="Definition (Quantitative Data)"}
**Quantitative data** represents measurable quantities, where numerical values have meaningful arithmetic interpretation.
:::
Examples of quantitative data include height, income, time, and the number of lectures attended in a course. These are not just labels; they carry a meaningful magnitude.
Similar to qualitative data, we further differentiate between **interval** and **ratio-scaled quantitative data**.
::: {.callout-warning title="Inverval vs Ratio-Scaled Data" icon="false"}
**Ratio-scaled quantitative data** refers to quantitative data where the $0$ value has true meaning, in that it refers to an absence of the quantity being measured. Examples include height, weight, duration, and temperature when measured in Kelvin ($0$ K indicates an absence of thermal energy).
Ratios between values are also meaningful for this type of data. We may say things like: $20$ kilograms is twice as heavy as $10$ kilograms.
**Interval quantitative data** refers to quantitative data where the zero value has no physical or real meaning; the $0$ value does not indicate an absence of the quantity being measured. Examples include IQ score (where a score of $0$ does not indicate an absence of intelligence), time of day, and temperature when measured in degrees Celsius or Fahrenheit ($0^{\circ}$C or $0^{\circ}$F do not indicate an absence of temperature).
Ratios between values are not meaningful here. For example, $20^{\circ}$C is not twice as hot as $10^{\circ}$C. To show this, we can convert both to Kelvin (since it has a meaningful zero) and compute the relative change:
$10^{\circ}$C $=283$ K and $20^{\circ}$C $=293$ K. So, the relative change is
$$
\frac{293-283}{283}=0.0353\ldots\approx0.035
$$
This shows that $20^{\circ}$C is only about $3.5\%$ hotter than $10^{\circ}$C.
:::
## 1.4. Overview of Non-Parametric Tests
### 1.4.1. Single Population Tests
+-------------------------------------+---------------+--------------------------+
| Test | Data Type | Data |
+=====================================+===============+==========================+
| **Tests for Randomness of Order** | Nominal | Independent Observations |
+-------------------------------------+---------------+--------------------------+
| **Chi-Square Goodness of Fit Test** | Nominal | Independent Observations |
+-------------------------------------+---------------+--------------------------+
### 1.4.2. Two Population Tests
+---------------------------------------------+-----------------------------------------+------------------------+----------------------------+
| Tests for Equality of Medians | Data Type | Data | Parametric Test Equivalent |
+=============================================+=========================================+========================+============================+
| **Wilcoxon Rank Sum (Mann-Whitney U) Test** | Ordinal or non-normal Quantitative Data | Independent samples | $t$-test |
+---------------------------------------------+-----------------------------------------+------------------------+----------------------------+
| **Wilcoxon Signed Rank Sum Test** | Non-normal quantitative data | Matched/Paired samples | matched pairs $t$-test |
+---------------------------------------------+-----------------------------------------+------------------------+----------------------------+
| **Sign Test** | Ordinal data | Matched/Paired samples | matched pairs $t$-test |
+---------------------------------------------+-----------------------------------------+------------------------+----------------------------+
### 1.4.3. Three or More Population Tests
+-------------------------+-----------------------------------------+-------------------------+------------------------------------+
| Test | Data Type | Data | Parametric Test Equivalent |
+=========================+=========================================+=========================+====================================+
| **Kruskal-Wallis Test** | Ordinal or non-normal quantitative data | Independent samples | One-Way ANOVA |
+-------------------------+-----------------------------------------+-------------------------+------------------------------------+
| **Friedman Test** | Ordinal or non-normal quantitative data | Matched/Blocked samples | Two-Way ANOVA without interactions |
+-------------------------+-----------------------------------------+-------------------------+------------------------------------+
### 1.4.4. For Relationship Between Two Variables
+--------------------------------------+-----------------------+---------------------+----------------------------+
| Test | Data Type | Data | Parametric Test Equivalent |
+======================================+=======================+=====================+============================+
| **Spearman's Rank Correlation Test** | Ordinal or non-normal | Paired observations | Pearson's correlation |
+--------------------------------------+-----------------------+---------------------+----------------------------+
## 1.5. Ranking Data
Since non-parametric techniques rely on the ranks of observations instead of their actual numerical values, we need to understand how to rank data. We take the following steps:
1. We begin by sorting the data in some order (usually ascending)
2. We assign ranks by identifying the relative position of each value in the ordered data
3. We look out for **ties**
If there are no ties in data values, we assign the relative position to the data values. However, if there are ties, we assign an average rank to the tied data values
::: {.callout-warning title="Example One" icon="false"}
Suppose we are given the data values $4,9,6,7,5,2,8$. Then, we would assign ranks as follows:
| Data | 4 | 9 | 6 | 7 | 5 | 2 | 8 |
|:---------------------:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| **Ordered** | 2 | 4 | 5 | 6 | 7 | 8 | 9 |
| **Relative Position** | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
There are no ties in this data, and so we rank via the relative positions.
:::
::: {.callout-warning title="Example Two" icon="false"}
| | | | | | | | | | | | | | |
|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| **Data** | **29** | **18** | **29** | **19** | **20** | **21** | **20** | **33** | **30** | **23** | **33** | **33** | **24** |
| **Ordered** | 18 | 19 | 20 | 20 | 21 | 23 | 24 | 29 | 29 | 30 | 33 | 33 | 33 |
| **Relative Position** | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
| **Ranks** | **1** | **2** | **3.5** | **3.5** | **5** | **6** | **7** | **8.5** | **8.5** | **10** | **12** | **12** | **12** |
We have three sets of tied values in this data:
- $20$ and $20$ $\implies$ $\displaystyle{\frac{3+4}{2}}=3.5$
- $29$ and $29$ $\implies$ $\displaystyle{\frac{8+9}{2}}=8.5$
- $33$, $33$, and $33$ $\implies$ $\displaystyle{\frac{11+12+13}{3}}=12$
:::
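In R, the built-in `rank()` function reproduces this scheme: its default `ties.method = "average"` assigns tied values the average of their relative positions. A quick check against Example Two:
```{r}
#########################
# RANKING DATA IN R
#########################
x <- c(29, 18, 29, 19, 20, 21, 20, 33, 30, 23, 33, 33, 24)

# Average ranks are assigned to ties by default;
# ranks are returned in the original data order
rank(x, ties.method = "average")
```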
# 2. Wilcoxon Signed Rank Sum Test
::: {.callout-important title="Key Idea"}
A **Wilcoxon Signed Rank Sum Test** is used for comparing two matched, dependent samples of quantitative data (interval or ratio) with respect to central location.
It tests whether these two samples come from the same population.
:::
Recall that, in the parametric setting, we could assume the data was centred around the mean. So, we would perform a **paired** $t$**-test**, taking the mean of the paired differences and testing whether it differed significantly from zero.
Since we now make only weak assumptions about the distribution of the data, we use the median as the measure of the central location of the differences. This means that we compare the medians of the two samples to see if there is any significant difference between them.
## 2.1. Hypotheses
The null hypothesis, $H_{0}$, is always an assumption of no significant difference between the medians of the two groups. The alternative hypothesis can be one-sided or two-sided. So,
$$
H_{0}: \text{median of differences}=0 \text{ (i.e., no difference between samples)}
$$
$$
\text{and one of}
$$
$$
H_{1}:\text{median of differences} \neq 0 \text{ (i.e., there is a difference between samples)}
$$
$$
H_{1}: \text{median of differences}>0 \text{ (i.e., sample one has higher values than sample two)}
$$
$$
H_{1}: \text{median of differences}<0 \text{ (i.e., sample two has higher values than sample one)}
$$
::: callout-tip
Always state the null and alternative hypotheses in a way that references the information given in the context of the question. Make sure that your hypotheses are one-sided or two-sided based on the context, and that they are not generic. If the question references the finishing times of racers in a race, for example, then this should be evident in your hypotheses.
:::
## 2.2. Data and Assumptions
1. Two paired samples
2. Quantitative data (interval or ratio)
3. Under the assumption of $H_{0}$, the paired differences are symmetric around the median
4. The $n$ paired differences are independent and random
## 2.3. Calculating the Test Statistic
::: {.callout-important title="Test Statistic (Wilcoxon Signed Rank Sum Test)" icon="false"}
We take the following steps:
1. Begin by calculating the difference for each pair
2. Exclude the pairs with a difference of $0$
3. Record $n$, the number of non-zero differences
4. Record the sign of each paired difference
5. Rank the absolute values of the differences
6. The test statistic is then given by
$$
W=\text{Sum of the Signed Ranks}
$$
:::
***Question: Why does this work?***
*Answer: Under the assumption that* $H_{0}$ *is true, the differences are distributed symmetrically around zero. If we rank the absolute differences, then each rank is equally likely to carry a* $+$ or $-$ *sign.*
*Roughly, the positive and negative ranks will cancel out, and so we obtain that*
$$
W\approx0
$$
*This is how we answer the question of whether the positive and negative differences balance symmetrically around* $0$ *(i.e., whether* $\text{median of differences}=0$*).*
***Question: Why, though, are we opposed to taking the numerical differences and seeing if there is a true difference between the two samples?***
*Answer: That is exactly what a paired* $t$*-test does. However, it assumes that the differences come from a normal distribution. We want a test that captures the same notion without assuming that the data follows a normal distribution.*
::: {.callout-warning title="Worked Example Part I: (The Placebo)" icon="false"}
Before studying, a group of $6$ students are told that they are trying a new drink that supposedly improves concentration.
In reality, the drink is just flavoured water.
Each student writes a short test:
- Before drinking it; and
- After drinking it
and their test scores are recorded.\
+--------+--------+-------+
| Person | Before | After |
+:======:+:======:+:=====:+
| **1** | 80.0 | 78.6 |
+--------+--------+-------+
| **2** | 73.5 | 76.0 |
+--------+--------+-------+
| **3** | 85.0 | 81.2 |
+--------+--------+-------+
| **4** | 69.0 | 74.1 |
+--------+--------+-------+
| **5** | 77.8 | 78.5 |
+--------+--------+-------+
| **6** | 90.2 | 86.0 |
+--------+--------+-------+
Is there consistent evidence that the drink improved performance at all?\
\
**Note: For now, we are only focused on how we obtain the test statistic, and not on how we would conduct the whole hypothesis test.**
+--------+--------+--------+---------------+------------------------+---------+------+------+--------------+
| Person | Before | After | $d_{i}$ | $|d_{i}|$ | Ordered | Rank | Sign | Signed Ranks |
| | | | | | | | | |
| | | | (Differences) | (Absolute Differences) | | | | |
+:======:+:======:+:======:+:=============:+:======================:+:=======:+:====:+:====:+:============:+
| **1** | $80.0$ | $78.6$ | $+1.4$ | $1.4$ | $0.7$ | $1$ | $-$ | $-1$ |
+--------+--------+--------+---------------+------------------------+---------+------+------+--------------+
| **2** | $73.5$ | $76.0$ | $-2.5$ | $2.5$ | $1.4$ | $2$ | $+$ | $+2$ |
+--------+--------+--------+---------------+------------------------+---------+------+------+--------------+
| **3** | $85.0$ | $81.2$ | $+3.8$ | $3.8$ | $2.5$ | $3$ | $-$ | $-3$ |
+--------+--------+--------+---------------+------------------------+---------+------+------+--------------+
| **4** | $69.0$ | $74.1$ | $-5.1$ | $5.1$ | $3.8$ | $4$ | $+$ | $+4$ |
+--------+--------+--------+---------------+------------------------+---------+------+------+--------------+
| **5** | $77.8$ | $78.5$ | $-0.7$ | $0.7$ | $4.2$ | $5$ | $+$ | $+5$ |
+--------+--------+--------+---------------+------------------------+---------+------+------+--------------+
| **6** | $90.2$ | $86.0$ | $+4.2$ | $4.2$ | $5.1$ | $6$ | $-$ | $-6$ |
+--------+--------+--------+---------------+------------------------+---------+------+------+--------------+
**Note: The ordered absolute differences are sorted, so they are not necessarily associated with the person in the same row. When finding the signed ranks, you need to match each ordered value back to its original difference so that the signs are correct.**
We find the test statistic as
$$
W=-1+2-3+4+5-6=+1
$$
We see, here, that the $+$ and $-$ signs are scattered across the small and large ranks, and the result is that $W=1$. This is quite close to zero. So, there is no significant evidence against the null hypothesis of no difference in performance. We may conclude that the placebo did not really work.
:::
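As a sanity check, the signed-rank sum above can be reproduced in a few lines of R (the vectors below are just the Part I data re-entered):
```{r}
###################################
# VERIFYING W FOR PART I
###################################
before <- c(80.0, 73.5, 85.0, 69.0, 77.8, 90.2)
after  <- c(78.6, 76.0, 81.2, 74.1, 78.5, 86.0)

d <- before - after                # paired differences
d <- d[d != 0]                     # exclude zero differences
W <- sum(sign(d) * rank(abs(d)))   # sum of the signed ranks
W
```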
::: {.callout-warning title="Worked Example Part II: (The Placebo)" icon="false"}
A second group of students takes the same focus booster drink before writing a similar test. Again, the scores are recorded:
- Before drinking; and
- After drinking
+--------+--------+--------+---------+-----------+-------+------+------+--------------+
| Person | Before | After | $d_{i}$ | $|d_{i}|$ | Order | Rank | Sign | Signed Ranks |
+:======:+:======:+:======:+:=======:+:=========:+:=====:+:====:+:====:+:============:+
| **1** | $79.3$ | $82.5$ | $-3.2$ | $3.2$ | $1.1$ | $1$ | $+$ | $+1$ |
+--------+--------+--------+---------+-----------+-------+------+------+--------------+
| **2** | $69.1$ | $68.0$ | $1.1$ | $1.1$ | $2$ | $2$ | $-$ | $-2$ |
+--------+--------+--------+---------+-----------+-------+------+------+--------------+
| **3** | $85.4$ | $91.2$ | $-5.8$ | $5.8$ | $3.2$ | $3$ | $-$ | $-3$ |
+--------+--------+--------+---------+-----------+-------+------+------+--------------+
| **4** | $73.0$ | $75.0$ | $-2$ | $2$ | $4.5$ | $4$ | $-$ | $-4$ |
+--------+--------+--------+---------+-----------+-------+------+------+--------------+
| **5** | $83.8$ | $88.3$ | $-4.5$ | $4.5$ | $5.8$ | $5$ | $-$ | $-5$ |
+--------+--------+--------+---------+-----------+-------+------+------+--------------+
| **6** | $62.8$ | $70.1$ | $-7.3$ | $7.3$ | $7.3$ | $6$ | $-$ | $-6$ |
+--------+--------+--------+---------+-----------+-------+------+------+--------------+
And we find that our test statistic is
$$
W=1-2-3-4-5-6=-19
$$
This differs vastly from zero, and tells us that the test scores after drinking the focus booster are larger than the scores before drinking it.
:::
## 2.4. So, What is $W$ Really Measuring?
$W$ tries to determine whether the signs are distributed randomly across the ranks, or whether there is a pattern/skew towards one particular side. We have the following:
- $W\approx 0 \implies \text{no evidence of a pattern; signs are random}$
- $W\gg0 \implies \text{first sample (before) tends to be larger}$
- $W\ll0 \implies \text{second sample (after) tends to be larger}$
So, in the test for a significant difference, we ask how extreme $W$ is under $H_{0}$, i.e., how different $W$ is from zero.
## 2.5. Sampling Distribution of $W$
::: callout-note
The **sampling distribution** is the distribution of a statistic (in this case, $W$) over all the possible samples (or outcomes) under a given assumption.
For $W$, the sampling distribution is given by all the possible ways of assigning $+$ or $-$ signs to the ranks since we assume, under the null hypothesis, that the signs of the differences are random, i.e., $\text{median of differences}=0$
:::
Under $H_{0}$, the signs of the ranked differences are random, so each rank is equally likely to be positive or negative. It turns out that, for small sample sizes with no ties, it is possible to deduce the properties of the sampling distribution through simple enumeration of all possibilities.
::: {.callout-warning title="Example"}
Suppose that a data set has $3$ values. Then, we will get $3$ ranks from this data set. Let these ranks be $1,2, \text{ and }3$.
Each rank can take on a $+$ or $-$ sign. So, the total number of combinations of $+$ and $-$ signs is going to be given by
$$
2^{3}=8
$$
The following table shows how we can get this:
| 1 | 2 | 3 | W |
|:---:|:---:|:---:|:----------:|
| $+$ | $+$ | $+$ | $1+2+3=6$ |
| $-$ | $+$ | $+$ | $-1+2+3=4$ |
| $+$ | $-$ | $+$ | $+2$ |
| $+$ | $+$ | $-$ | $0$ |
| $-$ | $-$ | $+$ | $0$ |
| $-$ | $+$ | $-$ | $-2$ |
| $+$ | $-$ | $-$ | $-4$ |
| $-$ | $-$ | $-$ | $-6$ |
We can then take the proportion of each of the values for $W$ across the whole group to get the sampling distribution of $W$. We get the following:
```{r}
#################################
# SAMPLING DISTRIBUTION OF W
#################################
W <- c(-6, -4, -2, 0, 2, 4, 6)
prob <- c(1, 1, 1, 2, 1, 1, 1) / 8

barplot(prob,
        names.arg = W,
        xlab = "W",
        ylab = "Proportion",
        main = "Proportion of Signed Differences")
```
:::
You can see how this can get out of hand for larger sample sizes: $9$ ranks, say, already lead to $2^{9}=512$ sign combinations for $W$. In this case, it is useful to use R. Here is an example of code that performs this enumeration:
```{r}
##############################################
# SAMPLING DISTRIBUTION FOR W WITH 9 RANKS
##############################################
# Function to compute the sampling distribution of W
wilcoxon_W_dist <- function(n) {
  ranks <- 1:n
  # Generate all 2^n combinations of +1/-1 signs
  signs <- expand.grid(rep(list(c(-1, 1)), n))
  # Compute W = sum of the signed ranks
  W <- apply(signs, 1, function(s) sum(ranks * s))
  # Convert to a probability distribution
  dist <- table(W) / length(W)
  return(dist)
}

# Case n = 9
dist9 <- wilcoxon_W_dist(9)

# Plotting the distribution
barplot(dist9,
        xlab = "W",
        ylab = "Proportion",
        main = "Sampling Distribution of W (n = 9)")
```
Notice that as the number of ranks $n$ increases, the sampling distribution becomes more symmetric and smooth. This suggests that for larger values of $n$, the sampling distribution of $W$ resembles a normal distribution.
In fact, for large sample sizes ($n>10$), the sampling distribution of $W$ can be approximated by a normal distribution with
- a mean of $\mu_{W}=0$; and
- a standard deviation of $\sigma_{W}=\displaystyle{\sqrt{\frac{n(n+1)(2n+1)}{6}}}$
For this test, we can have either a
(a) **two-sided test**, and we reject $H_{0}$ if $|z|>z_{\frac{\alpha}{2}}$; or
(b) a **one-sided test**, and we reject $H_{0}$ if $z>z_{\alpha} \text{ (right-tailed)}$ or $z<-z_{\alpha} \text{ (left-tailed)}$
We could also use a $p$-value approach whereby we find the $p$-value corresponding to the calculated test statistic. In this case, we reject $H_{0}$ if $p<\alpha$, for a given significance level $\alpha$.
::: {.callout-warning title="Worked Example (A case of larger values of n)" icon="false"}
In the following, we are trying to answer the question of whether a **"flexi-time"** work schedule helps to reduce the travel time of workers.
***Note: Your brain should immediately be notifying you that this will be a one-sided test since we are looking for a reduction in the variable of interest***
A random sample of $32$ workers was selected, and workers recorded their time in minutes before and after the program was implemented.
Using the **modified** $p$-**value approach**, test at the $5\%$ significance level. The full table of signed ranks is computed below.
```{r}
#########################
# DATA FOR FLEXI-TIME
#########################
data <- data.frame(
  Worker = 1:32,
  normal_arrival = c(34, 35, 43, 46, 16, 26, 68, 38, 61, 52, 68, 13, 69, 18, 53, 18,
                     41, 25, 17, 26, 44, 30, 19, 48, 29, 24, 51, 40, 26, 20, 19, 42),
  Flextime = c(31, 31, 44, 44, 15, 28, 63, 39, 63, 54, 65, 12, 71, 13, 55, 19,
               38, 23, 14, 21, 40, 33, 18, 51, 33, 21, 50, 38, 22, 19, 21, 38)
)

library(dplyr)
library(gt)

data %>%
  mutate(
    difference = normal_arrival - Flextime,
    abs_difference = abs(difference)
  ) %>%
  filter(difference != 0) %>%
  mutate(
    rank = rank(abs_difference, ties.method = "average"),
    signed_rank = rank * sign(difference)
  ) %>%
  gt()
```
We have the following hypotheses:
$H_{0}: \text{There is no difference in travel time between the normal and flexi-time work programs}$
$H_{1}: \text{Workers take longer to travel to work under the normal schedule than under flexi-time}$
and we are given a significance level of $\alpha=0.05$
Here, $n=\text{number of non-zero differences}=32>10$. So, the sampling distribution of $W$ is approximately normal. We calculate the test statistic as
$$
W=\sum_{i=1}^{32} \text{rank}(|d_{i}|)\cdot \text{sgn}(d_{i})=207
$$
We can then calculate the $z$-score associated with this test statistic as
$$
z=\frac{W-\mu_{W}}{\sigma_{W}}=\frac{207-0}{\sqrt{\frac{(32)(33)(65)}{6}}}=1.935
$$
***Note: Under the assumption that*** $H_{0}$ ***is true, we have that*** $\mu_{W}=0$
We can then find the $p$-value of the test statistic. This is a right-tailed test: we are looking for a reduction in travel time under flexi-time, so under $H_{1}$ we expect $d=\text{normal}-\text{flexi}>0$. Using this understanding, we can calculate the $p$-value using R
```{r}
#############################
# FINDING P-VALUE
#############################
p <- pnorm(1.935, lower.tail=F)
p
```
**Conclusion:**
Since the $p$-value is less than $0.05$, we reject the null hypothesis. We then conclude that there is significant evidence that workers take longer to travel in the normal work-hour program than they do with a Flexi-time schedule. The median difference is greater than zero.
:::
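For reference, R's built-in `wilcox.test()` carries out this test directly on the data frame created above. Note that R reports $V$, the sum of the *positive* ranks only, rather than the signed-rank sum $W$ used in these notes, so the statistic looks different even though the conclusion agrees; with ties present, R also falls back to the normal approximation automatically.
```{r}
#######################################
# BUILT-IN WILCOXON SIGNED RANK TEST
#######################################
wilcox.test(data$normal_arrival, data$Flextime,
            paired = TRUE, alternative = "greater")
```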
# 3. Mann-Whitney-U Test
::: {.callout-important title="Key Idea"}
The **Mann-Whitney-U Test** (or **U Test, Wilcoxon Rank Sum Test**, or just **Rank Sum Test**) is used to determine whether two independent samples of ordinal or quantitative data have the same central location (median).
:::
This test is the equivalent of the two-sample $t$-test for independent samples of normal data.
## 3.1. Data and Assumptions
1. We have two random samples of size $n_{1}$ and $n_{2}$
2. The data are either ordinal or quantitative, but not normal
3. Samples and observations within samples are independent
4. The distributions of the two populations differ with respect to location only (if they differ at all)
## 3.2. Hypothesis Testing for Mann-Whitney-U Tests
### 3.2.1. Hypotheses
We differentiate between one-sided and two-sided hypotheses. So:
**For a two-sided test:**
$$
H_{0}:\text{the two population medians are the same}
$$
$$
\text{and}
$$
$$
H_{1}: \text{the two population medians are different}
$$
**For a one-sided test:**
$$
H_{0}:\text{the two population medians are the same}
$$
$$
\text{and either}
$$
$$
H_{1}: \text{the location of the first population is to the right of that of the second population}
$$
$$
\text{or}
$$
$$
H_{1}:\text{the location of the first population is to the left of that of the second population}
$$
### 3.2.2. Calculating the Test Statistic
The test statistic here depends on $n_{1}$ and $n_{2}$. We find it in the following way:
1. Combine the two samples into a single set of values
2. Rank all observations from the smallest to largest, i.e., from $1$ to $n_{1}+n_{2}$
3. Calculate the sum of the ranks, $T_{1}=\text{sum of ranks for } n_{1}$ and $T_{2}=\text{sum of ranks for } n_{2}$
4. We calculate two statistics:
$$
U_{1}=T_{1}-\frac{n_{1}(n_{1}+1)}{2}
$$
$$
\text{and}
$$
$$
U_{2}=T_{2}-\frac{n_{2}(n_{2}+1)}{2}
$$
<!-- -->
5. The final test statistic is given by
$$
U=\text{min}(U_{1}, U_{2})
$$
and we relate it to the specific $T$. So, if $\text{min}(U_{1},U_{2})=U_{1}$, then $T=T_{1}$ will be the test statistic.
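These steps are mechanical, so they are easy to automate. Below is a small helper (the name `rank_sum_stats` is our own) that returns the rank sums and both $U$ values for two samples:
```{r}
##################################
# RANK SUMS AND U STATISTICS
##################################
rank_sum_stats <- function(x, y) {
  n1 <- length(x)
  n2 <- length(y)
  r  <- rank(c(x, y))                # rank the combined sample
  T1 <- sum(r[1:n1])                 # rank sum of sample 1
  T2 <- sum(r[(n1 + 1):(n1 + n2)])   # rank sum of sample 2
  c(T1 = T1,
    T2 = T2,
    U1 = T1 - n1 * (n1 + 1) / 2,
    U2 = T2 - n2 * (n2 + 1) / 2)
}
```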
### 3.2.3 Conclusion: The Logic
If the locations of the two populations are about the same, we would expect the rank sums $T_{1}$ and $T_{2}$ to be close, since the ranks would then be evenly spread between the samples.
If $T_{1}$ is sufficiently small, then most of the smaller observations are in sample $1$. We then conclude that the location of population $1$ is to the left of population $2$, and reject $H_{0}$.
On the other hand, if $T_{1}$ is sufficiently large, then most of the larger observations are in sample $1$. We conclude, therefore, that the location of population $1$ is to the right of population $2$.
::: {.callout-warning title="Worked Example" icon="false"}
Suppose we have the following samples:
$\text{Sample 1}=\{0, 1, 1,0,1,2,1,2,3\}$
$\text{Sample 2}=\{7, 9, 10, 8, 10, 11, 10, 11, 12\}$
We can combine the two samples into one set of values and rank the new set of values.
| | | | | | | | | | | | | | | | | | |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| **0** | **0** | **1** | **1** | **1** | **1** | **2** | **2** | **3** | 7 | 8 | 9 | 10 | 10 | 10 | 11 | 11 | 12 |
| 1.5 | 1.5 | 4.5 | 4.5 | 4.5 | 4.5 | 7.5 | 7.5 | 9 | 10 | 11 | 12 | 14 | 14 | 14 | 16.5 | 16.5 | 18 |
Without even calculating the test statistic, we can see that the ranks of sample 2 are much larger than those of sample 1. We can concretely show this. We get
$$
T_{1}=45 \quad \text{and} \quad T_{2}=126
$$
and so we obtain that
$$
U_{1}=0 \quad U_{2}=81
$$
Clearly, then, the test statistic is going to be
$$
T=45
$$
Since we have small sample sizes ($n_{1}, n_{2}<10$), we use the **Mann-Whitney table** to find the rejection region. We use $\alpha=0.05$ for this case.
{width="636"}
We define $T_{L}$. This value can be obtained from the table with the appropriate $\alpha$ level, as the intersection of the two sample sizes (for small samples). In this case
$$
T_{L}=63
$$
We define $T_{U}=n_{1}(n_{1}+n_{2}+1)-T_{L}$. In our case, we get that
$$
T_{U}=(9)(9+9+1)-63=108
$$
We reject $H_{0}$ if $T \leq T_{L}$ or $T \geq T_{U}$.
In our case, we reject the null hypothesis since $(T=45) \leq (T_{L}=63)$. We then conclude that the location of sample one is to the left of the location of sample two. Most of the observations in sample one are smaller than the observations in sample two.
:::
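Running the helper from Section 3.2.2 on this example confirms the hand calculation. R's `wilcox.test()` can also be used here; the `W` it reports is in fact $U_{1}$, and because of the ties it will use a normal approximation for the p-value:
```{r}
##################################
# CHECKING THE WORKED EXAMPLE
##################################
s1 <- c(0, 1, 1, 0, 1, 2, 1, 2, 3)
s2 <- c(7, 9, 10, 8, 10, 11, 10, 11, 12)

rank_sum_stats(s1, s2)   # T1 = 45, T2 = 126, U1 = 0, U2 = 81
wilcox.test(s1, s2)      # R's "W" is U1 = 0
```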
For large sample sizes, where $n_{1}$ or $n_{2}$ is larger than $10$, the sampling distribution of the test statistic can be approximated by a normal distribution.
::: {.callout-note title="Mann-Whitney-U Test for Large Sample Sizes"}
Since $T$ is approximated by a normal distribution for large sample sizes, we can standardise $T$ to obtain a $z$ score:
$$
z=\frac{T-\mu_{T}}{\sigma_{T}}
$$
where
$$
\mu_{T}=\frac{n_{1}(n_{1}+n_{2}+1)}{2} \quad \text{and} \quad \sigma_{T}=\sqrt{\frac{n_{1}n_{2}(n_{1}+n_{2}+1)}{12}}
$$
Then, we reject the null hypothesis if:
- $|z| \geq z_{\alpha/2}$ for a two-sided test
- $z>z_{\alpha}$ for a right-tailed test
- $z<-z_{\alpha}$ for a left-tailed test
- $p \leq \alpha$ when using the $p$-value approach
:::
Sometimes the sample sizes $n_{1}$ and $n_{2}$ will not be equal. Given the way in which we calculate the test statistic for this test, this is not a problem; we proceed exactly as established.
::: {.callout-warning title="Worked Example Two"}
The *ABC Company* has sent $13$ of its employees to a privately-run programme providing word-processing skills training. Six of the employees were from the data-processing (DP) department, and the rest were from the typing (T) pool.
At the end of the programme, the company received a report indicating the score received by each of the employees out of a total possible score of $100$.
We have the following:
| DP | T |
|:---:|:---:|
| 70 | 59 |
| 52 | 70 |
| 46 | 75 |
| 65 | 85 |
| 60 | 50 |
| 40 | 82 |
| | 64 |
**Is there a difference in the performance of the two groups in the word-processing programme? Test at a** $5\%$ **significance level.**
We state the null and alternative hypotheses as
$$
H_{0}: \text{There is no difference in the performance between the two groups}
$$
$$
\text{and}
$$
$$
H_{1}:\text{There is a difference in performance between the two groups}
$$
We are given that $\alpha=0.05$. We then combine the two samples for ranking. This gives us the following:
| | | | | | | | | | | | | | |
|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| **Data** | **70** | **52** | **46** | **65** | **60** | **40** | 59 | 70 | 75 | 85 | 50 | 82 | 64 |
| **Ordered** | **40** | **46** | 50 | **52** | 59 | **60** | 64 | **65** | 70 | **70** | 75 | 82 | 85 |
| Rank | **1** | **2** | 3 | **4** | 5 | **6** | 7 | **8** | 9.5 | **9.5** | 11 | 12 | 13 |
We find, from this, that $T_{1}=30.5$ and $T_{2}=60.5$. Then,
$$
U_{1}=30.5-\frac{6(7)}{2}=9.5 \quad \text{and} \quad U_{2}=60.5-\frac{7(8)}{2}=32.5
$$
This gives the test statistic as
$$
T=30.5
$$
To find the critical value, we note that $n_{1}=6$ and $n_{2}=7$. Since these are both less than $10$, we can obtain $T_{L}$ using the **Mann-Whitney table**. We get that
$$
T_{L}=28
$$
Then,
$$
T_{U}=6(6+7+1)-28=56
$$
**Conclusion: We find that** $T_{L} \leq T \leq T_{U}$**. So, we fail to reject the null hypothesis, and conclude that there is no evidence of a significant difference in performance between the two groups in the word-processing skills training programme.**
:::
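Once again, the built-in test agrees with the manual approach: the `W` reported by `wilcox.test()` corresponds to $U_{1}=9.5$, and the two-sided p-value (a normal approximation, because of the tied scores) exceeds $0.05$:
```{r}
##################################
# DP VS TYPING POOL IN R
##################################
DP     <- c(70, 52, 46, 65, 60, 40)
Typing <- c(59, 70, 75, 85, 50, 82, 64)

wilcox.test(DP, Typing)
```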
::: {.callout-warning title="Worked Example Three" icon="false"}
A pharmaceutical company is planning to introduce a new painkiller. To determine the effectiveness of the drug in comparison to aspirin, $30$ people were randomly selected.
- $15$ people were given the new drug (Sample $1$)
- $15$ people were given aspirin (Sample $2$)
Each participant was asked to indicate which one of the five statements best represented the effectiveness of the drug they took. The statements are as follows:
The drug taken was...
- \(5\) Extremely effective
- \(4\) Quite effective
- \(3\) Somewhat effective
- \(2\) Slightly effective
- \(1\) Not effective
***Note: This is ordinal data***
The ratings were recorded as follows
| New Drug | Aspirin |
|:--------:|:------:|
| 3 | 4 |
| 5 | 1 |
| 4 | 3 |
| 3 | 2 |
| 2 | 4 |
| 5 | 1 |
| 1 | 3 |
| 4 | 4 |
| 5 | 2 |
| 3 | 2 |
| 3 | 2 |
| 5 | 4 |
| 5 | 3 |
| 5 | 4 |
| 4 | 5 |
**At the** $5\%$ **significance level, is the new drug perceived to be more effective than aspirin?**
As usual, we start with the null and alternative hypotheses:
$$
H_{0}:\text{there is no difference in the perceived effectiveness of the two painkillers}
$$
$$
H_{1}: \text{the new drug is perceived to be more effective than aspirin}
$$
We are given a $5\%$ significance level. We notice that $n_{1},n_{2}>10$, and so the sampling distribution of the test statistic is approximately normal. We first find the test statistic $T$ (in the Data column below, bold values belong to the new-drug sample):
| | | |
|:--------:|:-----------:|:--------:|
| **Data** | **Ordered** | **Rank** |
| **3** | **1** | **2** |
| **5** | 1 | 2 |
| **4** | 1 | 2 |
| **3** | **2** | **6** |
| **2** | 2 | 6 |
| **5** | 2 | 6 |
| **1** | 2 | 6 |
| **4** | 2 | 6 |
| **5** | **3** | **12** |
| **3** | **3** | **12** |
| **3** | **3** | **12** |
| **5** | **3** | **12** |
| **5** | 3 | 12 |
| **5** | 3 | 12 |
| **4** | 3 | 12 |
| 4 | **4** | **19.5** |
| 1 | **4** | **19.5** |
| 3 | **4** | **19.5** |
| 2 | 4 | 19.5 |
| 4 | 4 | 19.5 |
| 1 | 4 | 19.5 |
| 3 | 4 | 19.5 |
| 4 | 4 | 19.5 |
| 2 | **5** | **27** |
| 2 | **5** | **27** |
| 2 | **5** | **27** |
| 4 | **5** | **27** |
| 3 | **5** | **27** |
| 4 | **5** | **27** |
| 5 | 5 | 27 |
and so we obtain that $T_{1}=276.5$ and $T_{2}=188.5$. This gives us our test statistic as
$$
T=276.5
$$
Before finding the $z$-score, we calculate
$$
\mu_{T}=\frac{(15)(15+15+1)}{2}=232.5
$$
and
$$
\sigma_{T}=\sqrt{\frac{(15)(15)(15+15+1)}{12}}\approx24.11
$$
and so we obtain that
$$
z=\frac{276.5-232.5}{24.11}=1.82
$$
This is a one-sided (right-tailed) test, and so we will reject $H_{0}$ if the $z$-score exceeds the critical value
```{r}
####################
# CRITICAL VALUE
###################
zcrit <- qnorm(0.05, lower.tail=FALSE)
zcrit
```
which it clearly is. We would also reject if the $p$-value is less than $0.05$
```{r}
#############
# P-VALUE
############
pval <- pnorm(1.82, lower.tail=FALSE)
pval
```
which, again, it clearly is.
**Conclusion: We reject** $H_{0}$ **and conclude that there is significant evidence that the new drug is perceived to be more effective than aspirin.**
:::
# 4. Kruskal-Wallis Test
::: {.callout-important title="Key Idea"}
A **Kruskal-Wallis test** is used when we want to compare three or more independent groups/samples of ordinal or quantitative data with respect to their medians.
:::
It is the equivalent of a **single factor ANOVA**.
## 4.1. Data and Assumptions
1. The data is either ordinal or quantitative, but not necessarily normal
2. The treatment levels and observations within each treatment level are independent
3. There are, at least, three observations per group/sample
4. The distributions of the groups differ with respect to their location (median) only, if they differ at all
## 4.2. Hypothesis Testing for the Kruskal-Wallis Test
### 4.2.1. Hypotheses
We have the following:
$$
H_{0}: \text{the locations of the $k$ populations (groups) are the same}
$$
$$
H_{1}: \text{at least two populations differ}
$$
### 4.2.2. Calculating the Test Statistic
1. We combine the observations from all the $k$ groups to form one sample. This sample will have $n_{T}=\sum_{i=1}^{k}n_{i}$ observations.
2. Then, we rank the observations, averaging ranks for all tied observations
3. We calculate the sum of ranks, $T_{1}, T_{2},\dots,T_{k}$, for all the $k$ groups
::: callout-note
As a consequence of this, we have that
$$
\sum_{i=1}^{k}T_{i}=\frac{n_{T}(n_{T}+1)}{2}
$$
:::
The test statistic is then given by
$$
H=\left[\frac{12}{n_{T}(n_{T}+1)}\sum_{i=1}^{k}\left(\frac{T^{2}_{i}}{n_{i}}\right)\right]-3(n_{T}+1)
$$
::: callout-note
If all the populations have the same location, i.e. $H_{0}$ is true, then the ranks should be evenly distributed among the $k$ samples and the $H$ statistic will be small.
Here, "small" means "sufficiently close to zero"
:::
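Given the rank sums and the group sizes, computing $H$ is one line of arithmetic. A minimal sketch (the function name `kw_H` is our own):
```{r}
####################################
# KRUSKAL-WALLIS H FROM RANK SUMS
####################################
kw_H <- function(T, n) {
  # T: vector of rank sums; n: vector of group sizes
  nT <- sum(n)
  12 / (nT * (nT + 1)) * sum(T^2 / n) - 3 * (nT + 1)
}
```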
### 4.2.3. Critical Region
When the sample size in each of the $k$ groups is at least three, the sampling distribution of $H$ is approximately a **chi-squared distribution** with $k-1$ degrees of freedom. Thus, the test is one-sided, and we reject $H_{0}$ if $H$ is too large ($H \geq c$ for some critical value $c$), or if $p \leq \alpha$ for some defined significance level $\alpha$.
::: callout-note
If you are wondering how we calculate the critical region when $n_{i}<3$: **we don't**. The Kruskal-Wallis test is defined for $n_{i} \geq 3$ in each of the $k$ groups, and it is under this condition that the test statistic follows a chi-squared distribution.
:::
::: {.callout-warning title="Worked Example" icon="false"}
A $24$-hour restaurant wanted to determine how customers rate its three shifts with respect to speed of service. Three samples of $10$ customer response cards were randomly selected, one from each shift, and the customer ratings (from $1$ for "very slow" to $5$ for "very quick") were recorded. The **ranked data** (ranks in parentheses) is given in the following table
| 4:00 - midnight | midnight - 8:00 | 8:00 - 4:00 |
|:-----------:|:-----------:|:-----------:|
| 4 (27) | 3 (16.5) | 3 (16.5) |
| 4 (27) | 4 (27) | 1 (2) |
| 3 (16.5) | 2 (6.5) | 3 (16.5) |
| 4 (27) | 2 (6.5) | 2 (6.5) |
| 3 (16.5) | 3 (16.5) | 1 (2) |
| 3 (16.5) | 4 (27) | 3 (16.5) |
| 3 (16.5) | 3 (16.5) | 4 (27) |
| 3 (16.5) | 3 (16.5) | 2 (6.5) |
| 2 (6.5) | 2 (6.5) | 4 (27) |
| 3 (16.5) | 3 (16.5) | 1 (2) |
**Can we conclude that customers perceive the speed of service to be different among the three shifts at a 5 percent significance level?**
We have our hypotheses:
$$
H_{0}: \text{there is no difference in perception of the speed of service}
$$
$$
H_{1}: \text{there is a difference in the perception of the speed of service}
$$
From the table, we find that
$$
T_{1}=186.5 \quad T_{2}=156 \quad T_{3}=122.5
$$
and we can calculate the test statistic as
$$
H=\frac{12}{30(30+1)}\left(\frac{(186.5)^{2}}{10}+\frac{(156)^{2}}{10}+\frac{(122.5)^{2}}{10}\right)-3(30+1)=2.645
$$
We can calculate the critical value
```{r}
##########
# CRIT
#########
k <- 3
chi_crit <- qchisq(0.05, df=k-1, lower.tail=F)
chi_crit
```
and the $p$-value
```{r}
##############
# p-value
##############
p <- pchisq(2.645, k-1, lower.tail=F)
p
```
**Conclusion: In this case, we fail to reject the null hypothesis since our test statistic is not more extreme than the critical value, and** $p>0.05$**.** **We then conclude that there is no evidence of a difference in the perception of speed of service between the different shifts.**
:::
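R's `kruskal.test()` runs this test directly. Note that R applies a correction for ties, so its statistic comes out somewhat larger than the uncorrected $H=2.645$ computed by hand (reproduced below with `kw_H()` from Section 4.2.2), though the conclusion is the same here:
```{r}
##################################
# KRUSKAL-WALLIS TEST IN R
##################################
shift1 <- c(4, 4, 3, 4, 3, 3, 3, 3, 2, 3)
shift2 <- c(3, 4, 2, 2, 3, 4, 3, 3, 2, 3)
shift3 <- c(3, 1, 3, 2, 1, 3, 4, 2, 4, 1)

kruskal.test(list(shift1, shift2, shift3))

# Uncorrected H from the rank sums
kw_H(T = c(186.5, 156, 122.5), n = c(10, 10, 10))
```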
# 5. Friedman Test
::: {.callout-important title="Key Idea"}
A **Friedman test** is used when comparing more than two groups or samples of ordinal or quantitative data, using matched or blocked samples, with respect to their (median) locations.
:::
A Friedman test is the equivalent of a **randomised block design two-way ANOVA without interactions**.
## 5.1. Data and Assumptions
1. Data is either ordinal or quantitative, but not normal
2. The data comes from a blocked experiment with **b** blocks
3. The measurements **within a block** are **dependent**
4. The measurements **between blocks** are **independent**
5. No interaction between blocks and treatments
## 5.2. Hypothesis Testing for the Friedman Test
Before going deep into how we perform a hypothesis test for the Friedman test, it is worth looking at the structure of the experiments for which the test is used to investigate.
<div>
Recall that **blocking** is introduced into an experiment to improve the comparison of treatments by grouping the experimental units into blocks that are alike with respect to some characteristic. Each block contains the same number of experimental units, with each treatment occurring once per block. So,
$$
\text{number of units in each block}=\text{number of treatments}
$$
Here is an example of this:
| Treatment | **Block 1** | Block 2 | Block 3 | Block 4 |
|:---------:|:-----------:|:--------:|:--------:|:--------:|
| 1 | $y_{11}$ | $y_{12}$ | $y_{13}$ | $y_{14}$ |
| 2 | $y_{21}$ | $y_{22}$ | $y_{23}$ | $y_{24}$ |
| 3 | $y_{31}$ | $y_{32}$ | $y_{33}$ | $y_{34}$ |
| 4 | $y_{41}$ | $y_{42}$ | $y_{43}$ | $y_{44}$ |
| 5 | $y_{51}$ | $y_{52}$ | $y_{53}$ | $y_{54}$ |
</div>
So, we will end up measuring whether the $k$ treatment groups differ in their median.
### 5.2.1. Hypotheses
We have the following:
$$
H_{0}: \text{the locations of the $k$ populations are the same}
$$
$$
\text{and}
$$
$$
H_{1}: \text{at least two population locations differ}
$$
::: callout-tip
Remember to interpret your hypotheses based on the context of the question which you are trying to answer
:::
### 5.2.2. Calculating the Test Statistic
1. Rank the observations from smallest to largest within each block
2. Average ranks of tied observations within the same block
3. Calculate the rank sums $T_{1}, T_{2}, \dots, T_{k}$ for all the $k$ treatments
The test statistic is then given by
$$
F_{r}=\left[\frac{12}{b(k)(k+1)}\sum_{j=1}^{k}T_{j}^{2}\right]-3b(k+1)
$$
where
- $b$ is the number of blocks
- $k$ is the number of treatments; and
- $F_{r}$ is the test statistic, which approximately follows a **chi-squared distribution** with $k-1$ degrees of freedom, provided that $k \geq 5$ or $b \geq 5$
We then reject the null hypothesis if $F_{r}$ is too large under the assumption of the null hypothesis
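As with the Kruskal-Wallis statistic, $F_{r}$ is quick to compute once the rank sums are known. A minimal sketch (the name `friedman_Fr` is our own):
```{r}
##################################
# FRIEDMAN Fr FROM RANK SUMS
##################################
friedman_Fr <- function(T, b, k) {
  # T: vector of treatment rank sums; b: number of blocks; k: treatments
  12 / (b * k * (k + 1)) * sum(T^2) - 3 * b * (k + 1)
}
```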
:::: {.callout-warning title="Worked Example" icon="false"}
Four managers evaluate applicants for a job in an accounting firm on several dimensions including academic credentials, previous work experience and personal suitability. Each manager then summarises the results and produces an evaluation of the candidates. There are $5$ possibilities:
1. The candidate is in the top $5\%$ of applicants
2. The candidate is in the top $10\%$ of applicants, but not in the top $5\%$
3. The candidate is in the top $25\%$ of applicants, but not in the top $10\%$
4. The candidate is in the top $50\%$ of applicants, but not in the top $25\%$
5. The candidate is in the bottom $50\%$ of applicants
Eight applicants were randomly selected, and their evaluations by the four managers were recorded.
+-------------+-------------+-------------+-------------+-------------+
| Applicant | Manager 1 | Manager 2 | Manager 3 | Manager 4 |
+:===========:+:===========:+:===========:+:===========:+:===========:+
| **1** | 2 | 1 | 2 | 2 |
+-------------+-------------+-------------+-------------+-------------+
| **2** | 4 | 2 | 3 | 2 |
+-------------+-------------+-------------+-------------+-------------+
| **3** | 2 | 2 | 2 | 3 |
+-------------+-------------+-------------+-------------+-------------+
| **4** | 3 | 1 | 3 | 2 |
+-------------+-------------+-------------+-------------+-------------+
| **5** | 3 | 2 | 3 | 5 |
+-------------+-------------+-------------+-------------+-------------+
| **6** | 2 | 2 | 3 | 4 |
+-------------+-------------+-------------+-------------+-------------+
| **7** | 4 | 1 | 5 | 5 |
+-------------+-------------+-------------+-------------+-------------+
| **8** | 3 | 2 | 5 | 3 |
+-------------+-------------+-------------+-------------+-------------+
**Can we say that there are differences in the way the managers evaluate candidates?**
Here, we are trying to determine how being evaluated by a particular manager affects where the applicants are placed in the candidacy groups. So, the treatments are the managers. The blocking factor is the applicants themselves, since every treatment is applied to each applicant.
::: callout-tip
***To find the treatments, always ask yourself, "What effect are we trying to measure?" Since we are trying to measure the effect that each manager has on the scoring, that is our treatment – the managers.***
***Usually, then, the blocks will follow from this. However, you can ask yourself "What is being measured repeatedly for each treatment?"***
:::
Notice, also, that the observations within each block are dependent, since they are measurements of the same applicant. This makes sense: a stronger applicant is likely to receive a good evaluation from all four managers.
For the hypotheses, we have
$$ H_{0}: \text{there is no difference in the way that managers evaluate candidates} $$
$$ H_{1}: \text{there is a difference in the way that managers evaluate candidates} $$
To calculate the test statistic, we first rank within the blocks to obtain the sum of ranks. We have the following:
+-------------+-------------+-------------+-------------+-------------+
| Applicant | Manager 1 | Manager 2 | Manager 3 | Manager 4 |
+:===========:+:===========:+:===========:+:===========:+:===========:+
| **1** | 2 (3) | 1 (1) | 2 (3) | 2 (3) |
+-------------+-------------+-------------+-------------+-------------+
| **2** | 4 (4) | 2 (1.5) | 3 (3) | 2 (1.5) |
+-------------+-------------+-------------+-------------+-------------+
| **3** | 2 (2) | 2 (2) | 2 (2) | 3 (4) |
+-------------+-------------+-------------+-------------+-------------+
| **4** | 3 (3.5) | 1 (1) | 3 (3.5) | 2 (2) |
+-------------+-------------+-------------+-------------+-------------+
| **5** | 3 (2.5) | 2 (1) | 3 (2.5) | 5 (4) |
+-------------+-------------+-------------+-------------+-------------+
| **6** | 2 (1.5) | 2 (1.5) | 3 (3) | 4 (4) |
+-------------+-------------+-------------+-------------+-------------+
| **7** | 4 (2) | 1 (1) | 5 (3.5) | 5 (3.5) |
+-------------+-------------+-------------+-------------+-------------+
| **8** | 3 (2.5) | 2 (1) | 5 (4) | 3 (2.5) |
+-------------+-------------+-------------+-------------+-------------+
and we get the sum of ranks as $T_{1}=21$, $T_{2}=10$, $T_{3}=24.5$, and $T_{4}=24.5$. We can then calculate the test statistic. We obtain that
$$ F_{r}=\left[\frac{12}{(8)(4)(4+1)}\left((21)^{2}+(10)^{2}+(24.5)^{2}+(24.5)^{2}\right)\right]-3(8)(4+1)=10.61 $$
We can find the critical value (and therefore the critical region)
```{r}
####################
# CRITICAL VALUE
####################
k <- 4
crit <- qchisq(0.05, df=k-1, lower.tail=F)
crit
```
and the $p$-value associated with the test statistic.
```{r}
############
# P VALUE
############
p <- pchisq(10.61, df=k-1, lower.tail=F)
p
```
**Conclusion: Based on the test statistic being more extreme than the critical value, and having a** $p$-**value less than** $0.05$**, we reject the null hypothesis and conclude that there is evidence of a difference in the way that the different managers evaluate the candidates.**
::::
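The built-in `friedman.test()` accepts a matrix with one row per block and one column per treatment. R applies a tie correction here as well, so its statistic differs somewhat from the hand-computed $F_{r}=10.61$ (reproduced below with `friedman_Fr()`), but the decision is unchanged:
```{r}
##################################
# FRIEDMAN TEST IN R
##################################
ratings <- matrix(c(2, 1, 2, 2,
                    4, 2, 3, 2,
                    2, 2, 2, 3,
                    3, 1, 3, 2,
                    3, 2, 3, 5,
                    2, 2, 3, 4,
                    4, 1, 5, 5,
                    3, 2, 5, 3),
                  nrow = 8, byrow = TRUE)   # rows = applicants, cols = managers

friedman.test(ratings)

# Uncorrected statistic from the rank sums
friedman_Fr(T = c(21, 10, 24.5, 24.5), b = 8, k = 4)
```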
# 6. Spearman Rank Correlation Coefficient Test
::: {.callout-important title="Key Idea"}
The **Spearman Rank Correlation Coefficient Test** is used to measure the association between two samples/variables of ordinal or quantitative data
:::
This test is equivalent to the **Pearson's Correlation Coefficient Test**
## 6.1 Data and Assumptions
1. Both variables are, at least, ordinal (though, they may be quantitative), and at least one variable is not normal
2. There are a total of $n$ randomly selected paired observations
::: callout-note
Spearman's rank correlation coefficient is interpreted in the same way as Pearson's correlation. That is,
$$
-1 \leq r_{s} \leq 1
$$
and
- $-1 \implies$ perfect negative relationship
- $-0.5 \implies$ moderate negative relationship
- $0 \implies$ no relationship
- $0.5 \implies$ moderate positive relationship
- $+1 \implies$ perfect positive relationship
:::
## 6.2. Hypothesis Testing for Spearman Rank Correlation Test
### 6.2.1. Hypotheses
The null hypothesis is given by
$$
H_{0}: \rho_{s}=0 \text{ (no association between the two variables in the underlying population)}
$$
and the alternative hypotheses can either be one-sided or two-sided. For a two-sided alternative hypothesis, we have
$$
H_{1}: \rho_{s} \neq 0 \text{ (there is an association between the two variables in the underlying population)}
$$
and, for the one-sided alternative hypotheses, we have
$$
H_{1}: \rho_{s}>0 \text{ (positive correlation)}
$$
$$
\text{and}
$$
$$
H_{1}: \rho_{s}<0 \text{ (negative correlation)}
$$
### 6.2.2. Calculating the Test Statistic
To calculate the test statistic, we
1. Rank the two variables separately
2. Calculate the difference, $d$, within each pair of ranks. So,
$$
d_{i}=\text{rank}(x_{i})-\text{rank}(y_{i})
$$
3. The test statistic is then given by
$$
r_{s}=1-\frac{6\sum_{i=1}^{n}d^{2}_{i}}{n(n^{2}-1)}
$$
where $n$ is the **number of pairs** of data
For large samples ($n \geq 10$), the sampling distribution of the test statistic, $r_{s}$ is approximately normal, and the test $z$-score is given by
$$
z=\frac{r_{s}-\mu_{r_{s}}}{\sigma_{r_{s}}}
$$
where $\mu_{r_{s}} = 0$ under the assumption that $H_{0}$ is true and $\sigma_{r_{s}} = \sqrt{\frac{1}{n-1}}=\frac{1}{\sqrt{n-1}}$. From this, we can simplify the $z$ calculation by observing that
$$
z=r_{s}\sqrt{n-1}
$$
under the assumption that $H_{0}$ is true.
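Putting these pieces together, a short helper (the name `spearman_z` is our own) computes $r_{s}$ and the large-sample $z$ directly from raw data:
```{r}
####################################
# SPEARMAN r_s AND z FROM RAW DATA
####################################
spearman_z <- function(x, y) {
  n  <- length(x)
  d  <- rank(x) - rank(y)                    # differences in ranks
  rs <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))   # Spearman's coefficient
  c(rs = rs, z = rs * sqrt(n - 1))
}
```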
### 6.2.3. Conclusion
We then reject the null hypothesis if
- $|z| \geq z_{\alpha/2}$ for a two-sided test
- $z>z_{\alpha}$ for a right-tailed test; and
- $z<-z_{\alpha}$ for a left-tailed test; **OR**
- if the $p$-value is less than the defined $\alpha$
::: {.callout-warning title="Worked Example" icon="false"}
After several semesters without much success, Pat Statstud (a student in the lowest quarter of a statistics course) decided to try and improve his performance. Pat needed to know the secret of success for university students.
After many hours of discussion with other more successful students, Pat postulated a rather radical theory: **the longer one studied, the better one’s grade**.
To test the theory, Pat took a random sample of $35$ students in an economics course and asked each to report the average amount of time he or she studied economics, and the final mark (out of $100$) obtained (the results are tabulated below).
**Test to determine whether grade and study time are positively related.**
The ranked data is as follows.
```{r}
###############################
# STUDY TIME VS MARK DATA
###############################
library(dplyr)
library(gt)

# Left block
left <- tibble(
  Time      = c(30, 5, 36, 37, 32, 23, 34, 2, 34, 43, 34, 32, 30, 36, 40, 24, 0, 25),
  Rank_Time = c(17, 4, 30.5, 32, 22.5, 7, 28, 2.5, 28, 35, 28, 22.5, 17, 30.5, 34, 8.5, 1, 10.5),
  Mark      = c(71, 30, 82, 98, 78, 73, 82, 25, 94, 99, 85, 74, 79, 82, 88, 55, 7, 62),
  Rank_Mark = c(9, 4, 17.5, 34, 14, 10.5, 17.5, 3, 32, 35, 22, 12, 15, 17.5, 26, 5, 1, 6)
)

# Right block
right <- tibble(
  Time      = c(29, 21, 31, 30, 33, 30, 33, 22, 29, 24, 30, 2, 31, 33, 25, 38, 26),
  Rank_Time = c(13.5, 5, 20.5, 17, 25, 17, 25, 6, 13.5, 8.5, 17, 2.5, 20.5, 25, 10.5, 33, 12),
  Mark      = c(91, 66, 66, 73, 90, 88, 91, 64, 83, 87, 96, 16, 84, 92, 82, 88, 75),
  Rank_Mark = c(29.5, 8, 23, 10.5, 28, 26, 29.5, 7, 20, 24, 33, 2, 21, 31, 17.5, 26, 13)
)

# Combine and format
data <- bind_rows(left, right)

data %>%
  gt() %>%
  tab_header(
    title = "Study Time vs Marks Dataset"
  ) %>%
  fmt_number(
    columns = everything(),
    decimals = 1
  ) %>%
  tab_options(
    table.font.size = "small"
  )
```
We start with the null and alternative hypotheses. The null hypothesis is given by
$$
H_{0}: \text{more time spent studying doesn't improve one's grade } (\rho_{s}=0)
$$
and the alternative hypothesis is
$$
H_{1}: \text{more time spent studying improves one's grade } (\rho_{s}>0)
$$
We will test at the $5\%$ significance level. To calculate the test statistic, we will need the differences. These are given in the table below.
```{r}
#######################################
# STUDY TIME VS MARK DATA (WITH d_i)
#######################################
# `data` was built in the previous chunk; here we only add the rank differences
data %>%
  mutate(
    d_i = Rank_Time - Rank_Mark
  ) %>%
  gt() %>%
  tab_header(
    title = "Study Time vs Marks Dataset (with Differences)"
  ) %>%
  fmt_number(
    columns = everything(),
    decimals = 1
  ) %>%
  tab_options(
    table.font.size = "small"
  )
```
Now, we are ready to calculate the test statistic.
$$
r_{s}=1-6\left[\frac{(8)^{2}+(0)^{2}+(13)^{2}+\dots+(7)^{2}+(-1)^{2}}{35((35)^{2}-1)}\right]\approx0.7251
$$
and the associated $z$-score will be
$$
z=0.7251\sqrt{35-1}=4.228
$$
The critical value (note that this is a one-sided test) is given by
```{r}
####################
# CRITICAL POINT
####################
zcrit <- qnorm(0.05, lower.tail=F)
zcrit
```
$z \geq 1.645$, and the $p$-value for the test statistic is given by
```{r}
##################
# P-VALUE
#################
pv <- pnorm(4.228, lower.tail=F)
pv
```
**Conclusion: We reject the null hypothesis since the test statistic falls into the rejection region, and the** $p$**-value is less than the significance level defined (**$\alpha=0.05$**). We then conclude that there is significant evidence of a positive relationship between the amount of time spent studying and the grade of a student.**
:::
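For a cross-check, the $z$ statistic can be reproduced from the tabulated ranks, and R's `cor.test()` with `method = "spearman"` can be run on the raw columns. Note that `cor.test()` recomputes the ranks itself and calculates the coefficient as the Pearson correlation of those ranks, so with ties its estimate can differ slightly from the hand calculation:
```{r}
##################################
# SPEARMAN TEST IN R
##################################
# Reproduce r_s and z from the tabulated ranks
n  <- nrow(data)
d  <- data$Rank_Time - data$Rank_Mark
rs <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))
c(rs = rs, z = rs * sqrt(n - 1))

# Built-in Spearman correlation test on the raw data
cor.test(data$Time, data$Mark,
         method = "spearman", alternative = "greater")
```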
# 7. Advantages and Disadvantages of Non-Parametric Statistical Techniques
::: {.callout-important title="Advantages of Non-Parameteric Tests"}
- Can be used when parametric techniques are not suited for the data samples given, and the validity of their assumptions is uncertain
- Useful for small sample sizes
- The assumptions are usually few and easily met
- They are not just restricted to quantitative data
:::
::: {.callout-important title="Disadvantages of Non-Parametric Tests"}
- Information is lost by ranking or taking signed ranks. As a result, these tests have less **power** (the probability of rejecting the null hypothesis when it is, in fact, false) than the equivalent parametric tests (when one is appropriate for the data)
:::