library(tidyverse)
library(here)
library(cowplot)
Despite the importance of the normal distribution, you will often come across data that is not normally distributed. There can be many reasons for this. We need to be able to spot non-normal data when we have it, and then use the appropriate statistical test.
Remember that many statistical tests only work with normally distributed data (or, more generally, with data whose distribution has a well-known mathematical form). The tests used in this case are often called parametric tests. The parameters referred to are, for example, the mean and standard deviation of the underlying normal distribution.
Where our data are not normally distributed, all is not lost. We use a suitable non-parametric test to determine whether data sets are different from each other or correlated with each other.
Remember the characteristics of the normal distribution: it is symmetric about its mean, unimodal, and in principle stretches out to infinity in both directions.
Sometimes, we have data that do not or cannot meet these criteria.
Sometimes we will gather data that is skewed left or skewed right. That is, when we plot the data as a histogram we do not get an approximation to a symmetric distribution, but one that is asymmetric, sometimes markedly so.
A common case is right (positively) skewed data, where some of the data are much larger than the bulk of the data, giving a histogram of the data with a long tail out to the right. This often occurs where there is a natural lower bound to the data, such as zero, so that negative values are impossible, but no upper bound.
Wind speeds and wave heights are well-known examples of this.
Here is some simulated wave height data of the sort that might be recorded by a buoy in the North Atlantic. Each data point represents the significant wave height for a 10-minute period.
set.seed(25)
# Simulate one month (31 days x 144 ten-minute periods = 4464 records)
# of significant wave heights from a Weibull distribution
wave <- tibble(swh = rweibull(4464, 2, 1.5))
ggplot(wave, aes(x = swh)) +
  geom_histogram(fill = "grey80", colour = "grey50", binwidth = 0.2) +
  labs(x = "Significant wave height (m)", y = "Count") +
  theme_cowplot()
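As a quick numerical check alongside the histogram (a minimal sketch using the simulated wave data above), we can compare the mean and the median: in right-skewed data the long upper tail pulls the mean above the median.
# Compare the mean and median of the simulated wave heights;
# a mean noticeably above the median suggests positive skew
wave %>%
  summarise(mean_swh = mean(swh),
            median_swh = median(swh))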
The distribution is positively skewed. A QQ plot of this data would be curved
ggplot(wave, aes(sample = swh)) +
  stat_qq() +
  stat_qq_line() +
  theme_cowplot()
and a normality test such as the Shapiro-Wilk test would give a tiny p-value. Both of these indicate that the data are not normally distributed.
shapiro.test(wave$swh)
##
## Shapiro-Wilk normality test
##
## data: wave$swh
## W = 0.97167, p-value < 2.2e-16
We would get similar data for any other month. Hence if we wanted to compare the wave heights of two different months, we could not use a t-test.
In this case, although the data do not follow a normal distribution, they are constant interval data and do follow another well-known distribution (the Weibull distribution). This sort of data can be analysed using non-parametric techniques; it is also possible to use a powerful set of parametric tools known as Generalised Linear Models, but these are beyond the scope of this course.
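For instance, to compare two months we could use the Mann-Whitney test introduced later in this section. This is a minimal sketch only: the second month of data (wave2) is simulated here with a slightly different Weibull scale parameter purely for illustration.
set.seed(26)
# Hypothetical second month of wave heights, simulated for illustration only
wave2 <- tibble(swh = rweibull(4464, 2, 1.7))
# Mann-Whitney (Wilcoxon rank sum) comparison of the two simulated months
wilcox.test(wave$swh, wave2$swh)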
Remember that normal distributions stretch out symmetrically either side of their mean, in principle out to infinity in each direction, although in practice values more than about three standard deviations from the mean are very rare. That means you should take care if your data are clustered near some natural boundary, beyond which it makes no sense for them to go.
A common case is data clustered near zero, where negative values make no sense.
For example, suppose you were counting instances of different behaviours of primates in a zoo enclosure. For 30 days in a row, just after the morning feeding time, you spent 20 minutes at the enclosure and counted instances of allogrooming and foraging behaviours.
This is what you got, plotted as a box plot:
lagf <- read_csv(here("data", "lagf.csv"))
# Box plot of allogrooming (AG) and foraging (F) counts
lagf %>% filter(behaviour %in% c("AG", "F")) %>%
  ggplot(aes(x = behaviour, y = count, fill = behaviour)) +
  geom_boxplot() +
  theme_cowplot() +
  theme(legend.position = "none")
Note that both sets of counts are bounded below by zero. A negative count makes no sense. Note too that the median count for allogrooming was actually zero: on at least half the days when observations were made, no allogrooming was observed.
If we wanted to ask whether foraging and allogrooming occurred equally often, we could not use a t-test, since these data are clearly not normally distributed.
A normality test for each group confirms this.
# Shapiro-Wilk p-value for each behaviour separately
lagf %>% filter(behaviour %in% c("AG", "F")) %>%
  group_by(behaviour) %>%
  summarise(p_value = shapiro.test(count)$p.value)
Remember that the null hypothesis of the Shapiro-Wilk test is that the data are normally distributed, so a tiny p-value is evidence that they are not.
A QQ plot for the counts of each behaviour tells the same story, particularly for the allogrooming data.
lagf %>% filter(behaviour %in% c("AG", "F")) %>%
  ggplot(aes(sample = count, colour = behaviour)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~behaviour) +
  theme_bw() +
  theme(legend.position = "none")
If your data are ordinal, but not constant interval, then you cannot (or at least you should not!) use a t-test to investigate differences.
Common examples of this are the Ballantine scale for shore exposure, abundance scales such as the SACFOR scale, any set of responses to Likert-type survey questions, pain scores in a clinical setting, and so on.
When our data cannot reasonably be described as normally distributed, we can instead use a so-called non-parametric test. The most commonly used of these is the Mann-Whitney test, which R confusingly calls the Wilcoxon rank sum test and implements as wilcox.test().
So if we were comparing the allogrooming and foraging counts, and wanted to see if there was a difference between them, we could write:
allogrooming <- filter(lagf, behaviour == "AG")
forage <- filter(lagf, behaviour == "F")
# Mann-Whitney (Wilcoxon rank sum) test comparing the two sets of counts
wilcox.test(allogrooming$count, forage$count)
##
## Wilcoxon rank sum test with continuity correction
##
## data: allogrooming$count and forage$count
## W = 994.5, p-value = 0.00767
## alternative hypothesis: true location shift is not equal to 0
The reported p-value is less than 0.05, and in fact less than 0.01, so we can reject the null hypothesis that the frequencies of the two behaviours are the same.
We would report this as:
The frequencies of allogrooming and foraging behaviours were found to differ (Mann-Whitney, W = 994.5, p < 0.01).
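The same approach works for ordinal data such as the Likert-type responses mentioned earlier. Here is a minimal sketch with invented scores for two hypothetical groups of respondents; the data are made up purely for illustration, and because ordinal scores contain many ties R will report an approximate p-value.
set.seed(42)
# Invented 1-5 Likert-style responses for two hypothetical groups
group_a <- sample(1:5, 40, replace = TRUE, prob = c(0.1, 0.2, 0.3, 0.3, 0.1))
group_b <- sample(1:5, 40, replace = TRUE, prob = c(0.3, 0.3, 0.2, 0.1, 0.1))
wilcox.test(group_a, group_b)
Finally, if the question is whether two non-normally distributed variables are correlated rather than different, a non-parametric option is Spearman's rank correlation. This sketch uses two invented right-skewed variables purely to show the syntax.
set.seed(99)
# Two right-skewed variables with some association, invented for illustration
x <- rexp(50)
y <- x + rexp(50)
cor.test(x, y, method = "spearman")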