My assumptions are violated! Part I

Alban Guillaumet, Troy University

Objective

Detecting deviations from normality (I)
General options for handling violations of assumptions (II)

Detecting deviations from normality

To check for normality, first (as always) look at your data. Histograms work best here.

Detecting deviations from normality

The following data come from a normal distribution:

They don't look normal, but they:

…don't have outliers
…aren't strongly skewed
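To see how much histograms of truly normal data can vary, one can simulate a few small samples. The following is only a minimal sketch: the sample size of 20, the seed, and the 2 x 2 layout are arbitrary choices made here for illustration, not values from the slides.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
n <- 20                          # assumed small sample size
samples <- replicate(4, rnorm(n), simplify = FALSE)

old_par <- par(mfrow = c(2, 2))  # 2 x 2 grid of panels
for (i in seq_along(samples)) {
  hist(samples[[i]], main = paste("Normal sample", i), xlab = "x")
}
par(old_par)                     # restore previous graphics settings
```

Each panel comes from the same normal population, so differences between panels reflect sampling variation only.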

Detecting deviations from normality

Examples of data from non-normal distributions:

Normal quantile plot

Definition: The quantile of a measurement specifies the fraction of observations less than or equal to it. For instance, the first and third quartiles are the 0.25 and 0.75 quantiles, and the median is the 0.50 quantile.
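As a small illustration of this definition, the sketch below computes the fraction of observations at or below a value and the three quartiles. The vector `x` is made up for this example. Note that `quantile()` interpolates between observations by default, so its results need not coincide with observed values.

```r
x <- c(2, 4, 4, 5, 7, 9, 10, 12)   # made-up measurements (illustration only)

# Fraction of observations less than or equal to 5
frac_le_5 <- mean(x <= 5)          # 4 of 8 observations, i.e. 0.5
Fn <- ecdf(x)                      # empirical cumulative distribution function
Fn(5)                              # same fraction, computed by ecdf()

# Quartiles and median as the 0.25, 0.50 and 0.75 quantiles
quartiles <- quantile(x, probs = c(0.25, 0.50, 0.75))
quartiles
```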

Definition: The normal quantile plot compares each observation in the sample with its quantile expected from the standard normal distribution. Points should fall roughly along a straight line if the data come from a normal distribution.

Normal quantile plot - R Example

In R (note: by default, qqnorm() puts the data on the y-axis; datax = TRUE puts them on the x-axis, as in the plots shown here)

qqnorm(rnorm(100), datax = TRUE)
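To make the axis convention concrete, the sketch below draws the same sample both ways. The seed and panel titles are arbitrary choices for illustration.

```r
set.seed(2)                       # arbitrary seed
x <- rnorm(100)

old_par <- par(mfrow = c(1, 2))
qqnorm(x, main = "Default: data on y-axis")
qqnorm(x, datax = TRUE, main = "datax = TRUE: data on x-axis")
par(old_par)

# qqnorm() also returns the plotted coordinates invisibly
default_qq <- qqnorm(x, plot.it = FALSE)
str(default_qq)                   # list with theoretical quantiles (x) and data (y)
```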

Normal quantile plot - R Example

How does this work?

(1) Sort the measurements \( x \) and assign each a rank \( i \) from 1 to \( n \).
(2) Estimate the proportion of the distribution lying below the observation ranked \( i \) as \( i/(n+1) \).
(3) The corresponding normal quantile \( q \) is the standard normal deviate \( Z \) for which the area under the standard normal curve to its left equals \( i/(n+1) \).
(4) Plot the measurements against the computed quantiles (\( q \) vs \( x \)).
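The four steps above can be collected into a small helper. This is a sketch only: `normal_scores()` is a hypothetical name used here for illustration, not a base R function, and the five input values are made up.

```r
normal_scores <- function(x) {
  x_sorted <- sort(x)              # (1) sort; ranks are then 1..n
  n <- length(x_sorted)
  p <- seq_len(n) / (n + 1)        # (2) estimated proportion below rank i
  q <- qnorm(p)                    # (3) matching standard normal quantile
  data.frame(x = x_sorted, p = p, q = q)
}

scores <- normal_scores(c(3.1, 1.2, 5.0, 2.2, 4.4))
scores
plot(q ~ x, data = scores,         # (4) quantiles against measurements
     xlab = "Sorted measurements", ylab = "Normal quantiles")
```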

Normal quantile plot - R Example

How does this work?

For example, when \( n = 5 \) the normal scores are the boundaries between six bins: \( (-\infty, -0.97] \), \( (-0.97, -0.43] \), \( (-0.43, 0.00] \), \( (0.00, 0.43] \), \( (0.43, 0.97] \), \( (0.97, +\infty) \). Each of these bins has area \( 1/(5+1) \approx 0.1667 \) under the standard normal curve. This means that the normal scores for a dataset with \( n = 5 \) are -0.97, -0.43, 0.00, 0.43, and 0.97.
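The equal-area property can be checked numerically. The sketch below recomputes the five scores and the area of each of the six bins using `pnorm()`.

```r
n <- 5
q <- qnorm((1:n) / (n + 1))              # the five normal scores
areas <- diff(pnorm(c(-Inf, q, Inf)))    # area of each of the n + 1 bins
round(q, 2)
areas
```

Every element of `areas` equals 1/6, which is why the scores cut the standard normal curve into six equally probable pieces.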

Normal quantile plot - R Example

How does this work?

n <- 5; ( p <- (1:n)/(n+1) ); ( q <- qnorm(p, lower.tail = TRUE) )

[1] 0.1666667 0.3333333 0.5000000 0.6666667 0.8333333

[1] -0.9674216 -0.4307273  0.0000000  0.4307273  0.9674216

Normal quantile plot - R Example

x <- sort(rnorm(100))  # (1) Sort measurements x
hist(x)

Normal quantile plot - R Example

p <- (1:100)/101  # (2) Compute the estimated proportion of the distribution lying below an observation ranked i as i/(n+1)
q <- qnorm(p, lower.tail = TRUE)  # (3) Compute the corresponding normal quantiles  
plot(q ~ x, xlab="Sorted measurements", ylab="Normal quantiles"); abline(v = median(x), col = "red"); abline(h = 0, col = "blue")  # (4) Plot measurements against computed quantiles (q vs x)

Normal quantile plot - R Example 2

x <- sort(rexp(100))  # (1)
hist(x); abline(v = median(x), col = "red")

Normal quantile plot - R Example 2

p <- (1:100)/101  # (2)
q <- qnorm(p, lower.tail = TRUE)  # (3)
plot(q ~ x, xlab="Sorted measurements", ylab="Normal quantiles"); abline(v = median(x), col = "red"); abline(h = 0, col = "blue")  # (4)
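The same kind of plot can be produced with the built-in functions. This is a sketch with an arbitrary seed. As far as I know, `qqnorm()` takes its plotting positions from `ppoints()` rather than \( i/(n+1) \), so the points may differ slightly from the manual version. `qqline()` adds a reference line through the first and third quartiles.

```r
set.seed(3)                        # arbitrary seed
x <- rexp(100)                     # right-skewed sample, as in Example 2

qqnorm(x, datax = TRUE)            # data on the x-axis, as in the plots above
qqline(x, datax = TRUE, col = "red")  # reference line through the quartiles
```

Points that bend away from the reference line, as they should for exponential data, indicate a departure from normality.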

How to interpret normal quantile plots

Normal distribution?

Marine reserve example

Question: Are marine reserves effective in preserving marine wildlife?

Research design

Halpern (2003) matched each of 32 marine reserves to a control location, which was either the site of the reserve before it became protected or a similar unprotected site nearby. They then evaluated the “biomass ratio,” which is the total mass of all marine plants and animals per unit area of reserve divided by the same quantity in the unprotected control.

Discuss: Observational or experimental? Paired or unpaired? Interpret response measure in terms of effect of protection.

Answer: Observational. Paired (matching). Biomass ratio = 1 (no effect); > 1 (beneficial effect); < 1 (detrimental effect).

Marine reserve example

Statistical test for normality?

Definition: A Shapiro-Wilk test evaluates the goodness of fit of a normal distribution to a set of data randomly sampled from a population.

\( H_{0} \): The data are sampled from a population having a normal distribution.
\( H_{A} \): The data are sampled from a population NOT having a normal distribution.

Caution:

A small sample may not provide enough power to reject \( H_{0} \), even when the population is clearly non-normal.

A large sample may lead to rejecting \( H_{0} \) even when the deviation from normality is very slight.
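Both cautions can be explored by simulation. The sketch below is illustrative only: the sample sizes, the number of replicates, the exponential and \( t \) (10 degrees of freedom) populations, and the 0.05 threshold are all assumptions chosen here, not values from the slides.

```r
set.seed(4)                                   # arbitrary seed
alpha <- 0.05

# Small samples (n = 8) from a clearly non-normal (exponential) population
p_small <- replicate(500, shapiro.test(rexp(8))$p.value)
reject_small <- mean(p_small < alpha)         # estimated power

# Large samples (n = 5000) from a mildly heavy-tailed t distribution
p_large <- replicate(200, shapiro.test(rt(5000, df = 10))$p.value)
reject_large <- mean(p_large < alpha)         # estimated rejection rate

c(small_n = reject_small, large_n = reject_large)
```

Comparing the two rejection rates shows how strongly the outcome of the test depends on sample size. Note that `shapiro.test()` accepts between 3 and 5000 observations.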

Shapiro-Wilk Test - R Example

par(cex.lab = 1.5)
marine <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter13/chap13e1MarineReserve.csv")
hist(marine$biomassRatio)

Shapiro-Wilk Test - R Example

shapiro.test(marine$biomassRatio)


    Shapiro-Wilk normality test

data:  marine$biomassRatio
W = 0.81751, p-value = 8.851e-05
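To turn such output into a decision, the p-value is compared with a chosen significance level. The helper below is a hypothetical sketch: `report_normality()` and the default `alpha = 0.05` are assumptions for illustration, and it is demonstrated on simulated data rather than on the marine reserve data.

```r
report_normality <- function(x, alpha = 0.05) {
  test <- shapiro.test(x)
  decision <- if (test$p.value < alpha) "reject H0" else "fail to reject H0"
  list(W = unname(test$statistic), p_value = test$p.value, decision = decision)
}

set.seed(5)                          # arbitrary seed
result <- report_normality(rexp(32)) # simulated stand-in, not the marine data
result
```

For marine$biomassRatio, the p-value reported above (8.851e-05) is far below 0.05, so \( H_{0} \) is rejected: the biomass ratios are unlikely to come from a normally distributed population.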