Handling violations of assumptions

M. Drew LaMar
April 1, 2016

Class announcements

  • Lab #8: Added DataCamp course (Chapter 2 of Cleaning Data)
  • Lab #8 new due date: Friday, April 8, 11:59 pm
  • No more DataCamp (stop paying if you want to) - Download slides first!!!
  • Future HWs will be due on Fridays now
  • Labs will be for going over HW and projects
  • Mea Culpa: Exam will be graded and returned on Monday

Handling violations of assumptions

Four options for handling violations of assumptions:

  • Ignore the violations of assumptions
  • Transform the data
  • Use a nonparametric method
  • Use a permutation test (computer-intensive methods)

Need to detect deviations first

To check for normality, first (as always) look at your data. Histograms work best here.

Detecting deviations from normality

The following data come from a normal distribution:

They don't look normal, but they:

  • …don't have outliers
  • …aren't skewed
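
One way to build intuition for how irregular small normal samples can look is to simulate a few. The sketch below is only an illustration, not the code behind the original figure: the sample size of 20, the seed, and the 3 x 3 panel layout are all arbitrary assumptions.

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
n <- 20                           # assumed sample size per panel

# Nine independent samples, each drawn from a standard normal distribution
samples <- replicate(9, rnorm(n), simplify = FALSE)

old_par <- par(mfrow = c(3, 3))   # 3 x 3 grid of histograms
for (s in samples) {
  hist(s, main = "", xlab = "Measurement", col = "grey80")
}
par(old_par)                      # restore the previous plotting layout
```

Rerunning with a different seed shows how much the shape of the histogram varies from sample to sample, even though the population is exactly normal.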

Detecting deviations from normality

Examples of data from non-normal distributions:

Normal quantile plot

Definition: The normal quantile plot compares each observation in the sample with its quantile expected from the standard normal distribution. Points should fall roughly along a straight line if the data come from a normal distribution.

Normal quantile plot - R Example

  1. Sort measurements (\( x \))
  2. Compute percentiles of \( x \) (cumulative probabilities, \( p \))
  3. Compute standard normal quantiles from percentiles (\( q \))
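  4. Plot measurements against computed quantiles (\( q \) vs \( x \))

x <- sort(rnorm(20))                 # (1) sorted sample of 20 standard normal draws
p <- seq_along(x) / (length(x) + 1)  # (2) cumulative probabilities i/(n + 1)
q <- qnorm(p, lower.tail = TRUE)     # (3) standard normal quantiles for each p
plot(q ~ x, xlab="Measurements", ylab="Normal quantiles")  # (4)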

Normal quantile plot - R Example

Fast way (note: by default qqnorm puts the data on the y-axis; datax = TRUE puts the measurements on the x-axis, as in the manual plot)

qqnorm(x, datax = TRUE)
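
As a hedged sketch of how the fast way relates to the manual steps: qqnorm takes its cumulative probabilities from ppoints, which are close to but not identical to i/(n + 1), so the two plots will differ slightly in the tails. The example below checks this and adds a reference line with qqline; the seed and sample size are arbitrary assumptions.

```r
set.seed(1)                       # arbitrary seed for reproducibility
x <- sort(rnorm(20))

# Theoretical quantiles computed by qqnorm (default orientation, no plot drawn)
qq <- qqnorm(x, plot.it = FALSE)

q_ppoints <- qnorm(ppoints(length(x)))              # plotting positions used by qqnorm
q_manual  <- qnorm(seq_along(x) / (length(x) + 1))  # i/(n + 1), as in the manual steps

# Measurements on the x-axis, with a reference line through the quartiles
qqnorm(x, datax = TRUE)
qqline(x, datax = TRUE)
```

Points that stay close to the reference line are consistent with a normal population; systematic curvature suggests skew or heavy tails.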

Marine reserve example

Question: Are marine reserves effective in preserving marine wildlife?

Experimental design

Halpern (2003) matched each of 32 marine reserves to a control location, which was either the site of the reserve before it became protected or a similar unprotected site nearby. The study then evaluated the “biomass ratio,” which is the total mass of all marine plants and animals per unit area in the reserve divided by the same quantity in the matched unprotected area.

Discuss: Observational or experimental? Paired or unpaired? Interpret response measure in terms of effect of protection.

Answer: Observational. Paired (matching). Biomass ratio = 1 (no effect); > 1 (beneficial effect); < 1 (detrimental effect).

How to interpret normal quantile plots

Practice Problem #4: Interpret the following normal quantile plots.

Statistical test for normality??

Definition: A Shapiro-Wilk test evaluates the goodness of fit of a normal distribution to a set of data randomly sampled from a population.

\( H_{0} \): The data are sampled from a population having a normal distribution.
\( H_{A} \): The data are sampled from a population not having a normal distribution.

Cautions:

  • With a small sample, the test might not have enough power to detect a real departure from normality.
  • With a large sample, the test can have too much power (it rejects even when the deviation from normality is very slight).
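
Both cautions can be explored by simulation. The following is a minimal sketch under stated assumptions, not an analysis from the lecture: the exponential and t (10 df) populations, the sample sizes of 10 and 5000, the number of simulations, and the seed are all chosen only for illustration.

```r
set.seed(123)                     # arbitrary seed
alpha <- 0.05
n_sim <- 200                      # assumed number of simulated samples

# Small samples (n = 10) from a clearly skewed exponential population
reject_small <- replicate(n_sim, shapiro.test(rexp(10))$p.value < alpha)

# Large samples (n = 5000) from a nearly normal population:
# a t distribution with 10 df has only slightly heavier tails than the normal
reject_large <- replicate(n_sim, shapiro.test(rt(5000, df = 10))$p.value < alpha)

mean(reject_small)   # proportion of rejections with n = 10
mean(reject_large)   # proportion of rejections with n = 5000
```

Comparing the two proportions shows how the outcome of the test depends on sample size as well as on how far the population is from normal.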

Shapiro-Wilk Test - R Example

marine <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter13/chap13e1MarineReserve.csv")
shapiro.test(marine$biomassRatio)

    Shapiro-Wilk normality test

data:  marine$biomassRatio
W = 0.81751, p-value = 8.851e-05

Conclusion: Use a combination of graphical methods, formal testing, and common sense.

When to ignore violation of assumptions

Definition: A statistical procedure is robust if the answer it gives is not sensitive to violations of assumptions of the method.

Main takeaway point: Whether a violation can be ignored must be decided case by case, depending on the statistical test and the data (see book for discussion).
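
Robustness can be checked for a specific test and kind of data by simulation. The sketch below is a hypothetical illustration, not a result from the lecture or the book: it estimates how often a one-sample t-test rejects a true null hypothesis when the data are skewed. The exponential population, n = 30, the number of simulations, and the seed are assumptions.

```r
set.seed(2016)                    # arbitrary seed
n_sim <- 2000                     # assumed number of simulated data sets
alpha <- 0.05
n     <- 30                       # assumed sample size

# An exponential distribution with rate 1 has true mean 1,
# so the null hypothesis mu = 1 is true in every simulated data set
p_values <- replicate(n_sim, t.test(rexp(n, rate = 1), mu = 1)$p.value)

type1_rate <- mean(p_values < alpha)   # estimated Type I error rate
type1_rate
```

If the estimated rate is close to the nominal 0.05, the test is behaving robustly for this combination of skew and sample size; changing n or the population shows where that stops being the case.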

Data transformations

Definition: A data transformation changes each measurement by the same mathematical formula.

Common transformations:

  • Log transformation (data skewed right) \[ Y^{\prime} = \ln[Y] \]
  • Arcsine transformation (data are proportions) \[ p^{\prime} = \arcsin[\sqrt{p}] \]
  • Square-root transformation (data are counts) \[ Y^{\prime} = \sqrt{Y + 1/2} \]
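
In R, each of these transformations is a one-line vectorized expression. The values below are made up purely for illustration, and the variable names are hypothetical.

```r
y_ratio <- c(1.2, 0.8, 3.5, 10.4, 2.1)   # made-up positive measurements
p       <- c(0.05, 0.20, 0.50, 0.95)     # made-up proportions in [0, 1]
counts  <- c(0, 1, 4, 9, 25)             # made-up counts

log_y    <- log(y_ratio)          # natural log; requires all values > 0
arcsin_p <- asin(sqrt(p))         # arcsine square-root; result is in radians
sqrt_y   <- sqrt(counts + 1/2)    # square root with the 1/2 offset

# Back-transformations recover the original scale
exp(log_y)            # same as y_ratio
sin(arcsin_p)^2       # same as p
sqrt_y^2 - 1/2        # same as counts
```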

Other transformations:

  • Square transformation (data skewed left) \[ Y^{\prime} = Y^2 \]
  • Antilog transformation (data skewed left) \[ Y^{\prime} = e^{Y} \]
  • Reciprocal transformation (data skewed right) \[ Y^{\prime} = \frac{1}{Y} \]
  • Box-Cox transformation (skew) \[ Y^{\prime}_{\lambda} = \frac{Y^{\lambda} - 1}{\lambda} \] for \( \lambda \neq 0 \), with \( Y^{\prime}_{0} = \ln[Y] \)
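
The Box-Cox family ties several of these transformations together. Below is a minimal sketch, assuming strictly positive data; box_cox is a hypothetical helper written for this example, not a function from base R, and the data values are made up. The case lambda = 0 is handled as the natural log, which is the limit of the formula as lambda approaches 0.

```r
# Hypothetical helper implementing the Box-Cox formula for positive data
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

y <- c(0.5, 1, 2, 10)             # made-up positive measurements

box_cox(y, 1)      # y - 1: a shift only, so the shape is unchanged
box_cox(y, 2)      # (y^2 - 1) / 2: square transformation up to shift and scale
box_cox(y, -1)     # 1 - 1/y: reciprocal transformation up to sign and shift
box_cox(y, 1e-6)   # very close to log(y)
```

In practice lambda is chosen to make the transformed data as close to normal as possible; how that choice is made is outside this sketch.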

Log transformations - When to use

  • Measurements are ratios or products
  • Frequency distribution is skewed right
  • Group having larger mean also has larger standard deviation
  • Data span several orders of magnitude
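
For ratio data, a log transformation turns "ratio = 1" into "log ratio = 0", so a one-sample t-test on the logs can test for no effect. The sketch below uses simulated ratios, not the Halpern (2003) data; the lognormal parameters, the sample size of 32, and the seed are assumptions made only so the example runs on its own.

```r
set.seed(7)                                      # arbitrary seed
ratio <- rlnorm(32, meanlog = 0.4, sdlog = 0.6)  # simulated ratios, NOT real data

log_ratio <- log(ratio)           # right-skewed ratios become roughly symmetric

# H0: mean log ratio = 0, which corresponds to a ratio of 1 (no effect)
fit <- t.test(log_ratio, mu = 0)

geo_mean <- exp(mean(log_ratio))  # back-transformed mean = geometric mean of ratios
ci_ratio <- exp(fit$conf.int)     # back-transformed 95% confidence interval
```

Note that back-transforming the mean of the logs gives the geometric mean of the ratios, not their arithmetic mean, so the confidence interval is for the geometric mean.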
