My assumptions are violated!!

M. Drew LaMar
November 14, 2022

Class Announcements

Homework #7 is posted (Chapters 10-12)
- ~~Due Monday, November 21, 11:59 pm~~
Reading Assignment for Wednesday (~~NO QUIZ~~)
- Whitlock & Schluter, Chapter 13: Handling violations of assumptions

Handling violations of assumptions

Four options for handling violations of assumptions:

Ignore the violations of assumptions
Transform the data
Use a nonparametric method
Use a permutation test (computer-intensive methods)

~~Need to detect deviations first~~

To check for normality, first (as always) look at your data. Histograms work best here.

Detecting deviations from normality

The following data come from a normal distribution:

They don't look normal, but they:

…don't have outliers
…aren't skewed
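
One way to build intuition for how irregular small normal samples can look is to simulate a few. The sketch below is illustrative only: the sample size of 20, the seed, and the 2-by-2 panel layout are assumptions and are not the data behind this slide.

```r
# Draw four small samples from a standard normal distribution and plot each one.
set.seed(13)  # arbitrary seed, used only for reproducibility
samples <- replicate(4, rnorm(20), simplify = FALSE)

old_par <- par(mfrow = c(2, 2))  # 2-by-2 grid of panels
for (i in seq_along(samples)) {
  hist(samples[[i]], main = paste("Normal sample", i), xlab = "Measurement")
}
par(old_par)  # restore the previous plotting layout
```

Changing the seed gives a different set of histograms, which is a quick way to see how much small normal samples vary.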

Detecting deviations from normality

Examples of data from non-normal distributions:

Normal quantile plot

Definition: The normal quantile plot compares each observation in the sample with its quantile expected from the standard normal distribution. Points should fall roughly along a straight line if the data come from a normal distribution.

Normal quantile plot - R Example

(1) Sort measurements (\( x \))
(2) Compute percentiles of \( x \) (cumulative probabilities, \( p \))
(3) Compute standard normal quantiles from percentiles (\( q \))
(4) Plot measurements against computed quantiles (\( q \) vs \( x \))

x <- sort(rnorm(20))  # (1)
p <- (1:20)/21  # (2)
q <- qnorm(p, lower.tail = TRUE)  # (3)
plot(q ~ x, xlab="Measurements", ylab="Normal quantiles")  # (4)

[Figure: normal quantile plot of x, with measurements on the x-axis and normal quantiles on the y-axis]

Normal quantile plot - R Example

Fast way (note: axes are flipped by default!)

qqnorm(x, datax = TRUE)

[Figure: normal quantile plot produced by qqnorm(x, datax = TRUE)]

Marine reserve example

Question: Are marine reserves effective in preserving marine wildlife?

Experimental design

Halpern (2003) matched each of 32 marine reserves to a control location, which was either the site of the reserve before it became protected or a similar unprotected site nearby. The study then evaluated the “biomass ratio,” the total mass of all marine plants and animals per unit area in the reserve divided by the same quantity in the matched unprotected area.

Discuss: Observational or experimental? Paired or unpaired? Interpret response measure in terms of effect of protection.

Answer: Observational. Paired (matching). Biomass ratio = 1 (no effect); > 1 (beneficial effect); < 1 (detrimental effect).

How to interpret normal quantile plots

Practice Problem #4: Interpret the following normal quantile plots.

Statistical test for normality??

Definition: A Shapiro-Wilk test evaluates the goodness of fit of a normal distribution to a set of data randomly sampled from a population.

\( H_{0} \): The data are sampled from a population having a normal distribution.
\( H_{A} \): The data are sampled from a population not having a normal distribution.

Cautions:

With a small sample, the test might not have enough power to detect a real deviation from normality.
With a large sample, the test can have too much power (it rejects even when the deviation from normality is very slight).
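
Both cautions can be illustrated with a small simulation. The sketch below is not from the book: the mildly right-skewed gamma distribution, the sample sizes, the number of replicates, and the helper name `reject_rate` are all assumptions chosen for illustration.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Proportion of simulated samples of size n for which the Shapiro-Wilk test
# rejects normality at the 0.05 level. The samples come from a mildly
# right-skewed gamma distribution (an assumed example of a slight deviation).
reject_rate <- function(n, reps = 500) {
  p_values <- replicate(reps, shapiro.test(rgamma(n, shape = 20))$p.value)
  mean(p_values < 0.05)
}

power_small <- reject_rate(10)    # small samples
power_large <- reject_rate(1000)  # large samples, same distribution
c(n10 = power_small, n1000 = power_large)
```

The same mild skew should be flagged far more often at the larger sample size, which is why the test result needs to be read alongside the plots.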

Shapiro-Wilk Test - R Example

marine <- read.csv("/Users/mdlama/Dropbox/Work/Teaching/College of William and Mary/Fall 2018/Datasets/chapter13/chap13e1MarineReserve.csv")
hist(marine$biomassRatio)

[Figure: histogram of marine$biomassRatio]

shapiro.test(marine$biomassRatio)


    Shapiro-Wilk normality test

data:  marine$biomassRatio
W = 0.81751, p-value = 8.851e-05

Conclusion: Judge normality using a combination of graphical methods, formal testing, and common sense.

When to ignore violations of assumptions

~~Ignore the violations of assumptions~~
Transform the data
Use a nonparametric method
Use a permutation test (computer-intensive methods)

Definition: A statistical procedure is robust if the answer it gives is not sensitive to violations of assumptions of the method.

Main takeaway point: Whether a violation can be ignored must be decided case by case; it depends on the statistical test and the data (see book for discussion).
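
Robustness can be explored by simulation. The sketch below estimates the actual Type I error rate of a one-sample \( t \)-test when the data are strongly right-skewed. The exponential distribution, the sample sizes, the number of replicates, and the helper name `type1_rate` are assumptions chosen for illustration, not an example from the book.

```r
set.seed(42)  # arbitrary seed for reproducibility

# rexp(n) has true mean 1, so the null hypothesis mu = 1 is true and every
# rejection is a Type I error. A robust test should reject about 5% of the time.
type1_rate <- function(n, reps = 2000) {
  p_values <- replicate(reps, t.test(rexp(n), mu = 1)$p.value)
  mean(p_values < 0.05)
}

rate_n10  <- type1_rate(10)   # small sample from a skewed population
rate_n200 <- type1_rate(200)  # larger sample from the same population
c(n10 = rate_n10, n200 = rate_n200)
```

Comparing the two rates with the nominal 0.05 gives a rough sense of how much the violation matters at each sample size.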

Data transformations

Ignore the violations of assumptions
~~Transform the data~~
Use a nonparametric method
Use a permutation test (computer-intensive methods)

Definition: A data transformation changes each measurement by the same mathematical formula.

Data transformations

Common transformations:

Log transformation (data skewed right) \[ Y^{\prime} = \ln[Y] \]
Arcsine transformation (data are proportions) \[ p^{\prime} = \arcsin[\sqrt{p}] \]
Square-root transformation (data are counts) \[ Y^{\prime} = \sqrt{Y + 1/2} \]
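
In R, each of these transformations is a one-line vectorized expression. The values below are hypothetical and serve only to show the syntax.

```r
# Hypothetical example values, not taken from any data set in these slides
ratios      <- c(0.8, 1.2, 1.9, 3.5, 12.0)  # right-skewed, strictly positive
proportions <- c(0.05, 0.25, 0.50, 0.90)    # values between 0 and 1
counts      <- c(0, 1, 4, 9, 23)            # non-negative counts

log_ratios  <- log(ratios)              # Y' = ln[Y]  (log() is the natural log in R)
asin_props  <- asin(sqrt(proportions))  # p' = arcsin[sqrt(p)]
sqrt_counts <- sqrt(counts + 1/2)       # Y' = sqrt(Y + 1/2)
```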

Data transformations

Other transformations:

Square transformation (data skewed left) \[ Y^{\prime} = Y^2 \]
Antilog transformation (data skewed left) \[ Y^{\prime} = e^{Y} \]
Reciprocal transformation (data skewed right) \[ Y^{\prime} = \frac{1}{Y} \]
Box-Cox transformation (skew) (Note: \( Y > 0 \)) \[ Y^{\prime}_{\lambda} = \left\{\begin{array}{ll}\frac{Y^{\lambda} - 1}{\lambda}, & \mathrm{if} \ \lambda \neq 0 \\ \ln[Y], & \mathrm{if} \ \lambda = 0\end{array}\right. \]
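
The Box-Cox formula can be written as a short function. This is a minimal sketch: the function name `box_cox` and the example values are hypothetical, and choosing \( \lambda \) from the data is not covered here.

```r
# Box-Cox transformation of a strictly positive vector y for a given lambda.
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))  # the transformation requires Y > 0
  if (lambda == 0) {
    log(y)                     # limiting case as lambda approaches 0
  } else {
    (y^lambda - 1) / lambda
  }
}

y <- c(0.5, 1, 2, 10)  # hypothetical positive measurements
box_cox(y, 0)          # same as log(y)
box_cox(y, 0.5)        # related to the square-root transformation
box_cox(y, 1e-6)       # very close to log(y)
```

The last call shows why the log is used at \( \lambda = 0 \): as \( \lambda \) shrinks toward zero, the general formula approaches the natural log.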

Log transformations - When to use

Measurements are ratios or products
Frequency distribution is skewed right
Group having larger mean also has larger standard deviation
Data span several orders of magnitude

Log transformations - When to use

Measurements are ratios or products
Frequency distribution is skewed right
~~Group having larger mean also has larger standard deviation~~
Data span several orders of magnitude

Log transformations - How to use

Hypothesis testing

marine <- read.csv("/Users/mdlama/Dropbox/Work/Teaching/College of William and Mary/Fall 2018/Datasets/chapter13/chap13e1MarineReserve.csv")
shapiro.test(log(marine$biomassRatio))


    Shapiro-Wilk normality test

data:  log(marine$biomassRatio)
W = 0.93795, p-value = 0.06551

Log transformations - How to use

Hypothesis testing

hist(log(marine$biomassRatio))

[Figure: histogram of log(marine$biomassRatio)]

Log transformations - How to use

Original statistical hypotheses:

\( H_{0} \): The mean of the biomass ratio of marine reserves is one (\( \mu = 1 \))
\( H_{A} \): The mean of the biomass ratio of marine reserves is not one (\( \mu \neq 1 \))

Transformed statistical hypotheses:

\( H_{0} \): The mean of the log biomass ratio of marine reserves is zero (\( \mu^{\prime} = 0 \))
\( H_{A} \): The mean of the log biomass ratio of marine reserves is not zero (\( \mu^{\prime} \neq 0 \))

Log transformations - How to use

t.test(log(marine$biomassRatio), mu=0)


    One Sample t-test

data:  log(marine$biomassRatio)
t = 7.3968, df = 31, p-value = 2.494e-08
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.3470180 0.6112365
sample estimates:
mean of x 
0.4791272

Log transformations - How to use

Estimation

The 95% confidence interval for the log transformed data is

\[ 0.347 < \mu^{\prime} < 0.611. \]

For a 95% confidence interval of the untransformed data, we have

\[ e^{0.347} < \mathrm{geometric \ mean} < e^{0.611}, \]

\[ 1.41 < \mathrm{geometric \ mean} < 1.84. \]
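
The back-transformation can be checked directly in R. The numbers below are copied from the t.test output above; the variable names are illustrative. Exponentiating the mean of the logs gives the geometric mean, not the arithmetic mean, which is why the interval is stated for the geometric mean.

```r
# Confidence limits and sample mean on the log scale, from the t.test output
ci_log   <- c(lower = 0.3470180, upper = 0.6112365)
mean_log <- 0.4791272

# Back-transform to the original (ratio) scale
ci_geometric   <- exp(ci_log)
geometric_mean <- exp(mean_log)

round(ci_geometric, 2)
round(geometric_mean, 2)
```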

Discuss: Conclusion?

Data transformations - Caveats

Check the sign of your data: the log and square-root transformations are undefined for negative values, and the log is also undefined at zero.
Avoid multiple testing with transformations (i.e., do not try every transformation and keep the one that gives a significant result).
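
The sign problem is easy to see in R. The vector below is hypothetical; `suppressWarnings()` is used only to silence the "NaNs produced" warning that R would otherwise print.

```r
y <- c(-2, 0, 0.5, 3)  # hypothetical measurements with a negative value and a zero

log_y  <- suppressWarnings(log(y))   # NaN for -2, -Inf for 0
sqrt_y <- suppressWarnings(sqrt(y))  # NaN for -2

is.finite(log_y)   # flags which transformed values are usable
is.finite(sqrt_y)
```

Checking `is.finite()` on the transformed values before running a test is a simple guard against silently carrying NaN or -Inf into the analysis.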

Use a nonparametric method

Ignore the violations of assumptions
Transform the data
~~Use a nonparametric method~~
Use a permutation test (computer-intensive methods)

Definition: A nonparametric method makes fewer assumptions than standard parametric methods do about the distribution of the variables.

Property: Nonparametric methods are usually based on the ranks of the data points (medians, quartiles, etc.)

Property: Nonparametric tests are typically less powerful than parametric tests.

Use a nonparametric method

A nonparametric alternative to the one-sample \( t \)-test is the sign test.

Definition: The sign test compares the median of a sample to a constant specified in the null hypothesis. It makes no assumptions about the distribution of the measurements in the population.

A nonparametric alternative to the two-sample \( t \)-test is the Mann-Whitney \( U \)-test.

Definition: The Mann-Whitney \( U \)-test compares the distributions of two groups. It does not require as many assumptions as the two-sample \( t \)-test.

Sign test: Binomial test in disguise

Algorithm:

First, state a null hypothesized median.
Label all measurements larger than this median with a “\( + \)”, and all measurements smaller than this median with a “\( - \)”.
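Throw out any measurements exactly equal to the median (the sample size is reduced by this amount).
Use a binomial test whose test statistic is the number of “\( + \)” values (or “\( - \)” values), comparing the result to the null proportion \( p_{0}=0.5 \).

The sign test has very little power. If \( n \leq 5 \), the sign test cannot be used: the smallest possible two-sided \( P \)-value is \( 2 \times 0.5^{5} = 0.0625 \), so the null hypothesis can never be rejected at \( \alpha = 0.05 \).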
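
The algorithm above can be wrapped in a small helper. This is a sketch under stated assumptions: the function name `sign_test`, its argument names, and the data are hypothetical; only `binom.test()` is a built-in R function.

```r
# Sign test: compare the median of x to a hypothesized value median0.
sign_test <- function(x, median0 = 0) {
  x <- x[x != median0]        # throw out values exactly equal to the median
  n_plus <- sum(x > median0)  # number of "+" values
  binom.test(n_plus, n = length(x), p = 0.5)
}

# Hypothetical data: 9 values above zero, 3 below zero, 1 tie
x <- c(rep(1.5, 9), rep(-0.7, 3), 0)
result <- sign_test(x)
result$p.value

# Most extreme outcome with n = 5: all five values on the same side
min_p_n5 <- binom.test(5, n = 5, p = 0.5)$p.value
```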

Example: Rainforests

Assignment problem #25

Researchers have observed that rainforest areas next to clear-cuts (less than 100 meters away) have a reduced tree biomass compared to rainforest areas far from clear-cuts. To go further, Laurance et al. (1997) tested whether rainforest areas more distant from the clear-cuts were also affected. They compiled data on the biomass change after clear-cutting (in tons/hectare/year) for 36 rainforest areas between 100m and several kilometers from clear-cuts.

Example: Rainforests

Look at data

hist(clearcuts$biomassChange)

[Figure: histogram of clearcuts$biomassChange]

Example: Transformations?

hist(exp(clearcuts$biomassChange), main="Exponential transformation")

[Figure: histogram of exp(clearcuts$biomassChange), titled "Exponential transformation"]

Use sign test

\( H_{0} \): The median change in biomass is zero.
\( H_{A} \): The median change in biomass is not zero.

# Any biomass equal to zero?
sum(clearcuts$biomassChange == 0)

[1] 0

# How many plots have positive change in biomass?
(X <- sum(clearcuts$biomassChange > 0))

[1] 21

Use sign test

# Perform binomial test
binom.test(X, n=36, p=0.5)


    Exact binomial test

data:  X and 36
number of successes = 21, number of trials = 36, p-value = 0.405
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.4075652 0.7448590
sample estimates:
probability of success 
             0.5833333

Nonparametric two-sample t-test

A nonparametric alternative to the two-sample \( t \)-test is the Mann-Whitney \( U \)-test.

Definition: The Mann-Whitney \( U \)-test compares the distributions of two groups. It does not require as many assumptions as the two-sample \( t \)-test.
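
The test works on ranks rather than on the raw measurements. As a rough sketch with hypothetical data (the group names and values are made up), the statistic that R reports as W can be reproduced by ranking the pooled data, summing the ranks of the first group, and subtracting the smallest rank sum that group could have.

```r
# Hypothetical measurements for two groups (no ties)
group_a <- c(1.2, 3.4, 5.6, 7.8)
group_b <- c(2.3, 4.5, 6.7, 8.9, 10.1)

ranks <- rank(c(group_a, group_b))      # rank the pooled data
n_a   <- length(group_a)
rank_sum_a <- sum(ranks[seq_len(n_a)])  # sum of ranks belonging to group_a

# Subtract the minimum possible rank sum for a group of size n_a
w_manual  <- rank_sum_a - n_a * (n_a + 1) / 2
w_builtin <- unname(wilcox.test(group_a, group_b)$statistic)

c(manual = w_manual, builtin = w_builtin)
```

Because only the ranks enter the calculation, any transformation that preserves the ordering of the data leaves the statistic unchanged.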

Example: Autoimmune diseases and gut microbes

Assignment Problem #32

T and B lymphocytes are normal components of the immune system, but in multiple sclerosis they become autoreactive and attack the central nervous system. What triggers the autoimmune process? One hypothesis is that the disease is initiated by environmental factors, especially microbial infection. However, recent work by Berer et al. (2011) on the mouse model of the disease suggests that the autoimmune process is triggered by nonpathogenic microbes living in the gut.

Example: Autoimmune diseases and gut microbes

They compared the onset of autoimmune encephalomyelitis in two treatment groups of mice from a strain that carries transgenic human CD4\( ^{+} \) T cells, which initiate the disease. One group (GF) was kept free of nonpathogenic gut microbes and all pathogens. The other (SPF) was only pathogen-free and served as a control. They measured the percentage of T cells producing the molecule interleukin-17 in tissue samples from 16 mice in the two groups.

Example: Autoimmune and gut microbes

Look at the data

  treatment percentInterleukin17
1       SPF                18.87
2       SPF                15.65
3       SPF                13.45
4       SPF                12.95
5       SPF                 6.01
6       SPF                 5.84

Discuss: Are these data in tidy or messy format?

Answer: Tidy

Example: Autoimmune and gut microbes

Look at data

[Figure: plot of the interleukin-17 data]

Discuss: How well do the data meet the assumptions of statistical tests?

Example: Autoimmune and gut microbes

Look at data

[Figure: histograms of percentInterleukin17, one panel per treatment group]

Example: Autoimmune and gut microbes

Look at data

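library(dplyr)    # provides the %>% pipe
library(ggplot2)  # provides ggplot(), geom_histogram(), facet_grid()

# mydata is the data frame previewed earlier (treatment, percentInterleukin17)
mydata %>% 
  ggplot(aes(x = percentInterleukin17)) +
  geom_histogram(binwidth=4) +
  facet_grid(treatment ~ .) +
  xlab("Percent interleukin-17") + 
  ylab("Frequency")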

Example: Autoimmune and gut microbes

Mann-Whitney \( U \)-test (Wilcoxon rank-sum test)

\( H_{0} \): The distribution of interleukin-17 is the same in the two groups.
\( H_{A} \): The distribution of interleukin-17 is NOT the same in the two groups.

wilcox.test(percentInterleukin17 ~ treatment, data = mydata)

Example: Autoimmune and gut microbes

Mann-Whitney \( U \)-test (Wilcoxon rank-sum test)


    Wilcoxon rank sum exact test

data:  percentInterleukin17 by treatment
W = 6, p-value = 0.004662
alternative hypothesis: true location shift is not equal to 0

Conclusion?

Assumptions of Mann-Whitney U-test

The Mann-Whitney \( U \)-test tests if the distributions are the same.

If the distributions of the two groups have the same shape (same variance and skew), then the Mann-Whitney \( U \)-test can be used to compare the locations (means or medians) of the two groups (see Example 13.5).

~~It is for this reason that this test gets misused a lot in the literature.~~