M. Drew LaMar
November 8, 2021
Four options for handling violations of assumptions:
Need to detect deviations first
To check for normality, first (as always) look at your data. Histograms work best here.
The following data come from a normal distribution:
They don't look normal, but they are.
Examples of data from non-normal distributions:
Definition: The normal quantile plot compares each observation in the sample with its quantile expected from the standard normal distribution. Points should fall roughly along a straight line if the data come from a normal distribution.
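x <- sort(rnorm(20))               # (1) sort the 20 measurements
p <- (1:20)/21                     # (2) proportion i/(n+1) for each sorted measurement
q <- qnorm(p, lower.tail = TRUE)   # (3) standard normal quantile for each proportion
plot(q ~ x, xlab="Measurements", ylab="Normal quantiles")   # (4) plot quantiles against measurements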
Fast way (note: axes are flipped by default!)
qqnorm(x, datax = TRUE)
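The manual construction and the fast way do not use exactly the same probabilities. The manual version uses i/(n + 1), while `qqnorm` takes its probabilities from `ppoints()`, which for n = 20 gives (i - 0.5)/n. The sketch below compares the two and adds a reference line with `qqline`. The simulated data and the seed are assumptions made only for illustration.

```r
set.seed(1)                  # assumption: seed chosen only for reproducibility
n <- 20
x <- sort(rnorm(n))          # simulated measurements, already sorted

q_manual  <- qnorm((1:n) / (n + 1))       # quantiles used in the manual plot
qq        <- qqnorm(x, plot.it = FALSE)   # compute the plot coordinates without drawing
q_builtin <- qq$x                         # theoretical quantiles chosen by qqnorm

# Largest disagreement between the two sets of theoretical quantiles
max(abs(q_manual - q_builtin))

# Fast way, with a reference line through the first and third quartiles
qqnorm(x, datax = TRUE)
qqline(x, datax = TRUE)
```

The disagreement is largest in the tails, so the two plots can look slightly different at the extremes while leading to the same interpretation.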
Question: Are marine reserves effective in preserving marine wildlife?
Experimental design
Halpern (2003) matched each of 32 marine reserves to a control location, which was either the site of the reserve before it became protected or a similar unprotected site nearby. They then evaluated the “biomass ratio,” which is the ratio of the total mass of all marine plants and animals per unit area in the protected area to that in the matched unprotected area.
Discuss: Observational or experimental? Paired or unpaired? Interpret response measure in terms of effect of protection.
Answer: Observational. Paired (matching). Biomass ratio = 1 (no effect); > 1 (beneficial effect); < 1 (detrimental effect).
Practice Problem #4: Interpret the following normal quantile plots.
Definition: A Shapiro-Wilk test evaluates the goodness of fit of a normal distribution to a set of data randomly sampled from a population.
\( H_{0} \): The data are sampled from a population having a normal distribution.
\( H_{A} \): The data are sampled from a population not having a normal distribution.
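To get a feel for how the test behaves, it can be run on simulated samples whose distribution is known. The sketch below is only an illustration: the sample size of 32 mirrors the marine reserve example, and the distributions and seed are assumptions.

```r
set.seed(2)                      # assumption: seed chosen only for reproducibility

normal_sample <- rnorm(32)       # drawn from a normal distribution (H0 true)
skewed_sample <- rexp(32)        # drawn from a right-skewed exponential (H0 false)

sw_normal <- shapiro.test(normal_sample)
sw_skewed <- shapiro.test(skewed_sample)

# P-values for the two samples; a small value is evidence against normality
c(normal = sw_normal$p.value, skewed = sw_skewed$p.value)
```

Rerunning with different seeds shows how variable the result can be at this sample size.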
Cautions:
marine <- read.csv("/Users/mdlama/Dropbox/Work/Teaching/College of William and Mary/Fall 2018/Datasets/chapter13/chap13e1MarineReserve.csv")
hist(marine$biomassRatio)
shapiro.test(marine$biomassRatio)
Shapiro-Wilk normality test
data: marine$biomassRatio
W = 0.81751, p-value = 8.851e-05
Conclusion: Use a combination of graphical methods, formal testing, and common sense.
Definition: A statistical procedure is robust if the answer it gives is not sensitive to violations of the assumptions of the method.
Main takeaway point: Robustness must be judged case by case; it depends on the statistical test and the data (see book for discussion).
Definition: A data transformation changes each measurement by the same mathematical formula.
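As a small illustration of the definition, the sketch below applies the natural log to every value of a simulated right-skewed sample. The data are made up (log-normal by construction), so the transformed values are normal here by design; real data will not be this tidy.

```r
set.seed(3)                            # assumption: seed chosen only for reproducibility

z <- rnorm(32, mean = 0.5, sd = 0.4)   # normal values on the log scale
y <- exp(z)                            # simulated right-skewed measurements

y_log <- log(y)                        # the same formula applied to each measurement

hist(y,     main = "Original scale")
hist(y_log, main = "Log scale")
```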
Common transformations:
Other transformations:
Hypothesis testing
shapiro.test(log(marine$biomassRatio))
Shapiro-Wilk normality test
data: log(marine$biomassRatio)
W = 0.93795, p-value = 0.06551
hist(log(marine$biomassRatio))
Original statistical hypotheses:
\( H_{0} \): The mean of the biomass ratio of marine reserves is one (\( \mu = 1 \))
\( H_{A} \): The mean of the biomass ratio of marine reserves is not one (\( \mu \neq 1 \))
Transformed statistical hypotheses:
\( H_{0} \): The mean of the log biomass ratio of marine reserves is zero (\( \mu^{\prime} = 0 \))
\( H_{A} \): The mean of the log biomass ratio of marine reserves is not zero (\( \mu^{\prime} \neq 0 \))
t.test(log(marine$biomassRatio), mu=0)
One Sample t-test
data: log(marine$biomassRatio)
t = 7.3968, df = 31, p-value = 2.494e-08
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.3470180 0.6112365
sample estimates:
mean of x
0.4791272
Estimation
The 95% confidence interval for the log transformed data is
\[ 0.347 < \mu^{\prime} < 0.611. \]
Back-transforming the limits gives a 95% confidence interval for the geometric mean of the untransformed data:
\[ e^{0.347} < \mathrm{geometric \ mean} < e^{0.611}, \]
or
\[ 1.41 < \mathrm{geometric \ mean} < 1.84. \]
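The back-transformation can be checked directly in R. This sketch copies the interval limits and the sample mean from the `t.test` output above; the variable names are chosen here for illustration.

```r
# Values copied from the t.test output on the log scale
ci_log   <- c(lower = 0.3470180, upper = 0.6112365)
mean_log <- 0.4791272

# exp() undoes the natural log, giving results on the original (ratio) scale
ci_geo   <- exp(ci_log)      # 95% CI for the geometric mean
geo_mean <- exp(mean_log)    # estimated geometric mean

round(ci_geo, 2)             # lower 1.41, upper 1.84
round(geo_mean, 2)
```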
Discuss: Conclusion?
Definition: A nonparametric method makes fewer assumptions than standard parametric methods do about the distribution of the variables.
Property: Nonparametric methods are usually based on the ranks of the data points (medians, quartiles, etc.)
Property: Nonparametric tests are typically less powerful than parametric tests.
Definition: The sign test compares the median of a sample to a constant specified in the null hypothesis. It makes no assumptions about the distribution of the measurements in the population.
Definition: The Mann-Whitney \( U \)-test compares the distributions of two groups. It does not require as many assumptions as the two-sample \( t \)-test.
Algorithm:
The sign test has very little power. If \( n \leq 5 \), the sign test cannot be used: even the most extreme possible outcome gives a two-sided \( P \)-value above 0.05.
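The limit at n = 5 follows from the binomial distribution. The most extreme outcome is that every observation falls on the same side of the hypothesized median, and its two-sided P-value is 2 × 0.5^n. A quick sketch (the range of n is chosen for illustration):

```r
n <- 3:8

# Two-sided P-value when all n observations fall on one side of the median
min_p <- sapply(n, function(k) binom.test(0, k, p = 0.5)$p.value)

data.frame(n = n, min_p = min_p, can_reject = min_p < 0.05)
```

For n = 5 the smallest attainable P-value is 0.0625, so the null hypothesis can never be rejected at the 0.05 level; n = 6 is the first sample size where rejection is possible.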
Assignment problem #25
Researchers have observed that rainforest areas next to clear-cuts (less than 100 meters away) have a reduced tree biomass compared to rainforest areas far from clear-cuts. To go further, Laurance et al. (1997) tested whether rainforest areas more distant from the clear-cuts were also affected. They compiled data on the biomass change after clear-cutting (in tons/hectare/year) for 36 rainforest areas between 100m and several kilometers from clear-cuts.
Look at data
hist(clearcuts$biomassChange)
hist(exp(clearcuts$biomassChange), main="Exponential transformation")
\( H_{0} \): The median change in biomass is zero.
\( H_{A} \): The median change in biomass is not zero.
# Any biomass equal to zero?
sum(clearcuts$biomassChange == 0)
[1] 0
# How many plots have positive change in biomass?
(X <- sum(clearcuts$biomassChange > 0))
[1] 21
# Perform binomial test
binom.test(X, n=36, p=0.5)
Exact binomial test
data: X and 36
number of successes = 21, number of trials = 36, p-value = 0.405
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.4075652 0.7448590
sample estimates:
probability of success
0.5833333
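The steps above can be wrapped in a small helper. `sign_test` is a hypothetical function written for this sketch, not part of base R; it drops values equal to the hypothesized median, counts the values above it, and hands the count to `binom.test`. The check uses a stand-in vector with 21 positive and 15 negative values, matching the counts in the example rather than the actual data.

```r
# Hypothetical helper: two-sided sign test of H0: median = m0
sign_test <- function(x, m0 = 0) {
  d <- x - m0
  d <- d[d != 0]                 # observations equal to m0 carry no sign
  n_above <- sum(d > 0)          # number of observations above m0
  binom.test(n_above, n = length(d), p = 0.5)
}

# Stand-in values with the same signs as in the example (not the real data)
stand_in <- c(rep(1, 21), rep(-1, 15))
res <- sign_test(stand_in, m0 = 0)
res$p.value
```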
Assignment Problem #32
T and B lymphocytes are normal components of the immune system, but in multiple sclerosis they become autoreactive and attack the central nervous system. What triggers the autoimmune process? One hypothesis is that the disease is initiated by environmental factors, especially microbial infection. However, recent work by Berer et al. (2011) on the mouse model of the disease suggests that the autoimmune process is triggered by nonpathogenic microbes living in the gut.
They compared the onset of autoimmune encephalomyelitis in two treatment groups of mice from a strain that carries transgenic human CD4\( ^{+} \) T cells, which initiate the disease. One group (GF) was kept free of nonpathogenic gut microbes and all pathogens. The other (SPF) was only pathogen-free and served as controls. They measured the percentage of T cells producing the molecule interleukin-17 in tissue samples from 16 mice in the two groups.
Look at the data
  treatment percentInterleukin17
1       SPF                18.87
2       SPF                15.65
3       SPF                13.45
4       SPF                12.95
5       SPF                 6.01
6       SPF                 5.84
Discuss: Are these data in tidy or messy format?
Answer: Tidy
Look at data
Discuss: Discuss the data with respect to meeting assumptions of statistical tests.
Look at data
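library(dplyr)     # provides the %>% pipe
library(ggplot2)   # plotting

# mydata: data frame with columns treatment and percentInterleukin17
mydata %>%
ggplot(aes(x = percentInterleukin17)) +
geom_histogram(binwidth=4) +        # bins 4 percentage points wide
facet_grid(treatment ~ .) +         # one panel per treatment group
xlab("Percent interleukin-17") +
ylab("Frequency")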
Mann-Whitney \( U \)-test (Wilcoxon rank-sum test)
\( H_{0} \): The distribution of interleukin-17 is the same in the two groups.
\( H_{A} \): The distribution of interleukin-17 is NOT the same in the two groups.
wilcox.test(percentInterleukin17 ~ treatment, data = mydata)
Wilcoxon rank sum exact test
data: percentInterleukin17 by treatment
W = 6, p-value = 0.004662
alternative hypothesis: true location shift is not equal to 0
Conclusion?
The Mann-Whitney \( U \)-test tests if the distributions are the same.
If the distributions of the two groups have the same shape (same variance and skew), then the Mann-Whitney \( U \)-test can be used to compare the locations (means or medians) of the two groups (see Example 13.5).
Because this same-shape requirement is often overlooked, the test is frequently misused in the literature.
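The test works on ranks rather than on the measurements themselves. As a rough sketch on simulated data (group sizes, means, and seed are assumptions, not the interleukin-17 data), the `W` statistic reported by `wilcox.test` is the rank sum of the first group minus its smallest possible value, n1(n1 + 1)/2:

```r
set.seed(7)                    # assumption: seed chosen only for reproducibility

g1 <- rnorm(8, mean = 0)       # simulated group 1
g2 <- rnorm(8, mean = 1)       # simulated group 2, shifted upward

# Rank all 16 values together, then sum the ranks belonging to group 1
r  <- rank(c(g1, g2))
n1 <- length(g1)
W_manual <- sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2

# Statistic reported by R's implementation
W_builtin <- unname(wilcox.test(g1, g2)$statistic)

c(manual = W_manual, builtin = W_builtin)
```

Because only the ranks enter the calculation, any transformation that preserves the ordering of the data (such as the log) leaves `W` unchanged.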