Experimental Design

M. Drew LaMar
October 23, 2020

“There are no null results; there are only insufficiently clever choices of \( H_0 \).”

- @richarddmorey

Use a nonparametric method

Ignore the violations of assumptions
Transform the data
Use a nonparametric method
Use a permutation test (computer-intensive methods)

Definition: A nonparametric method makes fewer assumptions than standard parametric methods do about the distribution of the variables.

Property: Nonparametric methods are usually based on the ranks of the data points (medians, quartiles, etc.)

Property: Nonparametric tests are typically less powerful than parametric tests.

Use a nonparametric method

A nonparametric alternative to the one-sample \( t \)-test is the sign test.

Definition: The sign test compares the median of a sample to a constant specified in the null hypothesis. It makes no assumptions about the distribution of the measurements in the population.

A nonparametric alternative to the two-sample \( t \)-test is the Mann-Whitney \( U \)-test.

Definition: The Mann-Whitney \( U \)-test compares the distributions of two groups. It does not require as many assumptions as the two-sample \( t \)-test.

Sign test: Binomial test in disguise

Algorithm:

First, state a null hypothesized median.
Label all measurements larger than this median with a “\( + \)”, and all measurements smaller than this median with a “\( - \)”.
Throw out any measurements exactly equal to the median (sample size is reduced by this amount)
Use binomial test with the test statistic the number of “\( + \)” values (or \( - \) values), comparing the result to the null proportion \( p_{0}=0.5 \).

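The sign test has very little power. If \( n \leq 5 \), the sign test cannot be used: even when all measurements fall on one side of the hypothesized median, the two-sided \( P \)-value is above 0.05.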
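The algorithm above can be sketched in a few lines of R. This is a minimal illustration, not a packaged routine: `sign_test` is a hypothetical helper name, and the measurements in `y` are made up.

```r
# Sign test as a binomial test (sketch; sign_test is a hypothetical helper)
sign_test <- function(x, median0 = 0) {
  x <- x[x != median0]          # throw out values exactly equal to the null median
  n_plus <- sum(x > median0)    # number of "+" labels
  binom.test(n_plus, n = length(x), p = 0.5)
}

# Made-up measurements, for illustration only
y <- c(-1.2, 0.4, 2.1, 3.3, 0, 1.8, 2.6, -0.5, 4.0, 1.1)
result <- sign_test(y, median0 = 0)
result$p.value

# Smallest possible two-sided P-value when n = 5 (all five values on one side)
min_p_n5 <- binom.test(5, n = 5, p = 0.5)$p.value
min_p_n5
```

One value in `y` equals the null median, so the test runs on 9 measurements, 7 of them labeled “+”.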

Example: Rainforests

Assignment problem #25

Researchers have observed that rainforest areas next to clear-cuts (less than 100 meters away) have a reduced tree biomass compared to rainforest areas far from clear-cuts. To go further, Laurance et al. (1997) tested whether rainforest areas more distant from the clear-cuts were also affected. They compiled data on the biomass change after clear-cutting (in tons/hectare/year) for 36 rainforest areas between 100m and several kilometers from clear-cuts.

Example: Rainforests

Look at data

ggplot(data = clearcuts) +
  geom_histogram(aes(x = biomassChange), 
                 color = "black", bins = 8)

[Figure: histogram of biomassChange]

Example: Transformations?

hist(exp(clearcuts$biomassChange))

[Figure: histogram of exp(biomassChange)]

Use sign test

\( H_{0} \): The median change in biomass is zero.
\( H_{A} \): The median change in biomass is not zero.

# Any biomass equal to zero?
sum(clearcuts$biomassChange == 0)

[1] 0

# How many plots have positive change in biomass?
(X <- sum(clearcuts$biomassChange > 0))

[1] 21

Use sign test

# Perform binomial test
binom.test(X, n=36, p=0.5)


    Exact binomial test

data:  X and 36
number of successes = 21, number of trials = 36, p-value = 0.405
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.4075652 0.7448590
sample estimates:
probability of success 
             0.5833333

Nonparametric two-sample t-test

A nonparametric alternative to the two-sample \( t \)-test is the Mann-Whitney \( U \)-test.

Definition: The Mann-Whitney \( U \)-test compares the distributions of two groups. It does not require as many assumptions as the two-sample \( t \)-test.

Example: Autoimmune diseases and gut microbes

Assignment Problem #32

T and B lymphocytes are normal components of the immune system, but in multiple sclerosis they become autoreactive and attack the central nervous system. What triggers the autoimmune process? One hypothesis is that the disease is initiated by environmental factors, especially microbial infection. However, recent work by Berer et al. (2011) on the mouse model of the disease suggests that the autoimmune process is triggered by nonpathogenic microbes living in the gut.

Example: Autoimmune diseases and gut microbes

They compared the onset of autoimmune encephalomyelitis in two treatment groups of mice from a strain that carries transgenic human CD4\( ^{+} \) T cells, which initiate the disease. One group (GF) was kept free of nonpathogenic gut microbes and all pathogens. The other (SPF) was only pathogen-free and served as the control. They measured the percentage of T cells producing the molecule interleukin-17 in tissue samples from 16 mice in the two groups.

Example: Autoimmune and gut microbes

Look at the data

  treatment percentInterleukin17
1       SPF                18.87
2       SPF                15.65
3       SPF                13.45
4       SPF                12.95
5       SPF                 6.01
6       SPF                 5.84

Discuss: Is this data in tidy or messy format?

Answer: Tidy

Example: Autoimmune and gut microbes

Look at data

[Figure: percentInterleukin17 by treatment]

Discuss: Do these data meet the assumptions of the statistical tests we have covered?

Example: Autoimmune and gut microbes

Look at data

[Figure: histograms of percent interleukin-17, one panel per treatment]

Example: Autoimmune and gut microbes

Look at data

mydata %>% 
  ggplot(aes(x = percentInterleukin17)) +
  geom_histogram(binwidth=4) +
  facet_grid(treatment ~ .) +
  xlab("Percent interleukin-17") + 
  ylab("Frequency")

Example: Autoimmune and gut microbes

Mann-Whitney \( U \)-test (Wilcoxon rank-sum test)

\( H_{0} \): The distribution of interleukin-17 is the same in the two groups.
\( H_{A} \): The distribution of interleukin-17 is NOT the same in the two groups.

wilcox.test(percentInterleukin17 ~ treatment, data = mydata)

Example: Autoimmune and gut microbes

Mann-Whitney \( U \)-test (Wilcoxon rank-sum test)


    Wilcoxon rank sum exact test

data:  percentInterleukin17 by treatment
W = 6, p-value = 0.004662
alternative hypothesis: true location shift is not equal to 0

Conclusion?

Assumptions of Mann-Whitney U-test

The Mann-Whitney \( U \)-test tests if the distributions are the same.

If the distributions of the two groups have the same shape (same variance and skew), then the Mann-Whitney \( U \)-test can be used to compare the locations (means or medians) of the two groups (see Example 13.5).

~~It is for this reason that this test gets misused a lot in the literature.~~
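To see that the test works on ranks rather than raw values, the statistic `W` reported by `wilcox.test` can be rebuilt by hand. This is a sketch on made-up numbers (`x` and `y` are hypothetical groups with no ties), not the mouse data.

```r
# Rebuild the Wilcoxon rank-sum statistic W from ranks (hypothetical data, no ties)
x <- c(1.1, 2.3, 0.7, 3.5)        # hypothetical group 1
y <- c(4.2, 5.1, 2.9, 6.0, 3.8)   # hypothetical group 2

r  <- rank(c(x, y))               # rank all measurements together
n1 <- length(x)

# W = (sum of ranks in group 1) - n1 * (n1 + 1) / 2
W_manual <- sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2

# Compare with the value reported by R
W_r <- unname(wilcox.test(x, y)$statistic)
c(manual = W_manual, wilcox.test = W_r)
```

Replacing any value by another that keeps the same rank order leaves `W` unchanged.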

Between-individual variation

Goal: Understand variation

We want to distinguish between variation of interest and variation from other sources (again, increase signal-to-noise ratio).

“Whenever we carry out an experiment or observational study, we are either interested in measuring random variation, or (more often) trying to find ways to remove or reduce the effects of random variation, so that the effects that we care about can be seen more clearly.”

Replication

Definition: Replication involves making the same manipulations to and taking the same measurements on a number of different, independent experimental subjects.

We are essentially talking about sample size here, but there is more to it than that due to the independence issue.

http://www.zoology.ubc.ca/~whitlock/kingfisher/SamplingNormal.htm

Example: Does sex have an effect on human height?

Measure height in 10 married couples of the opposite sex. Are we safe in restricting our sample to married couples?

Pseudoreplication

Definition: Pseudoreplicates are dependent measures.

Definition: Replicate measures must be independent of each other, i.e. a measurement made on one individual should not provide any useful information about that factor on another individual.

Definition: Pseudoreplication occurs if we analyze pseudoreplicates as if they were replicates.

When we pseudoreplicate, we are making a false claim about the amount of replication.

Pseudoreplication

Both accuracy and precision are affected by pseudoreplication!

Accuracy: Pseudoreplication changes our question from general and interesting to more specific and less interesting.

Precision: Pseudoreplication overestimates the precision, because dependent measures vary less than independent ones and so underestimate the true variation.

Pseudoreplication - Example

Question: Do blue tit nestlings raised in nest boxes suffer more from external parasites than those raised in natural cavities?

Experimental Design: Investigate the four nestlings in a particular nest box and count the number of parasites on each.

Discuss: Why are these pseudoreplicates? Explain how they affect precision and accuracy.

Answer #1: This design gives you information on parasite load only for birds in this particular nest box.

Answer #2: Nestlings will be similar in many ways due to sharing nest box, and thus variation of parasite load between nestlings in this box will be smaller than between all nestlings.
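A small simulation can make Answer #2 concrete. This is a sketch under invented assumptions: the number of boxes, the means, and the standard deviations are all made up, and parasite load is treated as a continuous index for simplicity.

```r
# Simulated parasite loads: nestlings share a box effect (all numbers are made up)
set.seed(1)
n_boxes   <- 50
n_per_box <- 4

box_effect <- rnorm(n_boxes, mean = 20, sd = 5)      # box-to-box variation
box  <- rep(seq_len(n_boxes), each = n_per_box)
load <- rnorm(n_boxes * n_per_box,
              mean = rep(box_effect, each = n_per_box),
              sd = 1)                                # nestling-to-nestling variation

# Spread among the four nestlings in one box (the pseudoreplicated design)
sd_one_box <- sd(load[box == 1])

# Spread among box means (boxes are the independent replicates)
sd_box_means <- sd(tapply(load, box, mean))

c(one_box = sd_one_box, among_boxes = sd_box_means)
```

The four nestlings from one box vary far less than boxes do, so treating them as replicates makes the estimate look more precise than it is.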

Common sources of pseudoreplication

Pseudoreplication is a biological and experimental design issue, not a statistical issue. You cannot tell from the data alone whether they are pseudoreplicated.

Common sources of pseudoreplication:

A shared enclosure
- Environment affects all individuals similarly
- Individuals can affect other individuals
The common environment
Relatedness
Pseudoreplicated stimulus
Measurements over time
Species comparisons

Solutions to pseudoreplication

Make sure you are…

accounting for all possible variation.
controlling for possible confounding variables.

Experimental studies: Make sure individuals differ systematically only in the explanatory variable(s) of interest.

Observational studies: Be aware of confounding variables.

Random sampling, or randomization, can solve many of these problems.

Blocking is another technique to address pseudoreplication in experimental studies.

Matching is analogous to blocking for observational studies.

Examples, examples, examples

Birdsongs and attractiveness

Question: How do we measure relationship between male birdsongs and attractiveness to females?

Experimental Design: Record the complex song of one male and the simple song of another male, and then play these same two songs to each of 40 different females. Compute a confidence interval for the mean attractiveness of the two male songs.

Discuss: What is wrong with this design so far?

Answer: Each measure of female choice is a pseudoreplicate (\( n=40 \)).

Discuss: What can we do to correct for this pseudoreplication?

Answer: Record songs of 40 males with complex songs, and 40 separate males with simple songs. Each female should listen to a unique pair of songs, one simple and one complex. Design can get even more complicated than this.

Discuss: What are examples of confounding variables in the pseudoreplicated case?

Examples, examples, examples

Blood sugar levels

Experimental Design: Phlebotomist takes 15 samples from each of 10 patients, yielding a total of 150 measurements.

Discuss: What is the replicate and sample size in this situation? Why?

Examples, examples, examples

Antibiotics and bacterial growth rates

Experimental Design: Two agar plates: one with antibiotic, one without. Spread bacteria on both plates, let them grow for 24 hours, then measure the diameter of 100 colonies on each plate.

Discuss: What is the replicate and sample size in this situation? Why?

What sample size should I use?

Three things:

Plan for precision (estimation)
Plan for power (hypothesis testing)
Plan for data loss

We'll use a two-sample \( t \)-test as the example in this section.

Plan for precision

We would like to compute a 95% confidence interval for \( \mu_{1}-\mu_{2} \).

\[ \bar{Y}_{1}-\bar{Y}_{2} \pm \mathrm{margin \ of \ error}, \]

where “margin of error” is the half-width of the 95% confidence interval.

In this case, the following formula is an approximation to the number of samples needed to achieve the desired margin of error (assuming balanced design, i.e. \( n_{1}=n_{2}=n \)):

\[ n \approx 8\left(\frac{\mathrm{margin \ of \ error}}{\sigma}\right)^{-2} \]
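The approximation is easy to wrap in a function. A minimal sketch: `n_for_precision` is a hypothetical helper name, and the inputs below are illustrative values, not from a real study.

```r
# Approximate n per group for a desired margin of error (sketch of the formula above)
n_for_precision <- function(margin, sigma) {
  # n ~ 8 * (margin / sigma)^(-2), rounded up to a whole subject
  ceiling(8 * (margin / sigma)^(-2))
}

# Illustrative values: margin of error equal to half a standard deviation
n_for_precision(margin = 0.5, sigma = 1)
```

Halving the margin of error quadruples the required sample size.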

Plan for power

Two-sample \( t \)-test:

\[ H_{0}: \mu_{1} - \mu_{2} = 0. \] \[ H_{A}: \mu_{1} - \mu_{2} \neq 0. \]

A conventional power to aim for is 0.80, i.e. if \( H_{0} \) is false and the true difference is \( D \), we aim to reject \( H_{0} \) in 80% of experiments.

Assuming a significance level of 0.05, a quick approximation to the planned sample size \( n \) in each of two groups is

\[ n \approx 16\left(\frac{D}{\sigma}\right)^{-2}, \]

where \( D = |\mu_{1}-\mu_{2}| \) is the effect size.
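As a quick sketch, the rule of thumb can be coded directly. `n_for_power` is a hypothetical helper name, and the formula applies only to a significance level of 0.05 and a power of 0.80; the values of `D` and `sigma` below are illustrative.

```r
# Rule-of-thumb n per group for 80% power at alpha = 0.05 (sketch of the formula above)
n_for_power <- function(D, sigma) {
  16 * (D / sigma)^(-2)
}

# Illustrative values: a difference of 0.3 standard deviations
n_approx <- n_for_power(D = 0.3, sigma = 1)
n_approx
```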

The pwr package in R

library(pwr)

function	power calculations for
pwr.2p.test	two proportions (equal n)
pwr.2p2n.test	two proportions (unequal n)
pwr.anova.test	balanced one way ANOVA
pwr.chisq.test	chi-square test
pwr.f2.test	general linear model
pwr.p.test	proportion (one sample)
pwr.r.test	correlation
pwr.t.test	t-tests (one sample, 2 sample, paired)
pwr.t2n.test	t-test (two samples with unequal n)

Two-sample t-test example

Two-sample \( t \)-test with significance level 0.05, 80% power, and relative effect size \( d = \frac{|\mu_{1}-\mu_{2}|}{\sigma} = 0.3 \).

pwr.t.test(d=0.3, power=0.8, type="two.sample")


     Two-sample t test power calculation 

              n = 175.3847
              d = 0.3
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
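As a cross-check that does not need the pwr package, base R's `power.t.test` (in the stats package) solves the same problem. A sketch assuming \( \sigma = 1 \), so that `delta` equals the relative effect size \( d = 0.3 \):

```r
# Same calculation with base R (stats::power.t.test); sd = 1 so delta is d
res <- power.t.test(delta = 0.3, sd = 1, sig.level = 0.05, power = 0.8,
                    type = "two.sample", alternative = "two.sided")
res$n

# Subjects come in whole numbers, so round up
n_per_group <- ceiling(res$n)
n_per_group
```

Rounding up rather than to the nearest integer keeps the planned power at or above the target.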

Randomization

Definition: Proper randomization means that any individual experimental subject has the same chance as any other individual of finding itself in each experimental group, and of being prepared, set up, and measured in the same way.

Improper randomization can lead to the introduction of confounding variables in the experimental protocol itself (experimental artifacts).
Randomization breaks the association between possible confounding variables and the explanatory variable.
Randomization allows the causal relationship between the explanatory and response variables to be assessed.

Randomization does not eliminate variation due to confounding variables, only their correlation with treatment.
Randomization ensures that variation due to confounding variables is similar between treatment groups, with any differences arising by chance alone.

Randomization Example

Question: Does a specific genetic modification to a tomato plant affect its growth rate?

Experimental Design: Place 50 genetically modified plants, and 50 unmodified plants, into individual pots with compost, and then put them all into a growth chamber.

Discuss: Where can improper randomization appear in this example?

Answer: For example:
- Difference in compost quality.
- Difference in temperature across chamber.

Randomization Example

Let's look at temperature as a possible confounding variable:

The above randomization would not remove temperature difference across chamber, but simply remove correlation with treatment.

What if we would like to reduce the variation from temperature? We can try blocking.

Blocking Example

Our attempt to control for temperature:

Discuss: What’s right and wrong with this particular design?

Blocking Example

This particular blocking design is properly replicated and randomized.

The variation due to temperature in each chamber has been reduced, so that the difference between treatments becomes more apparent.

There was a systematic difference of temperature across the original chamber. We have now adjusted the design to systematically account for this difference.
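One way to set up a randomized block design in R is to shuffle treatment labels separately within each block. This is a sketch under assumed numbers: 4 blocks of 10 pots are hypothetical, and the labels `"GM"` and `"control"` stand in for modified and unmodified plants.

```r
# Randomize treatments within blocks (block and pot counts are hypothetical)
set.seed(42)
n_blocks       <- 4
pots_per_block <- 10

design <- data.frame(
  block    = rep(seq_len(n_blocks), each = pots_per_block),
  position = rep(seq_len(pots_per_block), times = n_blocks)
)

# Within each block, shuffle 5 "GM" and 5 "control" labels over the positions
design$treatment <- unlist(lapply(seq_len(n_blocks), function(b) {
  sample(rep(c("GM", "control"), each = pots_per_block / 2))
}))

# Every block contains both treatments in equal numbers
counts <- table(design$block, design$treatment)
counts
```

Because each block holds both treatments equally, differences between blocks (such as temperature) cannot be mistaken for a treatment effect.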

Match and adjust

What if you can't do experiments? Randomization does not apply here.

Two strategies are used to limit effects of confounding variables on a difference between treatments in a controlled observational study.

Definition: With matching, every individual in the treatment group is paired with a control individual having the same or closely similar values for the suspected confounding variable.

Definition: With adjustment, use a statistical method, such as analysis of covariance, to correct for differences between treatment and control groups in suspected confounding variables.
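Adjustment can be sketched with `lm`, fitting the response on both treatment and the suspected confounder. All data below are simulated under invented assumptions: the true treatment effect is set to 2, and treatment is made more likely when the confounder is high.

```r
# Adjusting for a confounder with lm (simulated data; true treatment effect = 2)
set.seed(7)
n <- 200
confounder <- rnorm(n)

# No randomization: treatment is more likely when the confounder is high
treated  <- rbinom(n, size = 1, prob = plogis(confounder))
response <- 2 * treated + 3 * confounder + rnorm(n)

# Naive comparison ignores the confounder
naive <- unname(coef(lm(response ~ treated))["treated"])

# Adjusted comparison includes the confounder as a covariate
adjusted <- unname(coef(lm(response ~ treated + confounder))["treated"])

c(naive = naive, adjusted = adjusted)
```

Adjustment only corrects for confounders that were measured and included in the model.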

Proper randomization

Assigning treatments to subjects (one possibility):

List all \( n \) subjects, one per row, in a spreadsheet.
Use the computer to give each subject a random number.
Assign treatment A to the subjects receiving the lowest numbers and treatment B to those receiving the highest numbers.
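The spreadsheet steps above translate directly to R. A minimal sketch with a hypothetical 20 subjects split evenly between the two treatments:

```r
# Random assignment of treatments (n = 20 subjects is a hypothetical example)
set.seed(123)
n <- 20
subjects <- data.frame(id = seq_len(n))       # one row per subject

# Give each subject a random number
subjects$random_number <- runif(n)

# Lowest half of the random numbers gets A, highest half gets B
subjects$treatment <- ifelse(rank(subjects$random_number) <= n / 2, "A", "B")

table(subjects$treatment)
```

Setting the seed makes the assignment reproducible, which helps when documenting the protocol.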

Randomization in time

Remember, randomization is important in all processes of the experiment, including preparation, setup, and measurement.

Randomize measurement of replicates in time:

Watching 50 hours of great tit courtship behaviour on video increases your ability to observe
After 10 hours of counting through a microscope, tiredness kicks in
Aging equipment

This shows time of measurement could be a confounding factor.
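A random measurement order can be drawn with `sample`. A sketch with 12 hypothetical replicates, where measuring all of treatment A first would tie treatment to time:

```r
# Randomize the order in which replicates are measured (12 hypothetical replicates)
set.seed(2020)
replicates <- data.frame(id = 1:12,
                         treatment = rep(c("A", "B"), each = 6))

# Draw a random permutation of 1..12 as the measurement order
replicates$measure_order <- sample(nrow(replicates))

# Measurement schedule, sorted by the randomized order
schedule <- replicates[order(replicates$measure_order), ]
schedule
```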

Additional Reading

Whitlock & Schluter, Interleaf 2: Pseudoreplication (pp. 115-116)
Whitlock & Schluter, Chapter 14: Designing experiments