Stats 155 Class Notes 2012-12-07

Quote for the Day:

“Theorie ist, wenn man alles weiss, aber nichts funktioniert. Praxis ist, wenn alles funktioniert, aber niemand weiss warum. Hier ist Theorie und Praxis vereint: nichts funktioniert… und niemand weiss wieso!” — Albert Einstein

Back to work …

For later use:

fetchData("survey-processing.R")

## Retrieving from http://www.mosaic-web.org/go/datasets/survey-processing.R

## [1] TRUE

fetchData("simulate.r")

## Retrieving from http://www.mosaic-web.org/go/datasets/simulate.r

## [1] TRUE

The survey processing software includes fixCheckbox and LikertToQuant. We'll work with these a bit later in the class.

A Scientific Question

In his graduate work at the University of Fairbanks, Alaska, Mike Anderson studied nitrogen fixing by bacteria.

He writes:

My dissertation research centers on symbiotic interactions between alder plants (Alnus spp.) and nitrogen-fixing Frankia bacteria on the 100-mile-wide floodplain of the Tanana River near Fairbanks. In this area the symbiosis with Frankia allows alder to colonize newly-formed river terraces, which are very low in soil nitrogen (N). The competitive edge provided to alder by Frankia during this process is short-lived, however, because the alders quickly enhance the availability of N in the soil, which helps other plant species colonize these sites. Large changes in the community and ecosystem follow over the next ~150 years, which changes the environmental context of the relationship between alder and Frankia. How this relationship responds to these changes is the subject of my dissertation research.

One question in his research is how genetic variation in Frankia influences the amount of of nitrogen in the soil. With great effort, he collected data, which we have in the alder.csv data set.

alder = fetchData("alder.csv")

## Retrieving from http://www.mosaic-web.org/go/datasets/alder.csv

Of particular interest to us today are these variables:

SNF: specific nitrogen fixation
RF: the genetic characterization of the bacteria by restriction fragmentation (thus, “RF”)
ONECM: the temperature measured at 1cm soil depth
FIVECM: the temperature measured at 5cm soid depth
PERH20: the percentage of the soil that's water
SITE: an identifier for the site location
LAND: the type of the site
SAMPPER: the time in the season when the information was collected. A specific Julian calendar day is given by JULDAY

Let's do a simple test of whether nitrogen fixation is related to the genetic classification of the bacteria:

anova(lm(SNF ~ RF, data = alder))

## Analysis of Variance Table
## 
## Response: SNF
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## RF          5  23646    4729    4.84 0.00037 ***
## Residuals 165 161351     978                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Questions:

How many cases were there?
How many levels of the genetic classification were there?

It's always sensible to look at the distribution of your data. For instance:

bwplot(SNF ~ RF, data = alder)

plot of chunk unnamed-chunk-5

I'm a little bit concerned here about the outliers. Maybe they are skewing the results. Perhaps a single point is being overly influential.

Easy to find out: We'll construct the sampling distribution of F by randomization:

s = do(1000) * anova(lm(SNF ~ shuffle(RF), data = alder))[1, 4]
tally(~result > 4.8, data = s)

## 
##  TRUE FALSE Total 
##     2   998  1000

This is a similar result to what we got parametrically, that is, using the F distribution from theory.

When in doubt about whether a result is being influenced by outliers, you can do the parametric test on the rank of the data.

ACTIVITY:

Show that the rank has a uniform distribution
Make a boxplot and a conditional density plot of the rank SNF

Here's the result on the rank:

anova(lm(rank(SNF) ~ RF + PERH20, data = alder))

## Error: object 'PERH20' not found

Is this p-value right? We can check it by randomization:

s = do(1000) * anova(lm(rank(SNF) ~ shuffle(RF), data = alder))[1, 4]
tally(~result > 1.9538, data = s)

## 
##  TRUE FALSE Total 
##    80   920  1000

Very close!

QUESTION: How should I decide if the result is close to 0.087?

There are other transformations that can be used. Economists often use the logarithm, which makes sense for prices. Biologists sometimes use logs or square roots, and for proportions they use arc-sines. The rank is a good general-purpose solution. It's really only an issue if there are outliers.

Let's stick with the rank and see if we can improve the p-value a bit by including covariates. Maybe water and soil temperature have an effect?

anova(lm(rank(SNF) ~ RF + PERH2O, data = alder))

## Analysis of Variance Table
## 
## Response: rank(SNF)
##            Df Sum Sq Mean Sq F value Pr(>F)  
## RF          5  30684    6137    2.01  0.079 .
## PERH2O      1  19047   19047    6.23  0.013 *
## Residuals 189 577714    3057                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Water is clearly important! But it didn't eat up so much variance that it makes RF look much better.

anova(lm(rank(SNF) ~ RF + ONECM, data = alder))

## Analysis of Variance Table
## 
## Response: rank(SNF)
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## RF          5  30684    6137    2.29   0.048 *  
## ONECM       1  89576   89576   33.38 3.1e-08 ***
## Residuals 189 507185    2684                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Same story.

anova(lm(rank(SNF) ~ RF + FIVECM, data = alder))

## Analysis of Variance Table
## 
## Response: rank(SNF)
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## RF          5  30684    6137    2.11   0.066 .  
## FIVECM      1  47871   47871   16.48 7.2e-05 ***
## Residuals 189 548891    2904                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Why not try both temperatures? Compare these two ANOVA reports

anova(lm(rank(SNF) ~ RF + ONECM + FIVECM, data = alder))

## Analysis of Variance Table
## 
## Response: rank(SNF)
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## RF          5  30684    6137    2.31   0.046 *  
## ONECM       1  89576   89576   33.73 2.7e-08 ***
## FIVECM      1   7945    7945    2.99   0.085 .  
## Residuals 188 499241    2656                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(lm(rank(SNF) ~ RF + FIVECM + ONECM, data = alder))

## Analysis of Variance Table
## 
## Response: rank(SNF)
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## RF          5  30684    6137    2.31   0.046 *  
## FIVECM      1  47871   47871   18.03 3.4e-05 ***
## ONECM       1  49650   49650   18.70 2.5e-05 ***
## Residuals 188 499241    2656                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Maybe the time in the season?

anova(lm(rank(SNF) ~ RF + ONECM + FIVECM + SAMPPER, data = alder))

## Analysis of Variance Table
## 
## Response: rank(SNF)
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## RF          5  30684    6137    3.49  0.0049 ** 
## ONECM       1  89576   89576   50.90 2.1e-11 ***
## FIVECM      1   7945    7945    4.51  0.0349 *  
## SAMPPER     2 171899   85949   48.84 < 2e-16 ***
## Residuals 186 327342    1760                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(lm(rank(SNF) ~ RF + SAMPPER + FIVECM + ONECM, data = alder))

## Analysis of Variance Table
## 
## Response: rank(SNF)
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## RF          5  30684    6137    3.49  0.0049 ** 
## SAMPPER     2 205613  102807   58.42 < 2e-16 ***
## FIVECM      1  55868   55868   31.75 6.4e-08 ***
## ONECM       1   7937    7937    4.51  0.0350 *  
## Residuals 186 327342    1760                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And the water? Maybe the time in the season?

anova(lm(rank(SNF) ~ RF + PERH2O + ONECM + FIVECM + SAMPPER, data = alder))

## Analysis of Variance Table
## 
## Response: rank(SNF)
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## RF          5  30684    6137    3.47  0.0051 ** 
## PERH2O      1  19047   19047   10.76  0.0012 ** 
## ONECM       1  84166   84166   47.57 8.2e-11 ***
## FIVECM      1  14420   14420    8.15  0.0048 ** 
## SAMPPER     2 151789   75894   42.89 5.0e-16 ***
## Residuals 185 327339    1769                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(lm(rank(SNF) ~ RF + SAMPPER + FIVECM + ONECM + PERH2O, data = alder))

## Analysis of Variance Table
## 
## Response: rank(SNF)
##            Df Sum Sq Mean Sq F value Pr(>F)    
## RF          5  30684    6137    3.47 0.0051 ** 
## SAMPPER     2 205613  102807   58.10 <2e-16 ***
## FIVECM      1  55868   55868   31.57  7e-08 ***
## ONECM       1   7937    7937    4.49 0.0355 *  
## PERH2O      1      3       3    0.00 0.9667    
## Residuals 185 327339    1769                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Which is it? Is the water important or not?

Alas, this same problem appears with RF itself.

anova(lm(rank(SNF) ~ RF + PERH2O + ONECM + FIVECM + SAMPPER, data = alder))

## Analysis of Variance Table
## 
## Response: rank(SNF)
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## RF          5  30684    6137    3.47  0.0051 ** 
## PERH2O      1  19047   19047   10.76  0.0012 ** 
## ONECM       1  84166   84166   47.57 8.2e-11 ***
## FIVECM      1  14420   14420    8.15  0.0048 ** 
## SAMPPER     2 151789   75894   42.89 5.0e-16 ***
## Residuals 185 327339    1769                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(lm(rank(SNF) ~ PERH2O + ONECM + FIVECM + SAMPPER + RF, data = alder))

## Analysis of Variance Table
## 
## Response: rank(SNF)
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## PERH2O      1  17320   17320    9.79   0.002 ** 
## ONECM       1  81807   81807   46.23 1.4e-10 ***
## FIVECM      1  33114   33114   18.71 2.5e-05 ***
## SAMPPER     2 160314   80157   45.30 < 2e-16 ***
## RF          5   7551    1510    0.85   0.514    
## Residuals 185 327339    1769                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

DISCUSSION: So is `RF` important or not?

There's evidence to support a claim that RF is related to nitrogen fixing.
There's evidence to support a claim that RF is not directly related to nitrogen fixing.
In everyday language, what is it about the biological/ecological/geological situation that creates the ambiguity? [That RF depends on SITE and SITE is strongly related to water, temperature, …]
Construct a model or other statistical analysis that supports your hypothesis in (3).

There's probably no compelling way to resolve the ambiguity about RF from these data. You can construct a theory in the form of a model of how the system works and then construct your model based on the theory, blocking backdoor pathways as appropriate. But, as Einstein said:

A theory is something nobody believes, except the person who made it.

Much more convincing to construct an experiment. How might this be done with the alder situation?

Not clear!
Try to innoculate the area with one bacterium or another?
Change the water? The soil temperature?
Given that we have reason to believe that location, water, temperature, etc. are important, we will want to arrange the different bacteria so that the space spanned by the variable is as orthogonal as possible to the quantities we measure.

Review blocking in this context: Divide up the plots into groups. Within each group, all the plots are similar. Then randomly assign the bacteria (or water, or soil temperature) within each block.

To give the whole of Einstein's quote: A theory is something nobody believes, except the person who made it. An experiment is something everybody believes, except the person who made it.

The Aspirin Experiment

First, a preliminary observational study of 1000 people.

Obs = run.sim(aspirin, 1000)
coef(summary(glm(stroke == "Y" ~ mgPerDay, data = Obs, family = "binomial")))

##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -0.884546   0.123245 -7.1771 7.118e-13
## mgPerDay    -0.001713   0.002184 -0.7844 4.328e-01

coef(summary(glm(stroke == "Y" ~ mgPerDay + sick, data = Obs, family = "binomial")))

##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -1.54465   0.167213  -9.238 2.520e-20
## mgPerDay    -0.03706   0.004535  -8.172 3.035e-16
## sick         0.69018   0.045421  15.195 3.800e-52

Without including the covariate sick, we can't see any effect of aspirin. With sick in place, it looks like aspirin is somewhat protective.

QUESTION: How does aspirin change the odds of having a stroke? [Ans. : Pick a dosage for aspirin, multiply by the coefficient, and undo the log.]

But we're going to run into problems with getting people to accept the results of this study:

They will argue that some other variable might be at work.
They will challenge how we measure sick in practice.
They will say that the study is overly simplistic.

Questions we can answer without problem:

Is there evidence for a relationship?
Is the magnitude of the relationship substantial?

But to convince people, let's do an experiment.

Step 1: Decide what dose to give people.
Step 2: Run the experiment:

Ex = run.sim(aspirin, 1000, mgPerDay = c(0, 100))
coef(summary(glm(stroke == "Y" ~ mgPerDay, data = Ex, family = "binomial")))

##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept)  -0.4222   0.091443  -4.617 3.900e-06
## mgPerDay     -0.0171   0.001717  -9.963 2.223e-23

coef(summary(glm(stroke == "Y" ~ mgPerDay + sick, data = Ex, family = "binomial")))

##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -1.69249   0.139337 -12.147 5.966e-34
## mgPerDay    -0.03743   0.003806  -9.836 7.894e-23
## sick         0.64702   0.048453  13.354 1.130e-40

The coefficient changes. In a linear model, it wouldn't change this much. But nonlinear models are more complicated.

Is there evidence for an interaction?

coef(summary(glm(stroke == "Y" ~ mgPerDay * sick, data = Ex, family = "binomial")))

##                 Estimate Std. Error   z value  Pr(>|z|)
## (Intercept)   -1.691e+00  0.1478998 -11.43529 2.786e-30
## mgPerDay      -3.755e-02  0.0061442  -6.11133 9.881e-10
## sick           6.460e-01  0.0629337  10.26530 1.010e-24
## mgPerDay:sick  2.411e-05  0.0009865   0.02444 9.805e-01

Nah.

The experiment is pretty clean. We could have used many fewer people: smaller \( n \).

Intent to Treat

In reality, we're not going to have perfect compliance with our instructions. Some people given a placebo will take aspirin anyways. Some people won't take their aspirin.

To simulate this, we'll tell the program to add in an influence without severing the connection from sick to mgPerDay.

experiment.size = 200
influence = resample(c(30, 70), experiment.size)
Ex2 = run.sim(aspirin, experiment.size, mgPerDay = influence, inject = TRUE)

Show that with inject=TRUE, the values of mgPerDay reflect both the input and the variable sick

Now build the model:

coef(summary(glm(stroke == "Y" ~ mgPerDay, data = Ex2, family = "binomial")))

##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -1.736244    0.51065 -3.4001 0.0006737
## mgPerDay     0.001939    0.00492  0.3941 0.6935051

coef(summary(glm(stroke == "Y" ~ mgPerDay + sick, data = Ex2, family = "binomial")))

##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -2.34413   0.802215  -2.922 3.477e-03
## mgPerDay    -0.03301   0.009687  -3.408 6.549e-04
## sick         0.79870   0.128623   6.210 5.312e-10

The result is still ambiguous. (A larger experiment would fix this)

Intent to treat means to use the assigned values rather than the measured values.

coef(summary(glm(stroke == "Y" ~ influence, data = Ex2, family = "binomial")))

##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.03216   0.486610  -2.121  0.03391
## influence   -0.01053   0.009388  -1.122  0.26200

coef(summary(glm(stroke == "Y" ~ influence + sick, data = Ex2, family = "binomial")))

##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -3.22782    0.82419  -3.916 8.990e-05
## influence   -0.02909    0.01497  -1.944 5.188e-02
## sick         0.62896    0.09685   6.494 8.349e-11

This idea of using intent rather than the actual treatment is counter-intuitive. But it's easily understood in terms of the causal diagrams. There's no back-door pathway from intent to stroke.