“Theorie ist, wenn man alles weiss, aber nichts funktioniert. Praxis ist, wenn alles funktioniert, aber niemand weiss warum. Hier ist Theorie und Praxis vereint: nichts funktioniert… und niemand weiss wieso!” — Albert Einstein
For later use:
fetchData("survey-processing.R")
## Retrieving from http://www.mosaic-web.org/go/datasets/survey-processing.R
## [1] TRUE
fetchData("simulate.r")
## Retrieving from http://www.mosaic-web.org/go/datasets/simulate.r
## [1] TRUE
The survey processing software includes fixCheckbox and LikertToQuant.
We'll work with these a bit later in the class.
In his graduate work at the University of Fairbanks, Alaska, Mike Anderson studied nitrogen fixing by bacteria.
He writes:
My dissertation research centers on symbiotic interactions between alder plants (Alnus spp.) and nitrogen-fixing Frankia bacteria on the 100-mile-wide floodplain of the Tanana River near Fairbanks. In this area the symbiosis with Frankia allows alder to colonize newly-formed river terraces, which are very low in soil nitrogen (N). The competitive edge provided to alder by Frankia during this process is short-lived, however, because the alders quickly enhance the availability of N in the soil, which helps other plant species colonize these sites. Large changes in the community and ecosystem follow over the next ~150 years, which changes the environmental context of the relationship between alder and Frankia. How this relationship responds to these changes is the subject of my dissertation research.
One question in his research is how genetic variation in Frankia influences the amount of of nitrogen in the soil. With great effort, he collected data, which we have in the alder.csv data set.
alder = fetchData("alder.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/alder.csv
Of particular interest to us today are these variables:
SNF: specific nitrogen fixationRF: the genetic characterization of the bacteria by restriction fragmentation (thus, “RF”)ONECM: the temperature measured at 1cm soil depthFIVECM: the temperature measured at 5cm soid depthPERH20: the percentage of the soil that's waterSITE: an identifier for the site locationLAND: the type of the siteSAMPPER: the time in the season when the information was collected. A specific Julian calendar day is given by JULDAYLet's do a simple test of whether nitrogen fixation is related to the genetic classification of the bacteria:
anova(lm(SNF ~ RF, data = alder))
## Analysis of Variance Table
##
## Response: SNF
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 5 23646 4729 4.84 0.00037 ***
## Residuals 165 161351 978
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Questions:
It's always sensible to look at the distribution of your data. For instance:
bwplot(SNF ~ RF, data = alder)
I'm a little bit concerned here about the outliers. Maybe they are skewing the results. Perhaps a single point is being overly influential.
Easy to find out: We'll construct the sampling distribution of F by randomization:
s = do(1000) * anova(lm(SNF ~ shuffle(RF), data = alder))[1, 4]
tally(~result > 4.8, data = s)
##
## TRUE FALSE Total
## 2 998 1000
This is a similar result to what we got parametrically, that is, using the F distribution from theory.
When in doubt about whether a result is being influenced by outliers, you can do the parametric test on the rank of the data.
ACTIVITY:
Here's the result on the rank:
anova(lm(rank(SNF) ~ RF + PERH20, data = alder))
## Error: object 'PERH20' not found
Is this p-value right? We can check it by randomization:
s = do(1000) * anova(lm(rank(SNF) ~ shuffle(RF), data = alder))[1, 4]
tally(~result > 1.9538, data = s)
##
## TRUE FALSE Total
## 80 920 1000
Very close!
QUESTION: How should I decide if the result is close to 0.087?
There are other transformations that can be used. Economists often use the logarithm, which makes sense for prices. Biologists sometimes use logs or square roots, and for proportions they use arc-sines. The rank is a good general-purpose solution. It's really only an issue if there are outliers.
Let's stick with the rank and see if we can improve the p-value a bit by including covariates. Maybe water and soil temperature have an effect?
anova(lm(rank(SNF) ~ RF + PERH2O, data = alder))
## Analysis of Variance Table
##
## Response: rank(SNF)
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 5 30684 6137 2.01 0.079 .
## PERH2O 1 19047 19047 6.23 0.013 *
## Residuals 189 577714 3057
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Water is clearly important! But it didn't eat up so much variance that it makes RF look much better.
anova(lm(rank(SNF) ~ RF + ONECM, data = alder))
## Analysis of Variance Table
##
## Response: rank(SNF)
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 5 30684 6137 2.29 0.048 *
## ONECM 1 89576 89576 33.38 3.1e-08 ***
## Residuals 189 507185 2684
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Same story.
anova(lm(rank(SNF) ~ RF + FIVECM, data = alder))
## Analysis of Variance Table
##
## Response: rank(SNF)
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 5 30684 6137 2.11 0.066 .
## FIVECM 1 47871 47871 16.48 7.2e-05 ***
## Residuals 189 548891 2904
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Why not try both temperatures? Compare these two ANOVA reports
anova(lm(rank(SNF) ~ RF + ONECM + FIVECM, data = alder))
## Analysis of Variance Table
##
## Response: rank(SNF)
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 5 30684 6137 2.31 0.046 *
## ONECM 1 89576 89576 33.73 2.7e-08 ***
## FIVECM 1 7945 7945 2.99 0.085 .
## Residuals 188 499241 2656
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(rank(SNF) ~ RF + FIVECM + ONECM, data = alder))
## Analysis of Variance Table
##
## Response: rank(SNF)
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 5 30684 6137 2.31 0.046 *
## FIVECM 1 47871 47871 18.03 3.4e-05 ***
## ONECM 1 49650 49650 18.70 2.5e-05 ***
## Residuals 188 499241 2656
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Maybe the time in the season?
anova(lm(rank(SNF) ~ RF + ONECM + FIVECM + SAMPPER, data = alder))
## Analysis of Variance Table
##
## Response: rank(SNF)
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 5 30684 6137 3.49 0.0049 **
## ONECM 1 89576 89576 50.90 2.1e-11 ***
## FIVECM 1 7945 7945 4.51 0.0349 *
## SAMPPER 2 171899 85949 48.84 < 2e-16 ***
## Residuals 186 327342 1760
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(rank(SNF) ~ RF + SAMPPER + FIVECM + ONECM, data = alder))
## Analysis of Variance Table
##
## Response: rank(SNF)
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 5 30684 6137 3.49 0.0049 **
## SAMPPER 2 205613 102807 58.42 < 2e-16 ***
## FIVECM 1 55868 55868 31.75 6.4e-08 ***
## ONECM 1 7937 7937 4.51 0.0350 *
## Residuals 186 327342 1760
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And the water? Maybe the time in the season?
anova(lm(rank(SNF) ~ RF + PERH2O + ONECM + FIVECM + SAMPPER, data = alder))
## Analysis of Variance Table
##
## Response: rank(SNF)
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 5 30684 6137 3.47 0.0051 **
## PERH2O 1 19047 19047 10.76 0.0012 **
## ONECM 1 84166 84166 47.57 8.2e-11 ***
## FIVECM 1 14420 14420 8.15 0.0048 **
## SAMPPER 2 151789 75894 42.89 5.0e-16 ***
## Residuals 185 327339 1769
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(rank(SNF) ~ RF + SAMPPER + FIVECM + ONECM + PERH2O, data = alder))
## Analysis of Variance Table
##
## Response: rank(SNF)
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 5 30684 6137 3.47 0.0051 **
## SAMPPER 2 205613 102807 58.10 <2e-16 ***
## FIVECM 1 55868 55868 31.57 7e-08 ***
## ONECM 1 7937 7937 4.49 0.0355 *
## PERH2O 1 3 3 0.00 0.9667
## Residuals 185 327339 1769
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Which is it? Is the water important or not?
Alas, this same problem appears with RF itself.
anova(lm(rank(SNF) ~ RF + PERH2O + ONECM + FIVECM + SAMPPER, data = alder))
## Analysis of Variance Table
##
## Response: rank(SNF)
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 5 30684 6137 3.47 0.0051 **
## PERH2O 1 19047 19047 10.76 0.0012 **
## ONECM 1 84166 84166 47.57 8.2e-11 ***
## FIVECM 1 14420 14420 8.15 0.0048 **
## SAMPPER 2 151789 75894 42.89 5.0e-16 ***
## Residuals 185 327339 1769
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(rank(SNF) ~ PERH2O + ONECM + FIVECM + SAMPPER + RF, data = alder))
## Analysis of Variance Table
##
## Response: rank(SNF)
## Df Sum Sq Mean Sq F value Pr(>F)
## PERH2O 1 17320 17320 9.79 0.002 **
## ONECM 1 81807 81807 46.23 1.4e-10 ***
## FIVECM 1 33114 33114 18.71 2.5e-05 ***
## SAMPPER 2 160314 80157 45.30 < 2e-16 ***
## RF 5 7551 1510 0.85 0.514
## Residuals 185 327339 1769
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RF important or not?RF is related to nitrogen fixing.RF is not directly related to nitrogen fixing.SITE and SITE is strongly related to water, temperature, …]There's probably no compelling way to resolve the ambiguity about RF from these data. You can construct a theory in the form of a model of how the system works and then construct your model based on the theory, blocking backdoor pathways as appropriate. But, as Einstein said:
A theory is something nobody believes, except the person who made it.
Much more convincing to construct an experiment. How might this be done with the alder situation?
Review blocking in this context: Divide up the plots into groups. Within each group, all the plots are similar. Then randomly assign the bacteria (or water, or soil temperature) within each block.
To give the whole of Einstein's quote: A theory is something nobody believes, except the person who made it. An experiment is something everybody believes, except the person who made it.
First, a preliminary observational study of 1000 people.
Obs = run.sim(aspirin, 1000)
coef(summary(glm(stroke == "Y" ~ mgPerDay, data = Obs, family = "binomial")))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.884546 0.123245 -7.1771 7.118e-13
## mgPerDay -0.001713 0.002184 -0.7844 4.328e-01
coef(summary(glm(stroke == "Y" ~ mgPerDay + sick, data = Obs, family = "binomial")))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.54465 0.167213 -9.238 2.520e-20
## mgPerDay -0.03706 0.004535 -8.172 3.035e-16
## sick 0.69018 0.045421 15.195 3.800e-52
Without including the covariate sick, we can't see any effect of aspirin. With sick in place, it looks like aspirin is somewhat protective.
QUESTION: How does aspirin change the odds of having a stroke? [Ans. : Pick a dosage for aspirin, multiply by the coefficient, and undo the log.]
But we're going to run into problems with getting people to accept the results of this study:
sick in practice.Questions we can answer without problem:
But to convince people, let's do an experiment.
Step 1: Decide what dose to give people.
Step 2: Run the experiment:
Ex = run.sim(aspirin, 1000, mgPerDay = c(0, 100))
coef(summary(glm(stroke == "Y" ~ mgPerDay, data = Ex, family = "binomial")))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4222 0.091443 -4.617 3.900e-06
## mgPerDay -0.0171 0.001717 -9.963 2.223e-23
coef(summary(glm(stroke == "Y" ~ mgPerDay + sick, data = Ex, family = "binomial")))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.69249 0.139337 -12.147 5.966e-34
## mgPerDay -0.03743 0.003806 -9.836 7.894e-23
## sick 0.64702 0.048453 13.354 1.130e-40
The coefficient changes. In a linear model, it wouldn't change this much. But nonlinear models are more complicated.
Is there evidence for an interaction?
coef(summary(glm(stroke == "Y" ~ mgPerDay * sick, data = Ex, family = "binomial")))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.691e+00 0.1478998 -11.43529 2.786e-30
## mgPerDay -3.755e-02 0.0061442 -6.11133 9.881e-10
## sick 6.460e-01 0.0629337 10.26530 1.010e-24
## mgPerDay:sick 2.411e-05 0.0009865 0.02444 9.805e-01
Nah.
The experiment is pretty clean. We could have used many fewer people: smaller \( n \).
In reality, we're not going to have perfect compliance with our instructions. Some people given a placebo will take aspirin anyways. Some people won't take their aspirin.
To simulate this, we'll tell the program to add in an influence without severing the connection from sick to mgPerDay.
experiment.size = 200
influence = resample(c(30, 70), experiment.size)
Ex2 = run.sim(aspirin, experiment.size, mgPerDay = influence, inject = TRUE)
Show that with inject=TRUE, the values of mgPerDay reflect both the input and the variable sick
Now build the model:
coef(summary(glm(stroke == "Y" ~ mgPerDay, data = Ex2, family = "binomial")))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.736244 0.51065 -3.4001 0.0006737
## mgPerDay 0.001939 0.00492 0.3941 0.6935051
coef(summary(glm(stroke == "Y" ~ mgPerDay + sick, data = Ex2, family = "binomial")))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.34413 0.802215 -2.922 3.477e-03
## mgPerDay -0.03301 0.009687 -3.408 6.549e-04
## sick 0.79870 0.128623 6.210 5.312e-10
The result is still ambiguous. (A larger experiment would fix this)
Intent to treat means to use the assigned values rather than the measured values.
coef(summary(glm(stroke == "Y" ~ influence, data = Ex2, family = "binomial")))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.03216 0.486610 -2.121 0.03391
## influence -0.01053 0.009388 -1.122 0.26200
coef(summary(glm(stroke == "Y" ~ influence + sick, data = Ex2, family = "binomial")))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.22782 0.82419 -3.916 8.990e-05
## influence -0.02909 0.01497 -1.944 5.188e-02
## sick 0.62896 0.09685 6.494 8.349e-11
This idea of using intent rather than the actual treatment is counter-intuitive. But it's easily understood in terms of the causal diagrams. There's no back-door pathway from intent to stroke.