Earlier in the semester, we spent some time on “robust” versus “non-robust” methods, that is, what to do about outliers. The rank transform offers a simple way to stay within the modeling framework while handling issues of robustness.
“Theory is when you know everything but nothing works. Practice is when everything works but nobody knows why. Here, theory and practice are united: nothing works… and nobody knows why!” — Albert Einstein
In his graduate work at the University of Alaska Fairbanks, Mike Anderson studied nitrogen fixation by bacteria.
He writes:
My dissertation research centers on symbiotic interactions between alder plants (Alnus spp.) and nitrogen-fixing Frankia bacteria on the 100-mile-wide floodplain of the Tanana River near Fairbanks. In this area the symbiosis with Frankia allows alder to colonize newly-formed river terraces, which are very low in soil nitrogen (N). The competitive edge provided to alder by Frankia during this process is short-lived, however, because the alders quickly enhance the availability of N in the soil, which helps other plant species colonize these sites. Large changes in the community and ecosystem follow over the next ~150 years, which changes the environmental context of the relationship between alder and Frankia. How this relationship responds to these changes is the subject of my dissertation research.
One question in his research is how genetic variation in Frankia influences the amount of nitrogen in the soil. With great effort, he collected data, which we have in the alder.csv data set.
alder = fetchData("alder.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/alder.csv
Of particular interest to us today are these variables:
- SNF: specific nitrogen fixation
- RF: the genetic characterization of the bacteria by restriction fragmentation (thus, “RF”)
- ONECM: the temperature measured at 1 cm soil depth
- FIVECM: the temperature measured at 5 cm soil depth
- PERH2O: the percentage of the soil that is water
- SITE: an identifier for the site location
- LAND: the type of the site
- SAMPPER: the time in the season when the information was collected; a specific Julian calendar day is given by JULDAY

Let's do a simple test of whether nitrogen fixation is related to the genetic classification of the bacteria:
anova(lm(SNF ~ RF, data = alder))
## Analysis of Variance Table
##
## Response: SNF
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 5 23646 4729 4.84 0.00037 ***
## Residuals 165 161351 978
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Questions:
It's always sensible to look at the distribution of your data. For instance:
bwplot(SNF ~ RF, data = alder)
I'm a little bit concerned here about the outliers. Maybe they are skewing the results. Perhaps a single point is being overly influential.
It's easy to find out: we'll construct the sampling distribution of F by randomization:
s = do(1000) * anova(lm(SNF ~ shuffle(RF), data = alder))[1, 4]
tally(~result > 4.8, data = s)
##
## TRUE FALSE Total
## 1 999 1000
This is a similar result to what we got parametrically, that is, using the F distribution from theory.
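For comparison, the parametric p-value comes straight from the F distribution, using the F value and degrees of freedom from the ANOVA table above:

1 - pf(4.84, df1 = 5, df2 = 165)

This gives essentially the p-value reported in the table.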
When in doubt about whether a result is being influenced by outliers, you can do the parametric test on the rank of the data.
ACTIVITY:
Here's the result on the rank:
anova(lm(rank(SNF) ~ RF + PERH2O, data = alder))
## Analysis of Variance Table
##
## Response: rank(SNF)
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 5 30684 6137 2.01 0.079 .
## PERH2O 1 19047 19047 6.23 0.013 *
## Residuals 189 577714 3057
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Is this p-value right? We can check it by randomization:
s = do(1000) * anova(lm(rank(SNF) ~ shuffle(RF), data = alder))[1, 4]
tally(~result > 1.9538, data = s)
##
## TRUE FALSE Total
## 84 916 1000
Very close!
QUESTION: How should I decide if the result is close to 0.087?
There are other transformations that can be used. Economists often use the logarithm, which makes sense for prices. Biologists sometimes use logs or square roots, and for proportions they use arc-sines. The rank is a good general-purpose choice. The choice of transformation really only matters when there are outliers or strong skew.
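For instance, a log or square-root transform slots into exactly the same framework. (A sketch: the log requires SNF to be strictly positive and the square root requires it to be non-negative, so check before using.)

anova(lm(log(SNF) ~ RF, data = alder))   # only if SNF > 0
anova(lm(sqrt(SNF) ~ RF, data = alder))  # only if SNF >= 0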
Let's stick with the rank and see if we can improve the p-value a bit by including covariates. Maybe water and soil temperature have an effect?
anova(lm(rank(SNF) ~ RF + PERH2O, data = alder))
## Analysis of Variance Table
##
## Response: rank(SNF)
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 5 30684 6137 2.01 0.079 .
## PERH2O 1 19047 19047 6.23 0.013 *
## Residuals 189 577714 3057
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Water is clearly important! But it didn't eat up enough variance to make RF look much better.
anova(lm(rank(SNF) ~ RF + ONECM, data = alder))
## Analysis of Variance Table
##
## Response: rank(SNF)
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 5 30684 6137 2.29 0.048 *
## ONECM 1 89576 89576 33.38 3.1e-08 ***
## Residuals 189 507185 2684
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Same story.
anova(lm(rank(SNF) ~ RF + FIVECM, data = alder))
## Analysis of Variance Table
##
## Response: rank(SNF)
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 5 30684 6137 2.11 0.066 .
## FIVECM 1 47871 47871 16.48 7.2e-05 ***
## Residuals 189 548891 2904
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Why not try both temperatures? Compare these two ANOVA reports, which differ only in the order of the terms:
anova(lm(rank(SNF) ~ RF + ONECM + FIVECM, data = alder))
## Analysis of Variance Table
##
## Response: rank(SNF)
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 5 30684 6137 2.31 0.046 *
## ONECM 1 89576 89576 33.73 2.7e-08 ***
## FIVECM 1 7945 7945 2.99 0.085 .
## Residuals 188 499241 2656
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(rank(SNF) ~ RF + FIVECM + ONECM, data = alder))
## Analysis of Variance Table
##
## Response: rank(SNF)
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 5 30684 6137 2.31 0.046 *
## FIVECM 1 47871 47871 18.03 3.4e-05 ***
## ONECM 1 49650 49650 18.70 2.5e-05 ***
## Residuals 188 499241 2656
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Maybe the time in the season?
anova(lm(rank(SNF) ~ RF + ONECM + FIVECM + SAMPPER, data = alder))
## Analysis of Variance Table
##
## Response: rank(SNF)
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 5 30684 6137 3.49 0.0049 **
## ONECM 1 89576 89576 50.90 2.1e-11 ***
## FIVECM 1 7945 7945 4.51 0.0349 *
## SAMPPER 2 171899 85949 48.84 < 2e-16 ***
## Residuals 186 327342 1760
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(rank(SNF) ~ RF + SAMPPER + FIVECM + ONECM, data = alder))
## Analysis of Variance Table
##
## Response: rank(SNF)
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 5 30684 6137 3.49 0.0049 **
## SAMPPER 2 205613 102807 58.42 < 2e-16 ***
## FIVECM 1 55868 55868 31.75 6.4e-08 ***
## ONECM 1 7937 7937 4.51 0.0350 *
## Residuals 186 327342 1760
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And what about the water? Let's include PERH2O along with everything else, again in two different orders:
anova(lm(rank(SNF) ~ RF + PERH2O + ONECM + FIVECM + SAMPPER, data = alder))
## Analysis of Variance Table
##
## Response: rank(SNF)
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 5 30684 6137 3.47 0.0051 **
## PERH2O 1 19047 19047 10.76 0.0012 **
## ONECM 1 84166 84166 47.57 8.2e-11 ***
## FIVECM 1 14420 14420 8.15 0.0048 **
## SAMPPER 2 151789 75894 42.89 5.0e-16 ***
## Residuals 185 327339 1769
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(rank(SNF) ~ RF + SAMPPER + FIVECM + ONECM + PERH2O, data = alder))
## Analysis of Variance Table
##
## Response: rank(SNF)
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 5 30684 6137 3.47 0.0051 **
## SAMPPER 2 205613 102807 58.10 <2e-16 ***
## FIVECM 1 55868 55868 31.57 7e-08 ***
## ONECM 1 7937 7937 4.49 0.0355 *
## PERH2O 1 3 3 0.00 0.9667
## Residuals 185 327339 1769
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Which is it? Is the water important or not?
Alas, this same problem appears with RF itself.
anova(lm(rank(SNF) ~ RF + PERH2O + ONECM + FIVECM + SAMPPER, data = alder))
## Analysis of Variance Table
##
## Response: rank(SNF)
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 5 30684 6137 3.47 0.0051 **
## PERH2O 1 19047 19047 10.76 0.0012 **
## ONECM 1 84166 84166 47.57 8.2e-11 ***
## FIVECM 1 14420 14420 8.15 0.0048 **
## SAMPPER 2 151789 75894 42.89 5.0e-16 ***
## Residuals 185 327339 1769
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(rank(SNF) ~ PERH2O + ONECM + FIVECM + SAMPPER + RF, data = alder))
## Analysis of Variance Table
##
## Response: rank(SNF)
## Df Sum Sq Mean Sq F value Pr(>F)
## PERH2O 1 17320 17320 9.79 0.002 **
## ONECM 1 81807 81807 46.23 1.4e-10 ***
## FIVECM 1 33114 33114 18.71 2.5e-05 ***
## SAMPPER 2 160314 80157 45.30 < 2e-16 ***
## RF 5 7551 1510 0.85 0.514
## Residuals 185 327339 1769
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Is RF important or not? There are two stories consistent with these results:

- RF is related to nitrogen fixing.
- RF is not directly related to nitrogen fixing; rather, RF varies with SITE, and SITE is strongly related to water, temperature, …

There's probably no compelling way to resolve the ambiguity about RF from these data. You can construct a theory, in the form of a model of how the system works, and then build your statistical model based on that theory, blocking backdoor pathways as appropriate. But, as Einstein said:
A theory is something nobody believes, except the person who made it.
Much more convincing to construct an experiment. How might this be done with the alder situation?
Review blocking in this context: Divide up the plots into groups. Within each group, all the plots are similar. Then randomly assign the bacteria (or water, or soil temperature) within each block.
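Here is a sketch of what that random assignment might look like in R, with made-up blocks and two hypothetical treatments:

# 5 blocks of 4 similar plots; shuffle a balanced set of treatments within each block
set.seed(1)
plots = data.frame(block = rep(1:5, each = 4))
plots$treatment = unlist(tapply(rep(c("A", "B"), length.out = nrow(plots)),
    plots$block, sample))
plots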
To give the whole of Einstein's quote: A theory is something nobody believes, except the person who made it. An experiment is something everybody believes, except the person who made it.
As an example, consider a simulation of aspirin use and strokes. First, a preliminary observational study of 1000 people.
fetchData("simulate.r")
## [1] TRUE
Obs = run.sim(aspirin, 1000)
coef(summary(glm(stroke == "Y" ~ mgPerDay, data = Obs, family = "binomial")))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.911949 0.12127 -7.52 5.482e-14
## mgPerDay -0.003871 0.00225 -1.72 8.536e-02
coef(summary(glm(stroke == "Y" ~ mgPerDay + sick, data = Obs, family = "binomial")))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.86383 0.184014 -10.129 4.119e-24
## mgPerDay -0.04402 0.005265 -8.361 6.242e-17
## sick 0.79704 0.053974 14.767 2.388e-49
Without including the covariate sick, we can't see any effect of aspirin. With sick in place, it looks like aspirin is somewhat protective.
QUESTION: How does aspirin change the odds of having a stroke? [Ans.: Pick a dosage of aspirin, multiply it by the coefficient, and undo the log.]
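For instance, using the mgPerDay coefficient from the sick-adjusted model above, a dose of 100 mg per day multiplies the odds of a stroke by

exp(-0.04402 * 100)

which is about 0.012, a large reduction in the odds. (Remember, these are simulated data.)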
But we're going to run into problems with getting people to accept the results of this study:
We may not be able to measure sick in practice, and skeptics can always point to other confounders we haven't measured. There are questions we can answer from observational data without problem, but the causal effect of aspirin isn't convincingly one of them.
But to convince people, let's do an experiment.
Step 1: Decide what dose to give people.
Step 2: Run the experiment:
Ex = run.sim(aspirin, 1000, mgPerDay = c(0, 100))
coef(summary(glm(stroke == "Y" ~ mgPerDay, data = Ex, family = "binomial")))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.36397 0.090928 -4.003 6.260e-05
## mgPerDay -0.01418 0.001565 -9.064 1.261e-19
coef(summary(glm(stroke == "Y" ~ mgPerDay + sick, data = Ex, family = "binomial")))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.76333 0.14325 -12.309 8.068e-35
## mgPerDay -0.03593 0.00373 -9.633 5.816e-22
## sick 0.69075 0.05004 13.805 2.373e-43
The mgPerDay coefficient changes when sick is added, even though the randomized assignment broke the link between sick and the dose. In a linear model, it wouldn't change this much, but nonlinear models like logistic regression are more complicated: coefficients can shift when covariates are added even in the absence of confounding.
Is there evidence for an interaction?
coef(summary(glm(stroke == "Y" ~ mgPerDay * sick, data = Ex, family = "binomial")))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.7323146 0.152285 -11.3755 5.541e-30
## mgPerDay -0.0386253 0.006298 -6.1327 8.638e-10
## sick 0.6663677 0.064799 10.2837 8.348e-25
## mgPerDay:sick 0.0005739 0.001028 0.5582 5.767e-01
Nah.
The experiment is pretty clean. We could have used many fewer people: smaller \( n \).
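For instance, here is how a smaller trial could be run with the same simulation (a sketch; whether the effect remains clearly detectable depends on the particular random run):

Small = run.sim(aspirin, 200, mgPerDay = c(0, 100))
coef(summary(glm(stroke == "Y" ~ mgPerDay, data = Small, family = "binomial")))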
In reality, we're not going to have perfect compliance with our instructions. Some people given a placebo will take aspirin anyways. Some people won't take their aspirin.
To simulate this, we'll tell the program to add in an influence without severing the connection from sick to mgPerDay.
experiment.size = 200
influence = resample(c(30, 70), experiment.size)
Ex2 = run.sim(aspirin, experiment.size, mgPerDay = influence, inject = TRUE)
Show that with inject=TRUE, the values of mgPerDay reflect both the input and the variable sick.
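One way to check this (a sketch; the exact coding of sick comes from the simulation):

# The realized dose should depend on both the assigned dose (influence) and sick
xyplot(mgPerDay ~ sick, groups = influence, data = Ex2, auto.key = TRUE)
coef(lm(mgPerDay ~ influence + sick, data = Ex2))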
Now build the model:
coef(summary(glm(stroke == "Y" ~ mgPerDay, data = Ex2, family = "binomial")))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.00171 0.556046 -1.801 0.07163
## mgPerDay -0.01034 0.005946 -1.739 0.08199
coef(summary(glm(stroke == "Y" ~ mgPerDay + sick, data = Ex2, family = "binomial")))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3975 0.98430 -2.436 1.486e-02
## mgPerDay -0.0357 0.01021 -3.498 4.683e-04
## sick 0.7052 0.13079 5.392 6.967e-08
The result from the simple model is still ambiguous. (A larger experiment would fix this.)
Intent to treat means to use the assigned values rather than the measured values.
coef(summary(glm(stroke == "Y" ~ influence, data = Ex2, family = "binomial")))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.33223 0.60902 0.5455 0.5854037
## influence -0.05411 0.01585 -3.4143 0.0006395
coef(summary(glm(stroke == "Y" ~ influence + sick, data = Ex2, family = "binomial")))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.74767 1.04113 -2.639 8.312e-03
## influence -0.05859 0.01813 -3.232 1.228e-03
## sick 0.59521 0.11949 4.981 6.312e-07
This idea of using intent rather than the actual treatment is counter-intuitive. But it's easily understood in terms of the causal diagrams. There's no back-door pathway from intent to stroke.
When you go to a different country, it helps to learn some of the basics of the culture: where it's polite to slurp your soup, how many times to kiss when greeting, whether to shake your head to say “yes”.
In this course, you've been learning a particular approach to statistical reasoning based on the idea of fitting models. It's a very general and powerful approach. But it's not the approach that most people learn. You need to be aware of this when working with others: they will find incomprehensible some of the language that you use fluently. In particular:
- Modeling with lm() and multiple explanatory variables is not the way most people have been trained to think about data.
- Resampling and randomization (resample() and shuffle()) are obscure terms, not typically understood.
- Most people work with named tests rather than models. Rough translations into our framework: the one-sample t-test is summary( lm( A-null ~ 1) ); the two-sample t-test is summary( lm( A~G) ); the test of a difference in proportions is summary( lm( A=="Yes" ~ G) ); one-way ANOVA is anova( lm( A~G) ); and the standard nonparametric tests amount to applying rank() before fitting the model.

Chi-squared is a much beloved test, one of the first statistical tests. There is also a chi-squared distribution, which is analogous to an F distribution or a t distribution and has, as a parameter, “degrees of freedom”. It is associated with Karl Pearson, the dean of statistics at the turn of the 19th to 20th century.
The \( \chi^2 \) test is used for two distinct but related purposes:

- to check whether observed counts are consistent with a theoretical distribution (a “goodness of fit” test), and
- to check whether two categorical variables are independent of one another (a contingency-table test).
Example: In genetics, a dihybrid cross involves mating two organisms that differ in two traits, one showing the dominant form of both traits and the other the recessive form. Their offspring are then mated, producing 4 phenotypes in the second generation, which theoretically appear in the ratio 9:3:3:1. For instance, in an agricultural experiment, 1301 plants were grown from a dihybrid cross and the four phenotypes appeared like this:
| Phenotype | Observed | Expected in Theory |
|---|---|---|
| Green | 773 | 731.9 |
| Golden | 231 | 243.9 |
| Green-striped | 238 | 243.9 |
| Golden-green-striped | 59 | 81.3 |
| TOTAL | 1301 | 1301 |
It's pretty obvious which phenotype corresponds to the “9” and which to the “1”. The question is whether the observation is consistent with the theory.
This example is drawn from Snedecor and Cochran (1989), Statistical Methods 8th ed. p 196. It draws on work done by Lindstrom (1918). That the example dates from 1918 and is still being used in 1989 says something about the pedigree of the \( \chi^2 \) test. Here are the data in a simple R form
LindstromObs = c(773, 231, 238, 59)
LindstromExp = 1301 * c(9, 3, 3, 1)/16
In this class, we've focussed on fitting models to observed data with the goal of relating explanatory variables to a response. In this dihybrid problem, however, the emphasis is different: we have a model and we want to know if the data are consistent with it.
We've already seen such questions in the context of the Null Hypothesis. So let's let the theory be our null hypothesis and ask if the data are consistent with the null. To do this, we need (1) a test statistic and (2) the distribution of the test statistic under the Null.
An obvious test statistic to use is the sum of square residuals. Here's a program to calculate that:
ssResids = function(observed, expected) sum((observed - expected)^2)
Our test statistic:
ts = ssResids(observed = LindstromObs, expected = LindstromExp)
ts
## [1] 2397
To carry out a hypothesis test, we need to be able to generate samples from the Null Hypothesis. Since we know the rate at which each phenotype is being generated, we can use the Poisson random number generator.
counts = rpois(4, lambda = LindstromExp)
counts
## [1] 732 264 287 75
This method is good, but it's not necessarily consistent with the total number of observations, 1301. So, a trick:
makeSample = function(expected) {
    n = sum(expected)
    # Draw n individuals, each landing in category k with probability
    # expected[k]/n, then count how many fall in each category.
    table(resample(1:length(expected), prob = expected/n, size = round(n)))
}
Now, the sampling distribution under the Null Hypothesis:
s = do(1000) * ssResids(observed = makeSample(LindstromExp), expected = LindstromExp)
And the p-value:
mean(result > ts, data = s)
## [1] 0.034
The observed data are a little bit surprising.
As it happens, the test statistic that was developed historically is a bit different:
chisq.stat = function(observed, expected) {
sum((observed - expected)^2/expected)
}
Running the hypothesis test with this statistic: the sampling distribution under the null, the observed value, and then the p-value:
s2 = do(1000) * chisq.stat(observed = makeSample(LindstromExp), expected = LindstromExp)
t2 = chisq.stat(observed = LindstromObs, expected = LindstromExp)
mean(result > t2, data = s2)
## [1] 0.029
The p-value is lower than before, in fact below 0.05. (Remember, there's nothing magical about 0.05. It's just a convention.) Quoting Snedecor: “Lindstrom commented that the deviations [from Mendelian genetic theory] could be largely explained by a physiological cause, namely the weakened condition of the last three classes due to their chlorophyll abnormality. He pointed out in particular that the last class (golden-green-striped) was not very vigorous.”
The change in the p-value occurred because the division by expected in the \( \chi^2 \) statistic puts each of the 4 categories on the same scale. In the simple sum of squares, each squared residual gets equal weight, so the categories with large expected counts (and therefore large variance) dominate.
There's a reason to prefer the \( \chi^2 \) statistic to the simple sum of square residuals: we actually know the variance of a Poisson random variable.
ACTIVITY: Generate a large poisson sample of a given \( \lambda \). Find the standard deviation. Try this for several different \( \lambda \), say \( \lambda=25,100,400,1600,6400 \).
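One way to carry out the activity (a sketch; the sample size of 10000 is arbitrary):

# The standard deviation of a Poisson sample is close to sqrt(lambda)
for (lambda in c(25, 100, 400, 1600, 6400)) {
    samp = rpois(10000, lambda = lambda)
    cat("lambda =", lambda, " sd =", round(sd(samp), 1), " sqrt(lambda) =", sqrt(lambda), "\n")
}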
By dividing \( (\mbox{observed}-\mbox{expected})^2 \) by the expected count, we are effectively creating a \( z^2 \) variable, the square of a normal random variable with mean 0 and standard deviation 1.
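A quick check of that claim (a sketch; the expected count of 250 is arbitrary):

# Standardized Poisson residuals have mean near 0 and sd near 1
expected = 250
z = (rpois(10000, lambda = expected) - expected)/sqrt(expected)
c(mean = mean(z), sd = sd(z))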
There was some controversy in the early days about what the distribution of the sum of \( k \) \( z^2 \) variables should be. It was called the \( \chi^2 \) distribution, but it was observed that, for 4 categories in the chi-squared test, the sum of three \( z^2 \) variables matched the sampling distribution better than the sum of four.
Of course, making such a finding required that there be some definition of “closer to”. This involved some argument. Remember, in 1900, calculations were difficult. But now it's easy to see that three is better than four. For example, when the distributions match, plotting one versus another (in sorted order) should give a straight line along the diagonal.
z3 = do(1000) * sum(rnorm(3)^2)
z4 = do(1000) * sum(rnorm(4)^2)
plotPoints(sort(z3$result) ~ sort(s2$result))
plotFun(x ~ x, add = TRUE, col = "red")
plotPoints(sort(z4$result) ~ sort(s2$result))
plotFun(x ~ x, add = TRUE, col = "red")
mean(result, data = z3)
## [1] 2.956
mean(result, data = z4)
## [1] 3.961
mean(result, data = s2)
## [1] 2.969
What's going on here is that the four counts had to add up to 1301. In other words, the residuals must add up to zero. This is very much what happens when we fit a model: the residuals must add up to zero because we've fitted the model. In this case, we've arranged the simulation so that the counts must add up to 1301, and this requirement takes one degree of freedom away. Historically, this is how the concept of degrees of freedom was born.
In the \( \chi^2 \) test you can see the origins of many of the concepts that have shaped our approach to statistical modeling: residuals, sum of square residuals as a figure of merit, the sampling distribution under a null hypothesis, degrees of freedom.
An example from Snedecor (1989, p. 202) looks at 196 leprosy patients classified according to change in health and degree of skin damage (called “infiltration”).
| Degree of Infiltration | Marked Improvement | Moderate | Slight | Stationary | Worse | TOTAL |
|---|---|---|---|---|---|---|
| Little | 11 | 27 | 42 | 53 | 11 | 144 |
| Much | 7 | 15 | 16 | 13 | 1 | 52 |
| TOTAL | 18 | 42 | 58 | 66 | 12 | 196 |
In our modeling framework, we would likely see whether there is a relationship between “Degree of Infiltration” and Improvement by making a logistic regression model with “Degree of Infiltration” as the response. The \( \chi^2 \) test constructs a null hypothesis probability model by assuming that the two variables are independent and then tests the chi-square statistic against that model. Not much different.
Absent a modeling technique for multi-level response variables, we don't have an exact analog of the chi-squared test using modeling.
Note that the appropriate test here is actually “Fisher's Exact Test”, which avoids an approximation made in the \( \chi^2 \) test.
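For reference, here is how the classical tests look on these counts (a sketch; the object name leprosy is made up for this example):

leprosy = rbind(Little = c(11, 27, 42, 53, 11), Much = c(7, 15, 16, 13, 1))
colnames(leprosy) = c("Marked", "Moderate", "Slight", "Stationary", "Worse")
chisq.test(leprosy)   # classical chi-squared test of independence
fisher.test(leprosy)  # exact test, avoiding the chi-squared approximation

chisq.test() may warn that the approximation is doubtful here, because the expected count in the Much/Worse cell is only about 3; that is exactly the situation in which Fisher's exact test is preferred.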
Here's an example with a count response and a continuous predictor, fit with Poisson regression, using simulated data from UCLA. The response is the number of awards given to students. The explanatory variables are the score on a math exam and the type of program in which the student was enrolled. The case is an individual student.
p <- read.csv("http://www.ats.ucla.edu/stat/data/poisson_sim.csv")
p <- within(p, {
prog <- factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
id <- factor(id)
})
summary(p)
## id num_awards prog math
## 1 : 1 Min. :0.00 General : 45 Min. :33.0
## 2 : 1 1st Qu.:0.00 Academic :105 1st Qu.:45.0
## 3 : 1 Median :0.00 Vocational: 50 Median :52.0
## 4 : 1 Mean :0.63 Mean :52.6
## 5 : 1 3rd Qu.:1.00 3rd Qu.:59.0
## 6 : 1 Max. :6.00 Max. :75.0
## (Other):194
And the analysis:
m1 = glm(num_awards ~ prog + math, family = "poisson", data = p)
summary(m1)
##
## Call:
## glm(formula = num_awards ~ prog + math, family = "poisson", data = p)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.204 -0.844 -0.511 0.256 2.680
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.2471 0.6585 -7.97 1.6e-15 ***
## progAcademic 1.0839 0.3583 3.03 0.0025 **
## progVocational 0.3698 0.4411 0.84 0.4018
## math 0.0702 0.0106 6.62 3.6e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 287.67 on 199 degrees of freedom
## Residual deviance: 189.45 on 196 degrees of freedom
## AIC: 373.5
##
## Number of Fisher Scoring iterations: 6
m2 = glm(num_awards ~ prog * math, family = "poisson", data = p)
summary(m2)
##
## Call:
## glm(formula = num_awards ~ prog * math, family = "poisson", data = p)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.229 -0.796 -0.530 0.253 2.683
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.8618 2.4932 -1.55 0.12
## progAcademic -0.4411 2.6030 -0.17 0.87
## progVocational -0.8447 2.8699 -0.29 0.77
## math 0.0440 0.0472 0.93 0.35
## progAcademic:math 0.0284 0.0487 0.58 0.56
## progVocational:math 0.0229 0.0542 0.42 0.67
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 287.67 on 199 degrees of freedom
## Residual deviance: 189.10 on 194 degrees of freedom
## AIC: 377.2
##
## Number of Fisher Scoring iterations: 6
No sign of an interaction: the interaction coefficients are small relative to their standard errors, and the residual deviance barely changes (189.45 to 189.10) over the two extra degrees of freedom corresponding to the interaction terms.
Do an example with two categorical variables.
Work through the degrees of freedom.
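One way to do both (a sketch; the data frame and variable names are made up for the example, reusing the leprosy counts above): a two-way table of counts can be fit as a Poisson (log-linear) model, and the deviance test for the interaction is the likelihood-ratio analog of the chi-squared test of independence.

leprosyDF = expand.grid(infiltration = c("Little", "Much"),
    outcome = c("Marked", "Moderate", "Slight", "Stationary", "Worse"))
leprosyDF$count = c(11, 7, 27, 15, 42, 16, 53, 13, 11, 1)
anova(glm(count ~ infiltration * outcome, family = "poisson", data = leprosyDF),
    test = "Chisq")

The interaction has (2 - 1) * (5 - 1) = 4 degrees of freedom, matching the degrees of freedom of the classical chi-squared test for this table.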
The techniques that we've studied in this course were born at the end of the 19th century and had their adolescence in the early part of the 20th century. They were developed originally for the analysis of laboratory experiments, but were quickly adapted to the analysis of complex observational studies.
The ideas behind resampling and randomization emerged in the early part of the 20th century, but they were impractical until electronic computers were invented.
Only now, in the 2010s, are the computational ideas becoming mainstream in statistical education, so hardly anyone older than you will know about them.
Where things are heading: