When you go to a different country, it helps to learn some of the basics of the culture: where it's polite to slurp your soup, how many times to kiss when greeting, whether to shake your head to say “yes”.
In this course, you've been learning a particular approach to statistical reasoning based on the idea of fitting models. It's a very general and powerful approach. But it's not the approach that most people learn. You need to be aware of this when working with others: they will find incomprehensible some of the language that you use fluently. In particular:
- lm() and multiple variables
- resample() and shuffle() are obscure terms, not typically understood
- summary( lm( A-null ~ 1) )
- summary( lm( A~G) )
- summary( lm( A=="Yes" ~ G) )
- anova( lm( A~G) )
- rank() before fitting the model

Chi-squared is a much beloved test, one of the first statistical tests. There's also a chi-squared distribution, which is analogous to an F distribution or a t distribution and has, as a parameter, “degrees of freedom”. It is associated with Karl Pearson, the dean of statistics at the turn of the 19th-20th century.
The \( \chi^2 \) test is used for two distinct but related purposes:

1. to test whether observed counts are consistent with the counts expected under a theoretical model (a “goodness of fit” test), and
2. to test whether two categorical variables are independent of one another.
Example: In genetics, a dihybrid cross involves mating two organisms that differ in two traits, one carrying the dominant forms and the other the recessive forms. Their offspring are then mated, producing four phenotypes in the second generation. Theoretically, the second generation expresses these phenotypes in the ratio 9:3:3:1. For instance, in an agricultural experiment, 1301 plants were grown from a dihybrid cross and the four phenotypes appeared like this:
| Phenotype | Observed | Expected in Theory |
|---|---|---|
| Green | 773 | 731.9 |
| Golden | 231 | 243.9 |
| Green-striped | 238 | 243.9 |
| Golden-green-striped | 59 | 81.3 |
| TOTAL | 1301 | 1301 |
It's pretty obvious which phenotype corresponds to the “9” and which to the “1”. The question is whether the observation is consistent with the theory.
This example is drawn from Snedecor and Cochran (1989), Statistical Methods, 8th ed., p. 196. It draws on work done by Lindstrom (1918). That the example dates from 1918 and was still being used in 1989 says something about the pedigree of the \( \chi^2 \) test. Here are the data in a simple R form:
LindstromObs = c(773, 231, 238, 59)
LindstromExp = 1301 * c(9, 3, 3, 1)/16
In this class, we've focused on fitting models to observed data with the goal of relating explanatory variables to a response. In this dihybrid problem, however, the emphasis is different: we already have a theoretical model and we want to know whether the data are consistent with it.
We've already seen such questions in the context of the Null Hypothesis. So let's let the theory be our null hypothesis and ask if the data are consistent with the null. To do this, we need:

1. a test statistic, and
2. the distribution of that test statistic under the Null.
An obvious test statistic to use is the sum of square residuals. Here's a program to calculate that:
ssResids = function(observed, expected) sum((observed - expected)^2)
Our test statistic:
ts = ssResids(observed = LindstromObs, expected = LindstromExp)
ts
## [1] 2397
To carry out a hypothesis test, we need to be able to generate samples from the Null Hypothesis. Since we know the rate at which each phenotype is being generated, we can use the Poisson random number generator:
counts = rpois(4, lambda = LindstromExp)
counts
## [1] 739 229 248 70
This method is good, but the simulated counts are not necessarily consistent with the total number of observations, 1301. So, a trick: draw 1301 individual outcomes with the theoretical probabilities and then tabulate them.
makeSample = function(expected) {
  n = sum(expected)  # total number of observations (1301 here)
  # draw n individual cases, each landing in a category with the theoretical
  # probability, then count how many fall in each category
  table(resample(1:length(expected), prob = expected/n, size = round(n)))
}
Now, the sampling distribution under the Null Hypothesis:
s = do(1000) * ssResids(observed = makeSample(LindstromExp), expected = LindstromExp)
And the p-value:
mean(result > ts, data = s)
## [1] 0.051
The observed data are a little bit surprising.
As it happens, the test statistic that was developed historically is a bit different: each squared residual is divided by the corresponding expected count, so that \( \chi^2 = \sum (\text{observed} - \text{expected})^2 / \text{expected} \). As a program:
chisq.stat = function(observed, expected) {
sum((observed - expected)^2/expected)
}
Now run the hypothesis test. First, the sampling distribution under the null:
s2 = do(1000) * chisq.stat(observed = makeSample(LindstromExp), expected = LindstromExp)
t2 = chisq.stat(observed = LindstromObs, expected = LindstromExp)
mean(result > t2, data = s2)
## [1] 0.031
The p-value is lower than before, in fact below 0.05. (Remember, there's nothing magical about 0.05. It's just a convention.) Quoting Snedecor: “Lindstrom commented that the deviations [from Mendelian genetic theory] could be largely explained by a physiological cause, namely the weakened condition of the last three classes due to their chlorophyll abnormality. He pointed out in particular that the last class (golden-green-striped) was not very vigorous.”
The change in the p-value occurred because the division by expected in the \( \chi^2 \) statistic means that equal weight is being put on each of the 4 categories. In the simple sum of squares, equal weight is put on each of the cases.
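To see the reweighting concretely, here are the per-category contributions to the two statistics (a small added illustration, not part of the original notes):

resid.sq = (LindstromObs - LindstromExp)^2
rbind(sum.of.squares = round(resid.sq, 1),              # raw squared residuals
      chi.squared    = round(resid.sq/LindstromExp, 2)) # squared residuals scaled by expected count

The large “Green” category dominates the raw sum of squares, while the small “Golden-green-striped” category contributes the most to \( \chi^2 \).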
There's a reason to prefer the \( \chi^2 \) statistic to the simple sum of square residuals: we actually know the variance of a Poisson random variable. It is equal to the mean, \( \lambda \).
ACTIVITY: Generate a large poisson sample of a given \( \lambda \). Find the standard deviation. Try this for several different \( \lambda \), say \( \lambda=25,100,400,1600,6400 \).
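Here is a minimal sketch of that activity (the sample size of 10,000 is an arbitrary choice):

lambdas = c(25, 100, 400, 1600, 6400)
# standard deviation of a large Poisson sample for each lambda
sds = sapply(lambdas, function(lam) sd(rpois(10000, lambda = lam)))
round(rbind(lambda = lambdas, sd = sds, sqrt.lambda = sqrt(lambdas)), 1)

The standard deviations come out close to \( \sqrt{\lambda} \), which is exactly what the division by the expected count in the \( \chi^2 \) statistic is exploiting.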
By dividing \( (\text{observed}-\text{expected})^2 \) by the expected count, we are effectively creating a \( z^2 \) variable, the square of a normal random variable with mean=0 and sd=1.

There was some controversy in the early days about what the distribution of the sum of \( k \) such \( z^2 \) variables should be. It was called the \( \chi^2 \) distribution, but it was observed that adding up three \( z^2 \) variables was closer to what was observed for 4 categories in the chi-squared test than adding up four \( z^2 \) variables.
Of course, making such a finding required that there be some definition of “closer to”. This involved some argument. Remember, in 1900, calculations were difficult. But now it's easy to see that three is better than four. For example, when the distributions match, plotting one versus another (in sorted order) should give a straight line along the diagonal.
z3 = do(1000) * sum(rnorm(3)^2)
z4 = do(1000) * sum(rnorm(4)^2)
plotPoints(sort(z3$result) ~ sort(s2$result))
plotFun(x ~ x, add = TRUE, col = "red")
plotPoints(sort(z4$result) ~ sort(s2$result))
plotFun(x ~ x, add = TRUE, col = "red")
mean(result, data = z3)
## [1] 2.971
mean(result, data = z4)
## [1] 4.009
mean(result, data = s2)
## [1] 3.207
What's going on here is that the four observed counts had to add up to 1301. In other words, the residuals must add up to zero. This is very much what happens when we fit a model: the residuals must add up to zero because we've fitted the model. In this case, we've fitted the expectation so that it must add up to 1301. This requirement takes one degree of freedom away from the four categories. Historically, this is how the concept of degrees of freedom was born.
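A quick check of that constraint (an added illustration): any sample produced by makeSample() totals 1301, so its residuals sum to zero.

sim = makeSample(LindstromExp)
sum(sim)                  # always 1301
sum(sim - LindstromExp)   # so the residuals add up to exactly zero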
In the \( \chi^2 \) test you can see the origins of many of the concepts that have shaped our approach to statistical modeling: residuals, sum of square residuals as a figure of merit, the sampling distribution under a null hypothesis, degrees of freedom.
An example from Snedecor and Cochran (1989, p. 202) looks at 196 leprosy patients classified according to change in health and degree of skin damage (called “infiltration”).
| Degree of Infiltration | Marked Improvement | Moderate | Slight | Stationary | Worse | TOTAL |
|---|---|---|---|---|---|---|
| Little | 11 | 27 | 42 | 53 | 11 | 144 |
| Much | 7 | 15 | 16 | 13 | 1 | 52 |
| TOTAL | 18 | 42 | 58 | 66 | 12 | 196 |
In our modeling framework, we would likely see whether there is a relationship between “Degree of Infiltration” and Improvement by making a logistic regression model with “Degree of Infiltration” as the response. The \( \chi^2 \) test constructs a null hypothesis probability model by assuming that the two variables are independent and then tests the chi-square statistic against that model. Not much different.
Absent a modeling technique for multi-level response variables, we don't have an exact analog of the chi-squared test using modeling.
Note that the appropriate test here is actually “Fisher's Exact Test”, which avoids an approximation made in the \( \chi^2 \) test.
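For reference, here is how the conventional analysis of this table might look. chisq.test() and fisher.test() are standard R functions, though they are not part of the modeling toolkit used in this course, and the name leprosy is just for this illustration.

# the 2x5 table of counts, rows = degree of infiltration
leprosy = rbind(Little = c(11, 27, 42, 53, 11),
                Much   = c( 7, 15, 16, 13,  1))
chisq.test(leprosy)    # Pearson chi-squared test of independence
fisher.test(leprosy)   # Fisher's exact test, avoiding the approximation

R may warn that the chi-squared approximation could be inaccurate here, since one of the expected cell counts (Much, Worse) is below 5; that is exactly the situation in which Fisher's exact test is preferred.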
Here's a simulated example from UCLA with a continuous predictor. The response is the number of awards given to students. The explanatory variables are the score on a math exam and the type of program in which the student was enrolled. The case is an individual student.
p <- read.csv("http://www.ats.ucla.edu/stat/data/poisson_sim.csv")
p <- within(p, {
prog <- factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
id <- factor(id)
})
summary(p)
## id num_awards prog math
## 1 : 1 Min. :0.00 General : 45 Min. :33.0
## 2 : 1 1st Qu.:0.00 Academic :105 1st Qu.:45.0
## 3 : 1 Median :0.00 Vocational: 50 Median :52.0
## 4 : 1 Mean :0.63 Mean :52.6
## 5 : 1 3rd Qu.:1.00 3rd Qu.:59.0
## 6 : 1 Max. :6.00 Max. :75.0
## (Other):194
And the analysis:
m1 = glm(num_awards ~ prog + math, family = "poisson", data = p)
summary(m1)
##
## Call:
## glm(formula = num_awards ~ prog + math, family = "poisson", data = p)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.204 -0.844 -0.511 0.256 2.680
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.2471 0.6585 -7.97 1.6e-15 ***
## progAcademic 1.0839 0.3583 3.03 0.0025 **
## progVocational 0.3698 0.4411 0.84 0.4018
## math 0.0702 0.0106 6.62 3.6e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 287.67 on 199 degrees of freedom
## Residual deviance: 189.45 on 196 degrees of freedom
## AIC: 373.5
##
## Number of Fisher Scoring iterations: 6
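As a small aside (not part of the original analysis), the Poisson model is linear on the log scale, so exponentiating the coefficients of m1 turns them into multiplicative effects on the expected number of awards:

# exp(0.0702) is about 1.07: each extra math point raises the expected count by roughly 7%
round(exp(coef(m1)), 3)

Next, check whether the effect of math score differs by program by adding an interaction: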
m2 = glm(num_awards ~ prog * math, family = "poisson", data = p)
summary(m2)
##
## Call:
## glm(formula = num_awards ~ prog * math, family = "poisson", data = p)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.229 -0.796 -0.530 0.253 2.683
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.8618 2.4932 -1.55 0.12
## progAcademic -0.4411 2.6030 -0.17 0.87
## progVocational -0.8447 2.8699 -0.29 0.77
## math 0.0440 0.0472 0.93 0.35
## progAcademic:math 0.0284 0.0487 0.58 0.56
## progVocational:math 0.0229 0.0542 0.42 0.67
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 287.67 on 199 degrees of freedom
## Residual deviance: 189.10 on 194 degrees of freedom
## AIC: 377.2
##
## Number of Fisher Scoring iterations: 6
No sign of an interaction: each interaction coefficient is small compared to its standard error, and the residual deviance drops only from 189.45 (on 196 degrees of freedom) to 189.10 (on 194 degrees of freedom). That drop of about 0.35 on 2 degrees of freedom corresponds to the interaction term.
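A hedged sketch of how to make that comparison directly, using base R's analysis-of-deviance function:

# the drop in deviance is referred to a chi-squared distribution with 2 degrees of freedom
anova(m1, m2, test = "Chisq")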
ACTIVITY: Do an example with two categorical variables and work through the degrees of freedom. One way to do this is sketched below.
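Here is a minimal sketch, reusing the leprosy counts from the table above and treating each cell count as Poisson, so that the test of independence becomes a comparison of two Poisson regression models (the variable names are made up for this illustration):

# one row per cell of the 2x5 table, with the observed count in Freq
leprosy.df = expand.grid(Infiltration = c("Little", "Much"),
                         Change = c("Marked", "Moderate", "Slight", "Stationary", "Worse"))
leprosy.df$Freq = c(11, 7, 27, 15, 42, 16, 53, 13, 11, 1)
# null model: the two variables act independently on the cell counts
m.indep = glm(Freq ~ Infiltration + Change, family = "poisson", data = leprosy.df)
# alternative: the combination matters, i.e. the variables are associated
m.assoc = glm(Freq ~ Infiltration * Change, family = "poisson", data = leprosy.df)
anova(m.indep, m.assoc, test = "Chisq")

The drop in deviance between the two models plays the same role as the \( \chi^2 \) statistic, and the interaction carries \( (2-1)\times(5-1) = 4 \) degrees of freedom, just as in the conventional test.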
The techniques that we've studied in this course were born at the end of the 19th century and had their adolescence in the early part of the 20th. They were developed originally for the analysis of laboratory experiments, but were quickly adapted to the analysis of complex observational studies.
The ideas behind resampling and randomization emerged in the early part of the 20th century, but they were impractical until electronic computers were invented.
Only now, in the 2010s, are the computational ideas becoming mainstream in statistical education, so hardly anyone older than you will know about them.
Where things are heading: