Monday May 6, 2013. Stats 155 Class Notes

Last Day: Wrapping Up

0. Mentioning Chi-Squared.

1. A review problem based in ANOVA introducing a new idea:

The rank transform

Earlier in the semester, we spent some time on “robust” versus “non-robust”, what to do about outliers. The rank transform offers a simple way to use the modeling framework and handle issues relating to robustness.

There are other methods, but rank is simple and pretty good.
Other sorts of transformations are used, especially the log transform.
- There's actually a theory of these, but it take a day or so to introduce.
- Logs are appropriate whenever proportional change is relevant, e.g. prices, population growth

2. A little bit more about DAGs and how they relate to experiment

Intervention disconnecting causal connections
Intent to treat and instrumental variables.

Quote for the Day:

“Theorie ist, wenn man alles weiss, aber nichts funktioniert. Praxis ist, wenn alles funktioniert, aber niemand weiss warum. Hier ist Theorie und Praxis vereint: nichts funktioniert… und niemand weiss wieso!” — Albert Einstein

A Scientific Question

In his graduate work at the University of Fairbanks, Alaska, Mike Anderson studied nitrogen fixing by bacteria.

He writes:

My dissertation research centers on symbiotic interactions between alder plants (Alnus spp.) and nitrogen-fixing Frankia bacteria on the 100-mile-wide floodplain of the Tanana River near Fairbanks. In this area the symbiosis with Frankia allows alder to colonize newly-formed river terraces, which are very low in soil nitrogen (N). The competitive edge provided to alder by Frankia during this process is short-lived, however, because the alders quickly enhance the availability of N in the soil, which helps other plant species colonize these sites. Large changes in the community and ecosystem follow over the next ~150 years, which changes the environmental context of the relationship between alder and Frankia. How this relationship responds to these changes is the subject of my dissertation research.

One question in his research is how genetic variation in Frankia influences the amount of of nitrogen in the soil. With great effort, he collected data, which we have in the alder.csv data set.

alder = fetchData("alder.csv")

## Retrieving from http://www.mosaic-web.org/go/datasets/alder.csv

Of particular interest to us today are these variables:

SNF: specific nitrogen fixation
RF: the genetic characterization of the bacteria by restriction fragmentation (thus, “RF”)
ONECM: the temperature measured at 1cm soil depth
FIVECM: the temperature measured at 5cm soid depth
PERH2O: the percentage of the soil that's water
SITE: an identifier for the site location
LAND: the type of the site
SAMPPER: the time in the season when the information was collected. A specific Julian calendar day is given by JULDAY

Let's do a simple test of whether nitrogen fixation is related to the genetic classification of the bacteria:

anova(lm(SNF ~ RF, data = alder))

## Analysis of Variance Table
## 
## Response: SNF
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## RF          5  23646    4729    4.84 0.00037 ***
## Residuals 165 161351     978                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Questions:

How many cases were there?
How many levels of the genetic classification were there?

It's always sensible to look at the distribution of your data. For instance:

bwplot(SNF ~ RF, data = alder)

plot of chunk unnamed-chunk-4

I'm a little bit concerned here about the outliers. Maybe they are skewing the results. Perhaps a single point is being overly influential.

Easy to find out: We'll construct the sampling distribution of F by randomization:

s = do(1000) * anova(lm(SNF ~ shuffle(RF), data = alder))[1, 4]
tally(~result > 4.8, data = s)

## 
##  TRUE FALSE Total 
##     1   999  1000

This is a similar result to what we got parametrically, that is, using the F distribution from theory.

When in doubt about whether a result is being influenced by outliers, you can do the parametric test on the rank of the data.

ACTIVITY:

Show that the rank has a uniform distribution
Make a boxplot and a conditional density plot of the rank SNF

Here's the result on the rank:

anova(lm(rank(SNF) ~ RF + PERH2O, data = alder))

## Analysis of Variance Table
## 
## Response: rank(SNF)
##            Df Sum Sq Mean Sq F value Pr(>F)  
## RF          5  30684    6137    2.01  0.079 .
## PERH2O      1  19047   19047    6.23  0.013 *
## Residuals 189 577714    3057                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Is this p-value right? We can check it by randomization:

s = do(1000) * anova(lm(rank(SNF) ~ shuffle(RF), data = alder))[1, 4]
tally(~result > 1.9538, data = s)

## 
##  TRUE FALSE Total 
##    84   916  1000

Very close!

QUESTION: How should I decide if the result is close to 0.087?

There are other transformations that can be used. Economists often use the logarithm, which makes sense for prices. Biologists sometimes use logs or square roots, and for proportions they use arc-sines. The rank is a good general-purpose solution. It's really only an issue if there are outliers.

Let's stick with the rank and see if we can improve the p-value a bit by including covariates. Maybe water and soil temperature have an effect?

anova(lm(rank(SNF) ~ RF + PERH2O, data = alder))

## Analysis of Variance Table
## 
## Response: rank(SNF)
##            Df Sum Sq Mean Sq F value Pr(>F)  
## RF          5  30684    6137    2.01  0.079 .
## PERH2O      1  19047   19047    6.23  0.013 *
## Residuals 189 577714    3057                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Water is clearly important! But it didn't eat up so much variance that it makes RF look much better.

anova(lm(rank(SNF) ~ RF + ONECM, data = alder))

## Analysis of Variance Table
## 
## Response: rank(SNF)
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## RF          5  30684    6137    2.29   0.048 *  
## ONECM       1  89576   89576   33.38 3.1e-08 ***
## Residuals 189 507185    2684                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Same story.

anova(lm(rank(SNF) ~ RF + FIVECM, data = alder))

## Analysis of Variance Table
## 
## Response: rank(SNF)
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## RF          5  30684    6137    2.11   0.066 .  
## FIVECM      1  47871   47871   16.48 7.2e-05 ***
## Residuals 189 548891    2904                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Why not try both temperatures? Compare these two ANOVA reports

anova(lm(rank(SNF) ~ RF + ONECM + FIVECM, data = alder))

## Analysis of Variance Table
## 
## Response: rank(SNF)
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## RF          5  30684    6137    2.31   0.046 *  
## ONECM       1  89576   89576   33.73 2.7e-08 ***
## FIVECM      1   7945    7945    2.99   0.085 .  
## Residuals 188 499241    2656                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(lm(rank(SNF) ~ RF + FIVECM + ONECM, data = alder))

## Analysis of Variance Table
## 
## Response: rank(SNF)
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## RF          5  30684    6137    2.31   0.046 *  
## FIVECM      1  47871   47871   18.03 3.4e-05 ***
## ONECM       1  49650   49650   18.70 2.5e-05 ***
## Residuals 188 499241    2656                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Maybe the time in the season?

anova(lm(rank(SNF) ~ RF + ONECM + FIVECM + SAMPPER, data = alder))

## Analysis of Variance Table
## 
## Response: rank(SNF)
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## RF          5  30684    6137    3.49  0.0049 ** 
## ONECM       1  89576   89576   50.90 2.1e-11 ***
## FIVECM      1   7945    7945    4.51  0.0349 *  
## SAMPPER     2 171899   85949   48.84 < 2e-16 ***
## Residuals 186 327342    1760                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(lm(rank(SNF) ~ RF + SAMPPER + FIVECM + ONECM, data = alder))

## Analysis of Variance Table
## 
## Response: rank(SNF)
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## RF          5  30684    6137    3.49  0.0049 ** 
## SAMPPER     2 205613  102807   58.42 < 2e-16 ***
## FIVECM      1  55868   55868   31.75 6.4e-08 ***
## ONECM       1   7937    7937    4.51  0.0350 *  
## Residuals 186 327342    1760                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And the water? Maybe the time in the season?

anova(lm(rank(SNF) ~ RF + PERH2O + ONECM + FIVECM + SAMPPER, data = alder))

## Analysis of Variance Table
## 
## Response: rank(SNF)
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## RF          5  30684    6137    3.47  0.0051 ** 
## PERH2O      1  19047   19047   10.76  0.0012 ** 
## ONECM       1  84166   84166   47.57 8.2e-11 ***
## FIVECM      1  14420   14420    8.15  0.0048 ** 
## SAMPPER     2 151789   75894   42.89 5.0e-16 ***
## Residuals 185 327339    1769                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(lm(rank(SNF) ~ RF + SAMPPER + FIVECM + ONECM + PERH2O, data = alder))

## Analysis of Variance Table
## 
## Response: rank(SNF)
##            Df Sum Sq Mean Sq F value Pr(>F)    
## RF          5  30684    6137    3.47 0.0051 ** 
## SAMPPER     2 205613  102807   58.10 <2e-16 ***
## FIVECM      1  55868   55868   31.57  7e-08 ***
## ONECM       1   7937    7937    4.49 0.0355 *  
## PERH2O      1      3       3    0.00 0.9667    
## Residuals 185 327339    1769                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Which is it? Is the water important or not?

Alas, this same problem appears with RF itself.

anova(lm(rank(SNF) ~ RF + PERH2O + ONECM + FIVECM + SAMPPER, data = alder))

## Analysis of Variance Table
## 
## Response: rank(SNF)
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## RF          5  30684    6137    3.47  0.0051 ** 
## PERH2O      1  19047   19047   10.76  0.0012 ** 
## ONECM       1  84166   84166   47.57 8.2e-11 ***
## FIVECM      1  14420   14420    8.15  0.0048 ** 
## SAMPPER     2 151789   75894   42.89 5.0e-16 ***
## Residuals 185 327339    1769                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(lm(rank(SNF) ~ PERH2O + ONECM + FIVECM + SAMPPER + RF, data = alder))

## Analysis of Variance Table
## 
## Response: rank(SNF)
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## PERH2O      1  17320   17320    9.79   0.002 ** 
## ONECM       1  81807   81807   46.23 1.4e-10 ***
## FIVECM      1  33114   33114   18.71 2.5e-05 ***
## SAMPPER     2 160314   80157   45.30 < 2e-16 ***
## RF          5   7551    1510    0.85   0.514    
## Residuals 185 327339    1769                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

DISCUSSION: So is `RF` important or not?

There's evidence to support a claim that RF is related to nitrogen fixing.
There's evidence to support a claim that RF is not directly related to nitrogen fixing.
In everyday language, what is it about the biological/ecological/geological situation that creates the ambiguity? [That RF depends on SITE and SITE is strongly related to water, temperature, …]
Construct a model or other statistical analysis that supports your hypothesis in (3).

There's probably no compelling way to resolve the ambiguity about RF from these data. You can construct a theory in the form of a model of how the system works and then construct your model based on the theory, blocking backdoor pathways as appropriate. But, as Einstein said:

A theory is something nobody believes, except the person who made it.

Much more convincing to construct an experiment. How might this be done with the alder situation?

Not clear!
Try to innoculate the area with one bacterium or another?
Change the water? The soil temperature?
Given that we have reason to believe that location, water, temperature, etc. are important, we will want to arrange the different bacteria so that the space spanned by the variable is as orthogonal as possible to the quantities we measure.

Review blocking in this context: Divide up the plots into groups. Within each group, all the plots are similar. Then randomly assign the bacteria (or water, or soil temperature) within each block.

To give the whole of Einstein's quote: A theory is something nobody believes, except the person who made it. An experiment is something everybody believes, except the person who made it.

The Aspirin Experiment

First, a preliminary observational study of 1000 people.

fetchData("simulate.r")

## [1] TRUE

Obs = run.sim(aspirin, 1000)
coef(summary(glm(stroke == "Y" ~ mgPerDay, data = Obs, family = "binomial")))

##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -0.911949    0.12127   -7.52 5.482e-14
## mgPerDay    -0.003871    0.00225   -1.72 8.536e-02

coef(summary(glm(stroke == "Y" ~ mgPerDay + sick, data = Obs, family = "binomial")))

##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -1.86383   0.184014 -10.129 4.119e-24
## mgPerDay    -0.04402   0.005265  -8.361 6.242e-17
## sick         0.79704   0.053974  14.767 2.388e-49

Without including the covariate sick, we can't see any effect of aspirin. With sick in place, it looks like aspirin is somewhat protective.

QUESTION: How does aspirin change the odds of having a stroke? [Ans. : Pick a dosage for aspirin, multiply by the coefficient, and undo the log.]

But we're going to run into problems with getting people to accept the results of this study:

They will argue that some other variable might be at work.
They will challenge how we measure sick in practice.
They will say that the study is overly simplistic.

Questions we can answer without problem:

Is there evidence for a relationship?
Is the magnitude of the relationship substantial?

But to convince people, let's do an experiment.

Step 1: Decide what dose to give people.
Step 2: Run the experiment:

Ex = run.sim(aspirin, 1000, mgPerDay = c(0, 100))
coef(summary(glm(stroke == "Y" ~ mgPerDay, data = Ex, family = "binomial")))

##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -0.36397   0.090928  -4.003 6.260e-05
## mgPerDay    -0.01418   0.001565  -9.064 1.261e-19

coef(summary(glm(stroke == "Y" ~ mgPerDay + sick, data = Ex, family = "binomial")))

##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -1.76333    0.14325 -12.309 8.068e-35
## mgPerDay    -0.03593    0.00373  -9.633 5.816e-22
## sick         0.69075    0.05004  13.805 2.373e-43

The coefficient changes. In a linear model, it wouldn't change this much. But nonlinear models are more complicated.

Is there evidence for an interaction?

coef(summary(glm(stroke == "Y" ~ mgPerDay * sick, data = Ex, family = "binomial")))

##                 Estimate Std. Error  z value  Pr(>|z|)
## (Intercept)   -1.7323146   0.152285 -11.3755 5.541e-30
## mgPerDay      -0.0386253   0.006298  -6.1327 8.638e-10
## sick           0.6663677   0.064799  10.2837 8.348e-25
## mgPerDay:sick  0.0005739   0.001028   0.5582 5.767e-01

Nah.

The experiment is pretty clean. We could have used many fewer people: smaller \( n \).

Intent to Treat

In reality, we're not going to have perfect compliance with our instructions. Some people given a placebo will take aspirin anyways. Some people won't take their aspirin.

To simulate this, we'll tell the program to add in an influence without severing the connection from sick to mgPerDay.

experiment.size = 200
influence = resample(c(30, 70), experiment.size)
Ex2 = run.sim(aspirin, experiment.size, mgPerDay = influence, inject = TRUE)

Show that with inject=TRUE, the values of mgPerDay reflect both the input and the variable sick

Now build the model:

coef(summary(glm(stroke == "Y" ~ mgPerDay, data = Ex2, family = "binomial")))

##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.00171   0.556046  -1.801  0.07163
## mgPerDay    -0.01034   0.005946  -1.739  0.08199

coef(summary(glm(stroke == "Y" ~ mgPerDay + sick, data = Ex2, family = "binomial")))

##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept)  -2.3975    0.98430  -2.436 1.486e-02
## mgPerDay     -0.0357    0.01021  -3.498 4.683e-04
## sick          0.7052    0.13079   5.392 6.967e-08

The result is still ambiguous. (A larger experiment would fix this)

Intent to treat means to use the assigned values rather than the measured values.

coef(summary(glm(stroke == "Y" ~ influence, data = Ex2, family = "binomial")))

##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept)  0.33223    0.60902  0.5455 0.5854037
## influence   -0.05411    0.01585 -3.4143 0.0006395

coef(summary(glm(stroke == "Y" ~ influence + sick, data = Ex2, family = "binomial")))

##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -2.74767    1.04113  -2.639 8.312e-03
## influence   -0.05859    0.01813  -3.232 1.228e-03
## sick         0.59521    0.11949   4.981 6.312e-07

This idea of using intent rather than the actual treatment is counter-intuitive. But it's easily understood in terms of the causal diagrams. There's no back-door pathway from intent to stroke.

A Cultural-Literacy Guide to Statistical Practice

When you go to a different country, it helps to learn some of the basics of the culture: where it's polite to slurp your soup, how many times to kiss when greeting, whether to shake your head to say “yes”.

In this course, you've been learning a particular approach to statistical reasoning based on the idea of fitting models. It's a very general and powerful approach. But it's not the approach that most people learn. You need to be aware of this when working with others: they will find incomprehensible some of the language that you use fluently. In particular:

“covariate” is not widely understood
“linear regression”, or just “regression”, is used to mean modeling one continuous variable by another — the simple straight-line model A~B. There is only one explanatory variable in “linear regression”; “multiple regression” is used to describe the technique that we have been using with lm() and multiple variables.
“interaction” is generally unknown.
The basic technique, a t-test, is about comparing the means of two groups. We've done this with linear regression, but people will be confused because they were taught that linear regression is something utterly different from a t-test.
“ANOVA” refers to looking at the differences among several group means, the generalization of the t-test.
“Bootstrapping”, “resampling”, and “randomization” (what we've been doing with resample() and shuffle()) are obscure terms, not typically understood.
There are many, many named tests that get applied in different circumstances.
- one-sample t-test: comparing a group mean to a fixed null value
  - For us: summary( lm( A-null ~ 1) )
- two-sample t-test: the mean of two groups
  - For us: summary( lm( A~G) )
- paired t-test: a t-test applied to naturally paired (e.g. before and after) data
  - For us: anova( lm(A~when + pairID) )
- two sample p-test: comparing two proportions.
  - For us: summary( lm( A=="Yes" ~ G) )
- equal variance two-sample t-test: comparing the means of two groups
  - No exact equivalent for us, but then the test is not really useful for much anyways.
- one-way ANOVA: comparing the means of several groups
  - For us: anova( lm( A~G) )
- rank sum test, spearman rank correlation, and other “non-parametric tests”
  - For us: take the rank() before fitting the model
- The chi-squared test.
  - For us: logistic regression or poisson regression

The \( \chi^2 \) test (“chi squared”)

Chi-squared is a much beloved test, one of the first statistical tests. There's also a chi-squared distribution, which is analogous to an F distribution or a t distribution and has, as a parameter, “degrees of freedom”. It is associated with Karl Pearson the dean of statistics at the turn of the 19th-20th century.

The \( \chi^2 \) test is used for two distinct but related purposes:

Testing for “independence” of two variables. Our analogous term is “interaction”.
Comparing an observed set of frequencies to a model probability.

Example: In genetics, di-hybrid cross involves mating two organisms, one with a dominant trait and one with a recessive trait. Their offspring are then mated, producing 4 phenotypes in the second generation. The second generation has, theoretically, a rate of expression of 9:3:3:1. For instance, in an agricultural experiment, 1301 plants have been grown from a dihybrid cross and the four phenotypes appear like this:

Phenotype	Observed	Expected in Theory
Green	773	731.9
Golden	231	243.9
Green-striped	238	243.9
Golden-green-striped	59	81.3
TOTAL	1301	1301

It's pretty obvious which phenotype corresponds to the “9” and which to the “1”. The question is whether the observation is consistent with the theory.

This example is drawn from Snedecor and Cochran (1989), Statistical Methods 8th ed. p 196. It draws on work done by Lindstrom (1918). That the example dates from 1918 and is still being used in 1989 says something about the pedigree of the \( \chi^2 \) test. Here are the data in a simple R form

LindstromObs = c(773, 231, 238, 59)
LindstromExp = 1301 * c(9, 3, 3, 1)/16

In this class, we've focussed on fitting models to observed data with the goal of relating explanatory variables to a response. In this dihybrid problem, however, the emphasis is different: we have a model and we want to know if the data are consistent with it.

We've already seen such questions in the context of the Null Hypothesis. So let's let the theory be our null hypothesis and ask if the data are consistent with the null. To do this, we need 1. a test statistic 2. the distribution of the test statistic under the Null.

An obvious test statistic to use is the sum of square residuals. Here's a program to calculate that:

ssResids = function(observed, expected) sum((observed - expected)^2)

Our test statistic:

ts = ssResids(observed = LindstromObs, expected = LindstromExp)
ts

## [1] 2397

To carry out a hypothesis test, we need to be able to generate samples from the Null Hypothesis. Since we know the rate at which each phenotype is being generated, we can use the poisson random number generator.

counts = rpois(4, lambda = LindstromExp)
counts

## [1] 732 264 287  75

This method is good, but it's not necessarily consistent with the total number of observations, 1301. So, a trick:

makeSample = function(expected) {
    n = sum(expected)
    table(resample(1:length(expected), prob = expected/n, size = round(n)))
}

Now, the sampling distribution under the Null Hypothesis:

s = do(1000) * ssResids(observed = makeSample(LindstromExp), expected = LindstromExp)

And the p-value:

mean(result > ts, data = s)

## [1] 0.034

The observed data are a little bit surprising.

As it happens, the test statistic that we developed historically is a bit different:

chisq.stat = function(observed, expected) {
    sum((observed - expected)^2/expected)
}

Running the hypothesis test. Here, the sampling distribution under the null:

s2 = do(1000) * chisq.stat(observed = makeSample(LindstromExp), expected = LindstromExp)
t2 = chisq.stat(observed = LindstromObs, expected = LindstromExp)
mean(result > t2, data = s2)

## [1] 0.029

The p-value is lower than before, in fact below 0.05. (Remember, there's nothing magical about 0.05. It's just a convention.) Quoting Snedecor: “Lindstrom commented that the deviations [from Mendelian genetic theory] could be largely explained by a physiological cause, namely the weakened condition of the last three classes due to their chlorophyll abnormality. He pointed out in particular that the last class (golden-green-striped) was not very vigorous.”

The change in the p-value occurred because the division by expected in the \( \chi^2 \) statistic means that equal weight is being put on each of the 4 categories. In the simple sum of squares, equal weight is put on each of the cases.

There's a reason to prefer the \( \chi^2 \) statistic to the simple sum of square residuals. The reason is that we actually know what will be the variance of a poisson random variable.

ACTIVITY: Generate a large poisson sample of a given \( \lambda \). Find the standard deviation. Try this for several different \( \lambda \), say \( \lambda=25,100,400,1600,6400 \).

How is the standard deviation related to \( \lambda \)?
How is the variance related to \( \lambda \)?

By dividing (observed-expected)² by expected, we are effectively creating a z² variable, the square of a normal random variable with mean=0 and sd=1.

There was some controversy in the early days about what the distribution of the sum of \( k \) z² variables should be. It was called the \( \chi^2 \) distribution, but it was observed that adding up three z² variables was closer to what was observed for 4 categories in the chi-squared test than adding up four z² variables.

Of course, making such a finding required that there be some definition of “closer to”. This involved some argument. Remember, in 1900, calculations were difficult. But now it's easy to see that three is better than four. For example, when the distributions match, plotting one versus another (in sorted order) should give a straight line along the diagonal.

z3 = do(1000) * sum(rnorm(3)^2)
z4 = do(1000) * sum(rnorm(4)^2)
plotPoints(sort(z3$result) ~ sort(s2$result))
plotFun(x ~ x, add = TRUE, col = "red")

plot of chunk unnamed-chunk-30

plotPoints(sort(z4$result) ~ sort(s2$result))
plotFun(x ~ x, add = TRUE, col = "red")

plot of chunk unnamed-chunk-30

mean(result, data = z3)

## [1] 2.956

mean(result, data = z4)

## [1] 3.961

mean(result, data = s2)

## [1] 2.969

What's going on here is that the four observations had to add up to 515. In other words, the residuals must add up to zero. This is very much what happens when we fit a model: the residuals must add up to zero because we've fitted the model. In this case, we've fitted the expectation so that it must add up to 515. This requirement takes one degree of freedom out of the expected number. Historically, thus was the concept of degrees of freedom born.

In the \( \chi^2 \) test you can see the origins of many of the concepts that have shaped our approach to statistical modeling: residuals, sum of square residuals as a figure of merit, the sampling distribution under a null hypothesis, degrees of freedom.

The \( \chi^2 \) test for independence

An example from Snedecor (1989, p. 202) looks at 196 patients classified according to change in health of leprosy patients and degree of skin damage (called “infiltration”).

Degree of Infiltration	Marked Improvement	Moderate	Slight	Stationary	Worse	TOTAL
Little	11	27	42	53	11	144
Much	7	15	16	13	1	52
TOTAL	18	42	58	66	12	196

In our modeling framework, we would likely see whether there is a relationship between “Degree of Infiltration” and Improvement by making a logistic regression model with “Degree of Infiltration” as the response. The \( \chi^2 \) test constructs a null hypothesis probability model by assuming that the two variables are independent and then tests the chi-square statistic against that model. Not much different.

Absent a modeling technique for multi-level response variables, we don't have an exact analog of the chi-squared test using modeling.

Note that the appropriate test here is actually “Fisher's Exact Test”, which avoids an approximation made in the \( \chi^2 \) test.

Models of Counts

Here's a simulation example with a continuous predictor from UCLA. The response is the number of awards given to students. The explanatory variables are the score on a math exam and the type of program in which the student was enrolled. The case is an individual student.

p <- read.csv("http://www.ats.ucla.edu/stat/data/poisson_sim.csv")
p <- within(p, {
    prog <- factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
    id <- factor(id)
})
summary(p)

##        id        num_awards           prog          math     
##  1      :  1   Min.   :0.00   General   : 45   Min.   :33.0  
##  2      :  1   1st Qu.:0.00   Academic  :105   1st Qu.:45.0  
##  3      :  1   Median :0.00   Vocational: 50   Median :52.0  
##  4      :  1   Mean   :0.63                    Mean   :52.6  
##  5      :  1   3rd Qu.:1.00                    3rd Qu.:59.0  
##  6      :  1   Max.   :6.00                    Max.   :75.0  
##  (Other):194

And the analysis:

m1 = glm(num_awards ~ prog + math, family = "poisson", data = p)
summary(m1)

## 
## Call:
## glm(formula = num_awards ~ prog + math, family = "poisson", data = p)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.204  -0.844  -0.511   0.256   2.680  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -5.2471     0.6585   -7.97  1.6e-15 ***
## progAcademic     1.0839     0.3583    3.03   0.0025 ** 
## progVocational   0.3698     0.4411    0.84   0.4018    
## math             0.0702     0.0106    6.62  3.6e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 287.67  on 199  degrees of freedom
## Residual deviance: 189.45  on 196  degrees of freedom
## AIC: 373.5
## 
## Number of Fisher Scoring iterations: 6

m2 = glm(num_awards ~ prog * math, family = "poisson", data = p)
summary(m2)

## 
## Call:
## glm(formula = num_awards ~ prog * math, family = "poisson", data = p)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.229  -0.796  -0.530   0.253   2.683  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)          -3.8618     2.4932   -1.55     0.12
## progAcademic         -0.4411     2.6030   -0.17     0.87
## progVocational       -0.8447     2.8699   -0.29     0.77
## math                  0.0440     0.0472    0.93     0.35
## progAcademic:math     0.0284     0.0487    0.58     0.56
## progVocational:math   0.0229     0.0542    0.42     0.67
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 287.67  on 199  degrees of freedom
## Residual deviance: 189.10  on 194  degrees of freedom
## AIC: 377.2
## 
## Number of Fisher Scoring iterations: 6

No sign of an interaction.

This corresponds to the interaction term.

Do an example with two categorical variables.

Work through the degrees of freedom.

On to the Future in Statistics

The techniques that we've studied in this course were born at the end of the 19th century, had their adolescence in the early part of the 20th century. They were developed originally for the analysis of laboratory experiments, but quickly were adopted to the analysis of complex observational studies.

The ideas behind resampling and randomization emerged in the early part of the 20th century, but they were impractical until electronic computers were invented.

Only now, in the 2010s are the computational ideas becoming mainstream in statistical education, so hardly anyone older than you will know about them.

Where things are heading:

Big data, where we try to glean patterns from large number of cases in observational data.
- Example: You've watched two movies on Netflix. Other people have watched these movies, too. Can we make a good recommendation of movies for you based on the characteristics of the other viewers.
- Example: Your credit card has just been used at a gas station. Might it have been stolen?
Little \( n \) big \( m \). Microarray data, where you may have \( n=10 \) samples with \( m=40000 \) variables.
Bayesian computation. The Bayesian approach to probability is, to many people, more compelling than the frequentist/hypothesis-testing approach. The computations required are difficult. But they are becoming possible.