# Stats 155 Class Notes 2012-12-10

## A Cultural-Literacy Guide to Statistical Practice

When you go to a different country, it helps to learn some of the basics of the culture: where it's polite to slurp your soup, how many times to kiss when greeting, whether to shake your head to say “yes”.

In this course, you've been learning a particular approach to statistical reasoning based on the idea of fitting models. It's a very general and powerful approach. But it's not the approach that most people learn. You need to be aware of this when working with others: they will find incomprehensible some of the language that you use fluently. In particular:

• “covariate” is not widely understood
• “linear regression”, or just “regression”, is used to mean modeling one continuous variable by another — the simple straight-line model A~B. There is only one explanatory variable in “linear regression”; “multiple regression” is used to describe the technique that we have been using with lm() and multiple variables.
• “interaction” is generally unknown.
• The basic technique, a t-test, is about comparing the means of two groups. We've done this with linear regression, but people will be confused because they were taught that linear regression is something utterly different from a t-test.
• “ANOVA” refers to looking at the differences among several group means, the generalization of the t-test.
• “Bootstrapping”, “resampling”, and “randomization” (what we've been doing with resample() and shuffle()) are obscure terms, not typically understood.
• There are many, many named tests that get applied in different circumstances.
• one-sample t-test: comparing a group mean to a fixed null value
• For us: summary( lm( A-null ~ 1) )
• two-sample t-test: the mean of two groups
• For us: summary( lm( A~G) )
• paired t-test: a t-test applied to naturally paired (e.g. before and after) data
• For us: anova( lm(A~when + pairID) )
• two sample p-test: comparing two proportions.
• For us: summary( lm( A=="Yes" ~ G) )
• equal variance two-sample t-test: comparing the means of two groups
• No exact equivalent for us, but then the test is not really useful for much anyways.
• one-way ANOVA: comparing the means of several groups
• For us: anova( lm( A~G) )
• rank sum test, spearman rank correlation, and other “non-parametric tests”
• For us: take the rank() before fitting the model
• The chi-squared test.
• For us: logistic regression or poisson regression

### The $$\chi^2$$ test (“chi squared”)

Chi-squared is a much beloved test, one of the first statistical tests. There's also a chi-squared distribution, which is analogous to an F distribution or a t distribution and has, as a parameter, “degrees of freedom”. It is associated with Karl Pearson the dean of statistics at the turn of the 19th-20th century.

The $$\chi^2$$ test is used for two distinct but related purposes:

• Testing for “independence” of two variables. Our analogous term is “interaction”.
• Comparing an observed set of frequencies to a model probability.

Example: In genetics, di-hybrid cross involves mating two organisms, one with a dominant trait and one with a recessive trait. Their offspring are then mated, producing 4 phenotypes in the second generation. The second generation has, theoretically, a rate of expression of 9:3:3:1. For instance, in an agricultural experiment, 1301 plants have been grown from a dihybrid cross and the four phenotypes appear like this:

Phenotype Observed Expected in Theory
Green 773 731.9
Golden 231 243.9
Green-striped 238 243.9
Golden-green-striped 59 81.3
TOTAL 1301 1301

It's pretty obvious which phenotype corresponds to the “9” and which to the “1”. The question is whether the observation is consistent with the theory.

This example is drawn from Snedecor and Cochran (1989), Statistical Methods 8th ed. p 196. It draws on work done by Lindstrom (1918). That the example dates from 1918 and is still being used in 1989 says something about the pedigree of the $$\chi^2$$ test. Here are the data in a simple R form

LindstromObs = c(773, 231, 238, 59)
LindstromExp = 1301 * c(9, 3, 3, 1)/16


In this class, we've focussed on fitting models to observed data with the goal of relating explanatory variables to a response. In this dihybrid problem, however, the emphasis is different: we have a model and we want to know if the data are consistent with it.

We've already seen such questions in the context of the Null Hypothesis. So let's let the theory be our null hypothesis and ask if the data are consistent with the null. To do this, we need 1. a test statistic 2. the distribution of the test statistic under the Null.

An obvious test statistic to use is the sum of square residuals. Here's a program to calculate that:

ssResids = function(observed, expected) sum((observed - expected)^2)


Our test statistic:

ts = ssResids(observed = LindstromObs, expected = LindstromExp)
ts

## [1] 2397


To carry out a hypothesis test, we need to be able to generate samples from the Null Hypothesis. Since we know the rate at which each phenotype is being generated, we can use the poisson random number generator.

counts = rpois(4, lambda = LindstromExp)
counts

## [1] 739 229 248  70


This method is good, but it's not necessarily consistent with the total number of observations, 1301. So, a trick:

makeSample = function(expected) {
n = sum(expected)
table(resample(1:length(expected), prob = expected/n, size = round(n)))
}


Now, the sampling distribution under the Null Hypothesis:

s = do(1000) * ssResids(observed = makeSample(LindstromExp), expected = LindstromExp)


And the p-value:

mean(result > ts, data = s)

## [1] 0.051


The observed data are a little bit surprising.

As it happens, the test statistic that we developed historically is a bit different:

chisq.stat = function(observed, expected) {
sum((observed - expected)^2/expected)
}


Running the hypothesis test. Here, the sampling distribution under the null:

s2 = do(1000) * chisq.stat(observed = makeSample(LindstromExp), expected = LindstromExp)
t2 = chisq.stat(observed = LindstromObs, expected = LindstromExp)
mean(result > t2, data = s2)

## [1] 0.031


The p-value is lower than before, in fact below 0.05. (Remember, there's nothing magical about 0.05. It's just a convention.) Quoting Snedecor: “Lindstrom commented that the deviations [from Mendelian genetic theory] could be largely explained by a physiological cause, namely the weakened condition of the last three classes due to their chlorophyll abnormality. He pointed out in particular that the last class (golden-green-striped) was not very vigorous.”

The change in the p-value occurred because the division by expected in the $$\chi^2$$ statistic means that equal weight is being put on each of the 4 categories. In the simple sum of squares, equal weight is put on each of the cases.

There's a reason to prefer the $$\chi^2$$ statistic to the simple sum of square residuals. The reason is that we actually know what will be the variance of a poisson random variable.

ACTIVITY: Generate a large poisson sample of a given $$\lambda$$. Find the standard deviation. Try this for several different $$\lambda$$, say $$\lambda=25,100,400,1600,6400$$.

• How is the standard deviation related to $$\lambda$$?
• How is the variance related to $$\lambda$$?

By dividing (observed-expected)2 by expected, we are effectively creating a z2 variable, the square of a normal random variable with mean=0 and sd=1.

There was some controversy in the early days about what the distribution of the sum of $$k$$ z2 variables should be. It was called the $$\chi^2$$ distribution, but it was observed that adding up three z2 variables was closer to what was observed for 4 categories in the chi-squared test than adding up four z2 variables.

Of course, making such a finding required that there be some definition of “closer to”. This involved some argument. Remember, in 1900, calculations were difficult. But now it's easy to see that three is better than four. For example, when the distributions match, plotting one versus another (in sorted order) should give a straight line along the diagonal.

z3 = do(1000) * sum(rnorm(3)^2)
z4 = do(1000) * sum(rnorm(4)^2)
plotPoints(sort(z3$result) ~ sort(s2$result))
plotFun(x ~ x, add = TRUE, col = "red")


plotPoints(sort(z4$result) ~ sort(s2$result))
plotFun(x ~ x, add = TRUE, col = "red")


mean(result, data = z3)

## [1] 2.971

mean(result, data = z4)

## [1] 4.009

mean(result, data = s2)

## [1] 3.207


What's going on here is that the four observations had to add up to 515. In other words, the residuals must add up to zero. This is very much what happens when we fit a model: the residuals must add up to zero because we've fitted the model. In this case, we've fitted the expectation so that it must add up to 515. This requirement takes one degree of freedom out of the expected number. Historically, thus was the concept of degrees of freedom born.

In the $$\chi^2$$ test you can see the origins of many of the concepts that have shaped our approach to statistical modeling: residuals, sum of square residuals as a figure of merit, the sampling distribution under a null hypothesis, degrees of freedom.

### The $$\chi^2$$ test for independence

An example from Snedecor (1989, p. 202) looks at 196 patients classified according to change in health of leprosy patients and degree of skin damage (called “infiltration”).

Degree of Infiltration Marked Improvement Moderate Slight Stationary Worse TOTAL
Little 11 27 42 53 11 144
Much 7 15 16 13 1 52
TOTAL 18 42 58 66 12 196

In our modeling framework, we would likely see whether there is a relationship between “Degree of Infiltration” and Improvement by making a logistic regression model with “Degree of Infiltration” as the response. The $$\chi^2$$ test constructs a null hypothesis probability model by assuming that the two variables are independent and then tests the chi-square statistic against that model. Not much different.

Absent a modeling technique for multi-level response variables, we don't have an exact analog of the chi-squared test using modeling.

Note that the appropriate test here is actually “Fisher's Exact Test”, which avoids an approximation made in the $$\chi^2$$ test.

#### Models of Counts

Here's a simulation example with a continuous predictor from UCLA. The response is the number of awards given to students. The explanatory variables are the score on a math exam and the type of program in which the student was enrolled. The case is an individual student.

p <- read.csv("http://www.ats.ucla.edu/stat/data/poisson_sim.csv")
p <- within(p, {
prog <- factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
id <- factor(id)
})
summary(p)

##        id        num_awards           prog          math
##  1      :  1   Min.   :0.00   General   : 45   Min.   :33.0
##  2      :  1   1st Qu.:0.00   Academic  :105   1st Qu.:45.0
##  3      :  1   Median :0.00   Vocational: 50   Median :52.0
##  4      :  1   Mean   :0.63                    Mean   :52.6
##  5      :  1   3rd Qu.:1.00                    3rd Qu.:59.0
##  6      :  1   Max.   :6.00                    Max.   :75.0
##  (Other):194


And the analysis:

m1 = glm(num_awards ~ prog + math, family = "poisson", data = p)
summary(m1)

##
## Call:
## glm(formula = num_awards ~ prog + math, family = "poisson", data = p)
##
## Deviance Residuals:
##    Min      1Q  Median      3Q     Max
## -2.204  -0.844  -0.511   0.256   2.680
##
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -5.2471     0.6585   -7.97  1.6e-15 ***
## progAcademic     1.0839     0.3583    3.03   0.0025 **
## progVocational   0.3698     0.4411    0.84   0.4018
## math             0.0702     0.0106    6.62  3.6e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
##     Null deviance: 287.67  on 199  degrees of freedom
## Residual deviance: 189.45  on 196  degrees of freedom
## AIC: 373.5
##
## Number of Fisher Scoring iterations: 6

m2 = glm(num_awards ~ prog * math, family = "poisson", data = p)
summary(m2)

##
## Call:
## glm(formula = num_awards ~ prog * math, family = "poisson", data = p)
##
## Deviance Residuals:
##    Min      1Q  Median      3Q     Max
## -2.229  -0.796  -0.530   0.253   2.683
##
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)          -3.8618     2.4932   -1.55     0.12
## progAcademic         -0.4411     2.6030   -0.17     0.87
## progVocational       -0.8447     2.8699   -0.29     0.77
## math                  0.0440     0.0472    0.93     0.35
## progAcademic:math     0.0284     0.0487    0.58     0.56
## progVocational:math   0.0229     0.0542    0.42     0.67
##
## (Dispersion parameter for poisson family taken to be 1)
##
##     Null deviance: 287.67  on 199  degrees of freedom
## Residual deviance: 189.10  on 194  degrees of freedom
## AIC: 377.2
##
## Number of Fisher Scoring iterations: 6


No sign of an interaction.

This corresponds to the interaction term.

Do an example with two categorical variables.

Work through the degrees of freedom.

## On to the Future in Statistics

The techniques that we've studied in this course were born at the end of the 19th century, had their adolescence in the early part of the 20th century. They were developed originally for the analysis of laboratory experiments, but quickly were adopted to the analysis of complex observational studies.

The ideas behind resampling and randomization emerged in the early part of the 20th century, but they were impractical until electronic computers were invented.

Only now, in the 2010s are the computational ideas becoming mainstream in statistical education, so hardly anyone older than you will know about them.

Where things are heading:

• Big data, where we try to glean patterns from large number of cases in observational data.
• Example: You've watched two movies on Netflix. Other people have watched these movies, too. Can we make a good recommendation of movies for you based on the characteristics of the other viewers.
• Example: Your credit card has just been used at a gas station. Might it have been stolen?
• Little $$n$$ big $$m$$. Microarray data, where you may have $$n=10$$ samples with $$m=40000$$ variables.
• Bayesian computation. The Bayesian approach to probability is, to many people, more compelling than the frequentist/hypothesis-testing approach. The computations required are difficult. But they are becoming possible.