If you were fortunate to learn statistics using Andy Field’s excellent Discovering Statistics Using R/SPSS/SAS series, then you are already well aware that almost every test in classical (parametric) statistics education is now understood to be part of the general linear model or GLM. It turns out that focussing on the deeper GLM lurking everywhere gives us a convenient way to extend all these classical tests to the repeated measures case. For \(t\)-tests and ANOVA, the equivalent model is indeed just a linear mixed-effects model, so I’ll leave that alone for now, but you this follows from all the pieces:
For the other models and tests, it makes sense to start with the fixed-effects (i.e. not mixed) models until we’re used to dealing with them. The mixed-effects models are a relatively straightforward extension, so we’ll just leave them alone for now. A lot of the frequenstist examples here were inspired by Bayesian examples from Rasmus Bååth and his Bayesian First Aid package. It’s good work, but I would prefer estimation to testing, so I have mixed feelings about displaying the models as tests.
Beyond the “estimating is better than testing” perspective and the difficulties of finding repeated-measures equivalents for a lot of these tests, there is anohter advantage: the explicit GLM-based modeling strategy allows for continuous predictors, while many classical tests do not.
For a lot of the examples today, we’ll be using the survey data from the MASS package.
library(MASS)
library(car)
Loading required package: carData
# use ANOVA-style contrasts
options(contrasts=c("contr.Sum", "contr.poly"))
library(tidyverse)
── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1 ✔ purrr 0.2.5
✔ tibble 1.4.2 ✔ dplyr 0.7.5
✔ tidyr 0.8.1 ✔ stringr 1.3.1
✔ readr 1.1.1 ✔ forcats 0.3.0
── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ dplyr::recode() masks car::recode()
✖ dplyr::select() masks MASS::select()
✖ purrr::some() masks car::some()
?survey
| survey | R Documentation |
This data frame contains the responses of 237 Statistics I students at the University of Adelaide to a number of questions.
survey
The components of the data frame are:
Sex
The sex of the student. (Factor with levels “Male” and “Female”.)
Wr.Hnd
span (distance from tip of thumb to tip of little finger of spread hand) of writing hand, in centimetres.
NW.Hnd
span of non-writing hand.
W.Hnd
writing hand of student. (Factor, with levels “Left” and “Right”.)
Fold
“Fold your arms! Which is on top” (Factor, with levels “R on L”, “L on R”, “Neither”.)
Pulse
pulse rate of student (beats per minute).
Clap
‘Clap your hands! Which hand is on top?’ (Factor, with levels “Right”, “Left”, “Neither”.)
Exer
how often the student exercises. (Factor, with levels “Freq” (frequently), “Some”, “None”.)
Smoke
how much the student smokes. (Factor, levels “Heavy”, “Regul” (regularly), “Occas” (occasionally), “Never”.)
Height
height of the student in centimetres.
M.I
whether the student expressed height in imperial (feet/inches) or metric (centimetres/metres) units. (Factor, levels “Metric”, “Imperial”.)
Age
age of the student in years.
Venables, W. N. and Ripley, B. D. (1999) Modern Applied Statistics with S-PLUS. Third Edition. Springer.
For everything but the \(\chi^2\)-Test / Poisson model, there are similar (non repeated-measures) tests implemented in a Bayesian framework as part of Rasmus Bååth’s BayesianFirstAid package. His blog is generally worth a look, and if you’re joining us in a few weeks for the Bayes course, a must read. Similar thoughts in a more comprehensive format yet still very readable if you have the time can be found in McElreath’s book Statistical Rethinking. McElreath also gets more into the deeper relationship between the various distributions and thus tests – beyond all belonging to the exponential family, many of the distributions can be thought of as limiting cases of one another (e.g. the normal distribution is like a binomial distribution with an infinite number of trials). Some of these are hinted in the R documentation for the classical tests.
smoke_exercise <- table(Smoke=survey$Smoke, Exer=survey$Exer)
smoke_exercise.chisq <- chisq.test(smoke_exercise, correct=FALSE)
## Warning in chisq.test(smoke_exercise, correct = FALSE): Chi-squared
## approximation may be incorrect
print(smoke_exercise.chisq)
##
## Pearson's Chi-squared test
##
## data: smoke_exercise
## X-squared = 5.4885, df = 6, p-value = 0.4828
We are unable to reject the null hypothesis. But that doesn’t tell us much. The chisq.test actually gives a lot more information than most people realize.
The observed data:
smoke_exercise.chisq$observed
| Smoke/Exer | Freq | None | Some |
|---|---|---|---|
| Heavy | 7 | 1 | 3 |
| Never | 87 | 18 | 84 |
| Occas | 12 | 3 | 4 |
| Regul | 9 | 1 | 7 |
The expected results based on the null model:
smoke_exercise.chisq$expected
| Freq | None | Some | |
|---|---|---|---|
| Heavy | 5.360169 | 1.072034 | 4.567797 |
| Never | 92.097458 | 18.419491 | 78.483051 |
| Occas | 9.258475 | 1.851695 | 7.889831 |
| Regul | 8.283898 | 1.656780 | 7.059322 |
And the Pearson residuals ( (observed - expected) / sqrt(expected) )
smoke_exercise.chisq$residuals
| Smoke/Exer | Freq | None | Some |
|---|---|---|---|
| Heavy | 0.7082877 | -0.0695717 | -0.7335612 |
| Never | -0.5311654 | -0.0977427 | 0.6227461 |
| Occas | 0.9009954 | 0.8438642 | -1.3848312 |
| Regul | 0.2488040 | -0.5102551 | -0.0223272 |
Looking at the differences and the standardized residuals (which are somewhat analogous to \(z\)-scores), it’s no surprise that we can’t reject the null hypothesis.
But I’ve always found the \(\chi^2\)-test confusing for anything but the simplest designs – a significant result suggest that there’s a difference somewhere, but where? I find it much easier to reason in terms of which things actually make a difference and what that difference looks like. We can do that by using the GLM explicitly.
The Poisson distribution is often used for modelling count data. There are some issues with Poisson models for data with too many zeros (so-called Zero-Inflated Poissons or ZIP) or for data where the variance is too (overdispersion), but we’ll leave those issues alone for today.
The basic Poisson model for our data is easy to compute, but first we need to get our data in the right format:
dat <- as.data.frame(smoke_exercise)
print(dat)
## Smoke Exer Freq
## 1 Heavy Freq 7
## 2 Never Freq 87
## 3 Occas Freq 12
## 4 Regul Freq 9
## 5 Heavy None 1
## 6 Never None 18
## 7 Occas None 3
## 8 Regul None 1
## 9 Heavy Some 3
## 10 Never Some 84
## 11 Occas Some 4
## 12 Regul Some 7
Note that the the Freq column refers to the number of “matches” for a particular combination of smoker type Smoke and excercise regularity Exer and is not to be confused with the entry Freq in the Exer column, which refers to frequent exercise.
The model itself is for predicting how often we observe a particular combination of Smoke and Exer, so they form our predictors and Freq forms our prediction. We use the “over-parameterized” version of the model without an intercept so that each category is explicit and no category is “hidden” in the intercept.
smoke_exercise.glm <- glm(Freq ~ 0 + Smoke * Exer, data=dat, family=poisson())
summary(smoke_exercise.glm)
##
## Call:
## glm(formula = Freq ~ 0 + Smoke * Exer, family = poisson(), data = dat)
##
## Deviance Residuals:
## [1] 0 0 0 0 0 0 0 0 0 0 0 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## SmokeHeavy 1.01484 0.40499 2.506 0.012217 *
## SmokeNever 3.92903 0.09366 41.949 < 2e-16 ***
## SmokeOccas 1.65660 0.27217 6.087 1.15e-09 ***
## SmokeRegul 1.38104 0.37327 3.700 0.000216 ***
## Exer[S.Freq] 0.77811 0.17721 4.391 1.13e-05 ***
## Exer[S.None] -0.99813 0.27186 -3.672 0.000241 ***
## Smoke[S.Heavy]:Exer[S.Freq] 0.15296 0.37044 0.413 0.679662
## Smoke[S.Never]:Exer[S.Freq] -0.24123 0.19418 -1.242 0.214124
## Smoke[S.Occas]:Exer[S.Freq] 0.05020 0.28693 0.175 0.861128
## Smoke[S.Heavy]:Exer[S.None] -0.01671 0.56796 -0.029 0.976534
## Smoke[S.Never]:Exer[S.None] -0.04053 0.29589 -0.137 0.891060
## Smoke[S.Occas]:Exer[S.None] 0.44014 0.40804 1.079 0.280735
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1.3554e+03 on 12 degrees of freedom
## Residual deviance: -1.2212e-14 on 0 degrees of freedom
## AIC: 70.569
##
## Number of Fisher Scoring iterations: 3
We also compute a null model. The relevant null here is “no interaction” (it is after all the \(\chi^2\) test for independence, i.e. not-interacting).
smoke_exercise.null <- glm(Freq ~ 0 + Smoke + Exer, data=dat, family=poisson())
summary(smoke_exercise.null)
##
## Call:
## glm(formula = Freq ~ 0 + Smoke + Exer, family = poisson(), data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.53148 -0.53993 -0.04637 0.63077 0.86126
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## SmokeHeavy 1.08919 0.30595 3.560 0.000371 ***
## SmokeNever 3.93305 0.08936 44.015 < 2e-16 ***
## SmokeOccas 1.63574 0.23521 6.954 3.54e-12 ***
## SmokeRegul 1.52451 0.24803 6.147 7.92e-10 ***
## Exer[S.Freq] 0.58980 0.09914 5.949 2.70e-09 ***
## Exer[S.None] -1.01964 0.14637 -6.966 3.25e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1355.4456 on 12 degrees of freedom
## Residual deviance: 5.8015 on 6 degrees of freedom
## AIC: 64.37
##
## Number of Fisher Scoring iterations: 4
The \(\chi^2\) for independence is then equal to the likelihood test for whether the interaction model fits the data better than the null model:
anova(smoke_exercise.null, smoke_exercise.glm, test="LRT")
| Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|
| 6 | 5.801467 | NA | NA | NA |
| 0 | 0.000000 | 6 | 5.801467 | 0.4457935 |
The \(p\)-values here aren’t identical to ones above, but they’re very close. (The deviance differs slightly.) And it’s no coincendence that the \(\chi^2\) statistic occurs in both the likelihood-ratio test and the test of (marginal) independence!
The car package provides a convenience function for computing this type of model.
Anova(smoke_exercise.glm, test.statistic="LR", type = 2)
| LR Chisq | Df | Pr(>Chisq) | |
|---|---|---|---|
| Smoke | 643.052690 | 4 | 0.0000000 |
| Exer | 73.837138 | 2 | 0.0000000 |
| Smoke:Exer | 5.801467 | 6 | 0.4457935 |
We can also predict from the null model:
dat$h0 <- predict(smoke_exercise.null, dat, type="response")
xtabs(h0 ~ Smoke + Exer, dat)
| Smoke/Exer | Freq | None | Some |
|---|---|---|---|
| Heavy | 5.360169 | 1.072034 | 4.567797 |
| Never | 92.097458 | 18.419491 | 78.483051 |
| Occas | 9.258475 | 1.851695 | 7.889831 |
| Regul | 8.283898 | 1.656780 | 7.059322 |
And from the full model:
dat$h1 <- predict(smoke_exercise.glm, dat, type="response")
xtabs(h1 ~ Smoke + Exer, dat)
| Smoke/Exer | Freq | None | Some |
|---|---|---|---|
| Heavy | 7 | 1 | 3 |
| Never | 87 | 18 | 84 |
| Occas | 12 | 3 | 4 |
| Regul | 9 | 1 | 7 |
This model nails our original data nearly perfectly. Now the reason why this isn’t significantly better becomes apparent when we look at the standard errors:
dat$h1.se <- predict(smoke_exercise.glm, dat, type="response", se.fit = TRUE)$se.fit
xtabs(h1.se ~ Smoke + Exer, dat)
| Smoke/Exer | Freq | None | Some |
|---|---|---|---|
| Heavy | 2.645751 | 0.9999952 | 1.732051 |
| Never | 9.327379 | 4.2426407 | 9.165151 |
| Occas | 3.464102 | 1.7320507 | 2.000000 |
| Regul | 3.000000 | 0.9999952 | 2.645751 |
By the way, we can see that our null model is indeed similar to the classical test’s null model:
xtabs(h0 ~ Smoke + Exer, dat) - smoke_exercise.chisq$expected
| Smoke/Exer | Freq | None | Some |
|---|---|---|---|
| Heavy | 0 | 0 | 0 |
| Never | 0 | 0 | 0 |
| Occas | 0 | 0 | 0 |
| Regul | 0 | 0 | 0 |
| That’s a diff | erence | of one | tenth of one billionth of an occurrence at its greatest. |
And the same holds for the standard errors:
dat$h0.se <- predict(smoke_exercise.glm, dat, type="response", se.fit = TRUE)$se.fit
xtabs(h0.se ~ Smoke + Exer, dat) - smoke_exercise.chisq$stdres
| Smoke/Exer | Freq | None | Some |
|---|---|---|---|
| Heavy | 1.632685 | 1.0749956 | 2.714517 |
| Never | 10.989644 | 4.4731867 | 7.340270 |
| Occas | 2.151865 | 0.8057224 | 3.888598 |
| Regul | 2.639293 | 1.5575498 | 2.676061 |
And of course, we can visualize all of this:
ggplot(dat,
aes(x=Smoke,color=Exer,y=h1,ymin=h1-2*h1.se,ymax=h1+2*h1.se)) +
geom_point(aes(y=Freq, shape="Data")) +
geom_point(aes(y=h1, shape="Model")) +
geom_errorbar() +
geom_line(aes(group=Exer)) +
labs(x="Smoking habits", y="Number of Respondents", color="Exercise habits",
shape="Point estimate",
title="Poisson model investigating the independence of smoking and exercise habits",
subtitle="Error bars represent 95% Wald prediction intervals") +
theme_light()
The binomial test is often used for comparing the number of “succeses” (hits, ‘yes’ answers, etc.) to “failures”, where the null hypothesis that the probability of success is 0.5 (by default; you can change this manually). In R, it can be computed as number of success and total number:
binom.test(50, 100)
##
## Exact binomial test
##
## data: 50 and 100
## number of successes = 50, number of trials = 100, p-value = 1
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.3983211 0.6016789
## sample estimates:
## probability of success
## 0.5
Or as number of success vs. number of failures:
binom.test(c(50,50))
##
## Exact binomial test
##
## data: c(50, 50)
## number of successes = 50, number of trials = 100, p-value = 1
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.3983211 0.6016789
## sample estimates:
## probability of success
## 0.5
Note the very subtle difference in the input!
For a more complex example, we can examine handedness.
summary(survey$W.Hnd)
## Left Right NA's
## 18 218 1
to be more interesting, let’s set our null hypothesis to “probability of being right-handed is 99%” – i.e. that lefties are exceptionally rare. The classical test gives us:
hand.binom <- binom.test(c(218,18),p=.99)
hand.binom
##
## Exact binomial test
##
## data: c(218, 18)
## number of successes = 218, number of trials = 236, p-value =
## 5.234e-11
## alternative hypothesis: true probability of success is not equal to 0.99
## 95 percent confidence interval:
## 0.8821360 0.9541722
## sample estimates:
## probability of success
## 0.9237288
For the exact equivalent to the classical binomial test, we only care what the probability of the Bernoulli trial is and whether it differs from our null hypothesis. This corresponds to an intercept only binomial model. We’ll use the logit link function, but this method can be easily adapted for probit regression.
hand.glm <- glm(W.Hnd ~ 1, data=survey, family=binomial(link="logit"))
summary(hand.glm)
##
## Call:
## glm(formula = W.Hnd ~ 1, family = binomial(link = "logit"), data = survey)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2687 0.3983 0.3983 0.3983 0.3983
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.4941 0.2452 10.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 127.24 on 235 degrees of freedom
## Residual deviance: 127.24 on 235 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 129.24
##
## Number of Fisher Scoring iterations: 5
We can convert logistic coefficients to probabilities with the plogis function:
plogis(coef(hand.glm))
## (Intercept)
## 0.9237288
(If we just want odds-ratios, then simple exponentiation would suffice.)
This matched our estimate via the classical test. But how do we test against our null hypothesis of 99% probability? First we convert probability to log-odds:
logit.null <- qlogis(.99)
logit.null
## [1] 4.59512
Next, we use Wald confidence intevals to see if that value is included. confint()
confint(hand.glm)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## 2.043428 3.010507
It’s not! We can actually see that the previous p-value is the point where the null would be included (confidence intervals can after alll beformed by inverting tests):
confint(hand.glm,level=1-hand.binom$p.value)
## Waiting for profiling to be done...
## 0 % 100 %
## 1.162023 4.600165
(Some of the decimal digits are off here because it’s a finite sample.)
A generalization of the binomial test is the test of proportions. For the simplest case, it’s simply a different way of expressing the binomial test:
prop.test(218,n=236,p=.99,correct=FALSE)
## Warning in prop.test(218, n = 236, p = 0.99, correct = FALSE): Chi-squared
## approximation may be incorrect
##
## 1-sample proportions test without continuity correction
##
## data: 218 out of 236, null probability 0.99
## X-squared = 104.7, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.99
## 95 percent confidence interval:
## 0.8826712 0.9512130
## sample estimates:
## p
## 0.9237288
(We should probably use Yates’ correction here, but our sample size isn’t that small and the various small sample size corrections hide the asymptotic equivance of the tests. Nonetheless, the \(p\)-values differ here because proportion test is an asymptotic/approximate test, while the binomial test is an exact test. )
With prop.test, we can also express various proportions within multiple categories and compare them. For example, we can see if the proportion of lefties differs between the sexes:
hand.sex <- xtabs(~Sex + W.Hnd,data=survey)
hand.sex
| Sex/W.Hnd | Left | Right |
|---|---|---|
| Female | 7 | 110 |
| Male | 10 | 108 |
# swap the order so that "righties" are in the "success" column on the left
prop.test(hand.sex[, c("Right","Left")])
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: hand.sex[, c("Right", "Left")]
## X-squared = 0.23563, df = 1, p-value = 0.6274
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.04971453 0.09954794
## sample estimates:
## prop 1 prop 2
## 0.9401709 0.9152542
We can also express this as group-wise number of success and totals:
prop.test(c(110,108),n=c(117,118))
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: c(110, 108) out of c(117, 118)
## X-squared = 0.23563, df = 1, p-value = 0.6274
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.04971453 0.09954794
## sample estimates:
## prop 1 prop 2
## 0.9401709 0.9152542
So now, we have by-sex estimates of the proportions as well as an estimate of the difference. (The point estimate of the difference is implicit in the sample estimates and the 95% confidence interval of the difference.) Note that prop.test doesn’t allow you to specify what the proportion is for multiple groups – prop.test only tests for the equality of proportions between groups.
prop.test will even allow you to have an arbitrary number of categories as rows in your table. For example, we could consider the relationship of handedness to smoking:
hand.smoke <- xtabs(~Smoke + W.Hnd,data=survey)
print(hand.smoke)
## W.Hnd
## Smoke Left Right
## Heavy 1 10
## Never 13 175
## Occas 3 16
## Regul 1 16
prop.test(hand.smoke)
## Warning in prop.test(hand.smoke): Chi-squared approximation may be
## incorrect
##
## 4-sample test for equality of proportions without continuity
## correction
##
## data: hand.smoke
## X-squared = 2.0307, df = 3, p-value = 0.5661
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3 prop 4
## 0.09090909 0.06914894 0.15789474 0.05882353
However, for more than two-groups, prop.test doesn’t provide an estimate of the difference because there are multiple estimated differences.
Now, all this success and failure language suggest that we can express these models as binomial GLMs, and we can. For sex, we have:
hand.sex.binom <- glm(W.Hnd ~ 0 + Sex, data=na.omit(survey), family=binomial(link="logit"),
contrasts = list(Sex="contr.Sum"))
hand.sex.binom.null <- glm(W.Hnd ~ 1, data=na.omit(survey), family=binomial(link="logit"))
print(summary(hand.sex.binom))
##
## Call:
## glm(formula = W.Hnd ~ 0 + Sex, family = binomial(link = "logit"),
## data = na.omit(survey), contrasts = list(Sex = "contr.Sum"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3754 0.3503 0.3503 0.4172 0.4172
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## SexFemale 2.7600 0.4611 5.985 2.16e-09 ***
## SexMale 2.3979 0.3948 6.074 1.25e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 232.897 on 168 degrees of freedom
## Residual deviance: 86.099 on 166 degrees of freedom
## AIC: 90.099
##
## Number of Fisher Scoring iterations: 5
print(plogis(coef(hand.sex.binom)))
## SexFemale SexMale
## 0.9404762 0.9166667
# non-overlap of83% CIs corresponds to 95% CI of diff not crossing zero:
# https://rpubs.com/bbolker/overlapCI
print(plogis(confint(hand.sex.binom,level=0.83)))
## Waiting for profiling to be done...
## 8.5 % 91.5 %
## SexFemale 0.8984303 0.9694946
## SexMale 0.8691164 0.9518829
anova(hand.sex.binom.null, hand.sex.binom, test="Chisq")
| Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|
| 167 | 86.45906 | NA | NA | NA |
| 166 | 86.09853 | 1 | 0.3605357 | 0.5482089 |
| The \(p\)-valu | e here doesn’ | t exa | ctly match t | he proportion test, but that’s okay. The point estimates for the group-wise proportions are quite similar, and this model has the advantage of being generative. Besides, prop.test delivers two different answers depending on whether or not we enable Yates’ correction and our \(p\)-value falls right between them: |
print(prop.test(hand.sex[ , c(2,1)], correct=FALSE))
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: hand.sex[, c(2, 1)]
## X-squared = 0.54351, df = 1, p-value = 0.461
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.04120374 0.09103715
## sample estimates:
## prop 1 prop 2
## 0.9401709 0.9152542
print(prop.test(hand.sex[ , c(2,1)], correct=TRUE))
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: hand.sex[, c(2, 1)]
## X-squared = 0.23563, df = 1, p-value = 0.6274
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.04971453 0.09954794
## sample estimates:
## prop 1 prop 2
## 0.9401709 0.9152542
This suggests that our test is doing something right – it’s not quite as conservative as Yates correction, but because it’s an explicit GLM, it’s generative and so we can plot and make predictions and all that.
Now, some of you might have noticed that the multi-condition proportion test can be viewed as having “occurrences [out of total events]” instead of “successes” and “failures”
If you were paying attention earlier, this should make you think of the Poisson model. Let’s try to reformulate this as a Poisson model. First, let’s convert our data from single Bernoulli trials to observed occurences in a given time unit (here: everything).
hand.sex.poisson <- glm(Freq ~ 1 + Sex * W.Hnd, data=as.data.frame(hand.sex), family=poisson(),
contrasts = list(Sex="contr.Sum"))
hand.sex.poisson.null <- glm(Freq ~ 1 + Sex + W.Hnd, data=as.data.frame(hand.sex), family=poisson(),
contrasts = list(Sex="contr.Sum"))
print(summary(hand.sex.poisson))
##
## Call:
## glm(formula = Freq ~ 1 + Sex * W.Hnd, family = poisson(), data = as.data.frame(hand.sex),
## contrasts = list(Sex = "contr.Sum"))
##
## Deviance Residuals:
## [1] 0 0 0 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.40778 0.12777 26.671 <2e-16 ***
## Sex[S.Female] -0.08458 0.12777 -0.662 0.508
## W.Hnd[S.Left] -1.28353 0.12777 -10.046 <2e-16 ***
## Sex[S.Female]:W.Hnd[S.Left] -0.09376 0.12777 -0.734 0.463
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2.0429e+02 on 3 degrees of freedom
## Residual deviance: -1.1102e-15 on 0 degrees of freedom
## AIC: 29.026
##
## Number of Fisher Scoring iterations: 3
anova(hand.sex.poisson.null, hand.sex.poisson, test="Chisq")
| Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|
| 1 | 0.5462879 | NA | NA | NA |
| 0 | 0.0000000 | 1 | 0.5462879 | 0.4598384 |
The \(p\)-values for the likelihood-ratio and for the interaction term lie on either side of the \(p\)-value of uncorrected test of proportions. The LRTs are known to be more conservative than the Wald (coefficient) tests, so this is not surprising within the Poisson model and suggests moreover that the Poisson model largely agrees with the uncorrected test for equality of proportions.
Now, you’ve seen that the proportion test is effectively just a generalization of the binomial test. And you’ve seen that we can model the proportion test with a Poisson model. So the following facts from R’s documentation for poisson.test should not surprise you:
The one-sample case is effectively the binomial test with a very large n. The two sample case is converted to a binomial test by conditioning on the total event count, and the rate ratio is directly related to the odds in that binomial distribution.
Let’s take this bit by bit:
The one-sample case is effectively the binomial test with a very large n.
Hmmm.
print(summary(survey$W.Hnd))
## Left Right NA's
## 18 218 1
print(binom.test(218,n=236))
##
## Exact binomial test
##
## data: 218 and 236
## number of successes = 218, number of trials = 236, p-value <
## 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.8821360 0.9541722
## sample estimates:
## probability of success
## 0.9237288
print(prop.test(218,n=236))
##
## 1-sample proportions test with continuity correction
##
## data: 218 out of 236, null probability 0.5
## X-squared = 167.8, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.8801771 0.9528813
## sample estimates:
## p
## 0.9237288
print(poisson.test(218,236))
##
## Exact Poisson test
##
## data: 218 time base: 236
## number of events = 218, time base = 236, p-value = 0.2545
## alternative hypothesis: true event rate is not equal to 1
## 95 percent confidence interval:
## 0.8051693 1.0548307
## sample estimates:
## event rate
## 0.9237288
Oh wow, those do all line up. The Poisson terminology of “rate” per unit “time” sounds weird, but we can also think of “events” being the same as “sucesses” and “rate” being the same as “probability” and then “unit time” being the same as “observation”.
The two sample case is converted to a binomial test by conditioning on the total event count, and the rate ratio is directly related to the odds in that binomial distribution.
hand.sex <- xtabs(~Sex + W.Hnd,data=survey)
print(hand.sex)
## W.Hnd
## Sex Left Right
## Female 7 110
## Male 10 108
# reverse the column order so that right-handedness is "sucess"
hand.sex <- hand.sex[, c("Right","Left")]
binom.female <- binom.test(hand.sex["Female",])
binom.male <- binom.test(hand.sex["Male",])
print(sprintf("Ratio of conditional binomial probabilities: %f", binom.female$estimate / binom.male$estimate))
## [1] "Ratio of conditional binomial probabilities: 1.027224"
total_righties <- sum(hand.sex[,"Right"])
total_females <- sum(hand.sex["Female",])
total_males <- sum(hand.sex["Male",])
total_all <- sum(hand.sex)
binom.test(hand.sex[,"Right"], total_righties, total_females / total_all)
##
## Exact binomial test
##
## data: hand.sex[, "Right"]
## number of successes = 110, number of trials = 218, p-value =
## 0.8923
## alternative hypothesis: true probability of success is not equal to 0.4978723
## 95 percent confidence interval:
## 0.4362578 0.5727901
## sample estimates:
## probability of success
## 0.5045872
prop <- prop.test(hand.sex)
print(prop)
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: hand.sex
## X-squared = 0.23563, df = 1, p-value = 0.6274
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.04971453 0.09954794
## sample estimates:
## prop 1 prop 2
## 0.9401709 0.9152542
print(sprintf("Ratio of proportions: %f", prop$estimate[1]/prop$estimate[2]))
## [1] "Ratio of proportions: 1.027224"
print(poisson.test(hand.sex[,"Right"],c(total_females, total_males)))
##
## Comparison of Poisson rates
##
## data: hand.sex[, "Right"] time base: c(total_females, total_males)
## count1.Female = 110, expected count1 = 108.54, p-value = 0.8923
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
## 0.7804746 1.3522292
## sample estimates:
## rate ratio.Female
## 1.027224
Huh. The \(p\)-values from the binomial test and Poisson test line up perfectly, and ratio of proportions identical across all tests.
(live interactive demonstration)
cf. Lo and Andrews (2015, Frontiers in Psychology doi:10.3389/fpysg.2015.01171)