C. Donovan
Most things examined thus far have fitted into the general framework of linear models.
There is a group of tests conducted on tables of counts (contingency tables) which don't fit so well in this framework; we cover these now (you need to be au fait with them).
However, you could use Generalised Linear Models for similar analyses (as seen in MT5761)
We look at intimately related tests
The testing approach is somewhat familiar
Consider the following data tabulating support for Democratic, Republican or Independent candidates. In total there are 2757 individuals in the sample (example from Agresti 2007).
Gender | Democrat | Independent | Republican |
---|---|---|---|
Female | 762 | 327 | 468 |
Male | 484 | 239 | 477 |
Immediate questions arise:
To answer this, we need to address:
We'll return to this example later
Following our themes of drugs and/or disaster
Case study: The Scottish Schools Adolescent Lifestyle and Substance Use Survey (SALSUS) was established by the Scottish Executive to provide a broad-based approach to monitoring substance use among young people in Scotland.
For example, does this SALSUS sample represent Scotland ethnically?
We will assess the similarity of the SALSUS sample ethnicity to that of the census population.
Ethnicity | Census (%) | SALSUS (%) | SALSUS counts |
---|---|---|---|
White | 97.26 | 95.23 | 21249 |
Bangladeshi | 0.1 | 0.1 | 23 |
Indian | 0.3 | 0.3 | 68 |
Pakistani | 1.0 | 0.9 | 204 |
Mixed | 0.5 | 0.9 | 204 |
Chinese | 0.4 | 0.5 | 113 |
Blk African | 0.1 | 0.2 | 45 |
Blk Caribbean | 0.02 | 0.1 | 23 |
Blk Other | 0.02 | 0.5 | 45 |
Other | 0.3 | 1.2 | 272 |
Base | | | 22246 |
The usual strategy:
As for all tests, we have a null (\( H_0 \)) and alternative hypothesis (\( H_1 \)). In this case:
We obtain expected counts directly using our null hypothesis:
Expected count = total \( \times \) specified cell probability
e.g. For the 'White' group: 97.26% of 22246 =
\[ 22246 \times \frac{97.26}{100}=22246 \times 0.9726=21636.46 \]
e.g. For the 'Bangladeshi' group: 0.10% of 22246 =
\[ 22246 \times \frac{0.10}{100}=22246 \times 0.001=22.246 \]
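The same calculation can be done for all ten groups at once. This is a minimal sketch, assuming the census percentages from the table above are used as the null-hypothesis probabilities; the names `n`, `censusProb` and `expectedCounts` are illustrative and not part of the original code.

```r
# Census percentages converted to probabilities (the null hypothesis), in table order
censusProb <- c(White = 97.26, Bangladeshi = 0.1, Indian = 0.3, Pakistani = 1,
                Mixed = 0.5, Chinese = 0.4, `Blk African` = 0.1,
                `Blk Caribbean` = 0.02, `Blk Other` = 0.02, Other = 0.3) / 100

n <- 22246  # SALSUS sample size (the 'Base' row of the table)

# Expected count = total x specified cell probability
expectedCounts <- n * censusProb
round(expectedCounts, 2)
```

Because the probabilities sum to one, the expected counts sum to the sample size.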
The observed counts are obtained from our sample, e.g. 21249, 23, …, 272. We calculate a measure of difference between the observed and expected counts using the Chi-square test statistic:
Chi-square test statistic
\[ \begin{aligned} x^2_0 &= \sum_{\textrm{all cells}} \frac{(\textrm{observed count} - \textrm{expected count})^2}{\textrm{expected count}}\\ &= \sum_{\textrm{all cells}} \frac{(O - E)^2}{E} \end{aligned} \]
Note that a key component of this statistic is the simple (squared) distance \( (O-E)^2 \) between what our theory predicts and what we observed in our data. For example, the 'White' group contributes:
\[ \frac{(21249-21636.45)^2}{21636.45}=6.938 \]
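Summing such terms over all cells gives the test statistic. The sketch below does this by hand, assuming the degrees of freedom are the number of categories minus one (no parameters are estimated from the data); the variable names are illustrative.

```r
# Observed SALSUS counts and census probabilities, in table order
observed   <- c(21249, 23, 68, 204, 204, 113, 45, 23, 45, 272)
censusProb <- c(97.26, 0.1, 0.3, 1, 0.5, 0.4, 0.1, 0.02, 0.02, 0.3) / 100

# Expected counts under the null hypothesis
expected <- sum(observed) * censusProb

# Per-cell contributions (O - E)^2 / E and their sum, the test statistic
contrib <- (observed - expected)^2 / expected
x2 <- sum(contrib)

# Degrees of freedom: number of categories minus one
dfGof <- length(observed) - 1

x2
dfGof
```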
x <- seq(0, 30, length = 100)
plot(x, dchisq(x, 9), type = 'l', lwd = 2)
Looking at that distribution, you know the p-value for a test statistic of 1193.9 is effectively zero:
pchisq(1193.9, 9, lower.tail = FALSE)
[1] 2.515343e-251
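An equivalent way to read the plot is to compare the test statistic with a critical value. A small sketch, assuming a 5% significance level (the level is an assumption, not stated above):

```r
alpha <- 0.05                           # assumed significance level
critValue <- qchisq(1 - alpha, df = 9)  # upper 5% point of Chi-square(9)
x2_0 <- 1193.9                          # test statistic from above

critValue
x2_0 > critValue  # TRUE means the statistic lies in the rejection region
```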
What is underlying this result? Why do we have such a huge test statistic?
Of course you don't do these things by hand…
ethnicGroup <- c("White", "Bangladeshi", "Indian", "Pakistani", "Mixed",
"Chinese", "Blk African", "Blk Caribbean", "Blk Other", "Other")
salus <- c(21249, 23, 68, 204, 204, 113, 45, 23, 45, 272)
census <- c(97.26, 0.1, 0.3, 1, 0.5, 0.4, 0.1, 0.02, 0.02, 0.3)/100
# specify our table counts (salus) and the hypothesised dist (p = census)
# note a discrete distribution, hence probabilities
salus_test <- chisq.test(salus, p = census)
salusDF <- data.frame(ethnicGroup, salus, census)
head(salusDF)
ethnicGroup salus census
1 White 21249 0.9726
2 Bangladeshi 23 0.0010
3 Indian 68 0.0030
4 Pakistani 204 0.0100
5 Mixed 204 0.0050
6 Chinese 113 0.0040
The test object salus_test has various useful components:
names(salus_test)
[1] "statistic" "parameter" "p.value" "method" "data.name" "observed"
[7] "expected" "residuals" "stdres"
the test statistic (statistic = 1193.894752)
the p-value (p.value = \( 2.5219128 \times 10^{-251} \))
the observed & expected counts (observed, expected)
Calculating \( (O-E)^2/E \):
chiContrib <- data.frame(ethnicGroup, observed = salus_test$observed,
expected = salus_test$expected, chisq_contrib = (salus_test$observed-salus_test$expected)^2/salus_test$expected)
knitr::kable(chiContrib, digits = 2)
ethnicGroup | observed | expected | chisq_contrib |
---|---|---|---|
White | 21249 | 21636.46 | 6.94 |
Bangladeshi | 23 | 22.25 | 0.03 |
Indian | 68 | 66.74 | 0.02 |
Pakistani | 204 | 222.46 | 1.53 |
Mixed | 204 | 111.23 | 77.37 |
Chinese | 113 | 88.98 | 6.48 |
Blk African | 45 | 22.25 | 23.27 |
Blk Caribbean | 23 | 4.45 | 77.35 |
Blk Other | 45 | 4.45 | 369.59 |
Other | 272 | 66.74 | 631.31 |
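To see which groups dominate, each contribution can be expressed as a share of the total statistic. A minimal sketch computed by hand from the same data (the name `share` is illustrative):

```r
ethnicGroup <- c("White", "Bangladeshi", "Indian", "Pakistani", "Mixed",
                 "Chinese", "Blk African", "Blk Caribbean", "Blk Other", "Other")
observed   <- c(21249, 23, 68, 204, 204, 113, 45, 23, 45, 272)
censusProb <- c(97.26, 0.1, 0.3, 1, 0.5, 0.4, 0.1, 0.02, 0.02, 0.3) / 100

expected <- sum(observed) * censusProb
contrib  <- (observed - expected)^2 / expected

# Share of the test statistic attributable to each group
share <- contrib / sum(contrib)
names(share) <- ethnicGroup

# Percentage shares, largest first
round(100 * sort(share, decreasing = TRUE), 1)
```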
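What drives this? The largest contributions come from the 'Other', 'Blk Other', 'Mixed' and 'Blk Caribbean' groups, where the sample contains far more pupils than the census proportions predict. So maybe this is a biased sample or, more likely, the demographics are changing.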
15-year-old pupils' attitudes (percentages) to those involved with drugs, by drug use status: Scotland 2002.
Statement | Agree (%) | Disagree (%) | Don't know (%) | \( N \) |
---|---|---|---|---|
Used drugs in last month | ||||
People my age who take drugs need help and advice | 21 | 64 | 15 | 2235 |
All people who sell drugs should be punished | 26 | 59 | 15 | 2242 |
People who take drugs are stupid | 21 | 67 | 12 | 2234 |
All people who take drugs should be punished | 6 | 85 | 9 | 2242 |
People who take heroin are junkies | 59 | 30 | 11 | 2234 |
Never used drugs | ||||
People my age who take drugs need help and advice | 76 | 10 | 14 | 6466 |
All people who sell drugs should be punished | 70 | 15 | 15 | 6461 |
People who take drugs are stupid | 65 | 22 | 13 | 6463 |
All people who take drugs should be punished | 29 | 47 | 23 | 6459 |
People who take heroin are junkies | 51 | 20 | 29 | 6458 |
We are going to look at one particular statement, 'People who take drugs are stupid':
Drug use status | Agree | Disagree | Don't know | Total |
---|---|---|---|---|
Used drugs in last month | 469 | 1497 | 268 | 2234 |
Never used drugs | 4201 | 1422 | 840 | 6463 |
Totals | 4670 | 2919 | 1108 | 8697 |
15-year-old pupils' attitudes to those involved with drugs, by drug use status: Scotland 2002
We want to test if attitudes in the population towards those who take drugs differ with drug use.
The null hypothesis (\( H_0 \)): the samples have the same underlying distribution, i.e. opinions are the same for those who have used drugs in the last month and those who have never used drugs.
The alternative hypothesis (\( H_1 \)): the samples do not have the same underlying distribution, i.e. opinions are not the same for those who have used drugs in the last month and those who have never used drugs.
We obtain the expected counts, under the null hypothesis, using:
\[ E = \frac{\textrm{Row total} \times \textrm{Column total}}{\textrm{Grand Total}} \]
e.g. 'Used drugs in last month' and 'Agree': \[ \frac{2234 \times 4670}{8697}=1199.58 \]
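The row-total times column-total rule can be applied to every cell in one step with an outer product. A minimal sketch using the counts from the table above (the name `expectedCounts` is illustrative):

```r
# Observed counts: rows are drug use status, columns are responses
drugsTable <- rbind("Used recently" = c(469, 1497, 268),
                    "Never used"    = c(4201, 1422, 840))
colnames(drugsTable) <- c("Agree", "Disagree", "Don't know")

# E = (row total x column total) / grand total, for every cell
expectedCounts <- outer(rowSums(drugsTable), colSums(drugsTable)) / sum(drugsTable)
round(expectedCounts, 2)
```

The expected counts keep the same row and column totals as the observed table; only the pattern within the table is replaced by the one implied by \( H_0 \).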
Are our data consistent with the null hypothesis? Our observed counts are obtained from our sample, e.g. 469, 1497, …, 840.
Easier in R
drugsTable <- as.table(rbind(c(469, 1497, 268), c(4201, 1422, 840)))
dimnames(drugsTable) <- list(group = c("Used recently", "Never used"),
response = c("Agree","Disagree", "Don't know"))
# see what I did here
drugTest <- chisq.test(drugsTable)
drugTest
Pearson's Chi-squared test
data: drugsTable
X-squared = 1602, df = 2, p-value < 2.2e-16
Expected values
knitr::kable(drugTest$expected)
group | Agree | Disagree | Don't know |
---|---|---|---|
Used recently | 1199.584 | 749.8041 | 284.6122 |
Never used | 3470.416 | 2169.1959 | 823.3878 |
Observed values
knitr::kable(drugTest$observed)
group | Agree | Disagree | Don't know |
---|---|---|---|
Used recently | 469 | 1497 | 268 |
Never used | 4201 | 1422 | 840 |
Their contributions to the test statistic
contributionTable <- (drugTest$observed - drugTest$expected)^2/drugTest$expected
knitr::kable(contributionTable)
group | Agree | Disagree | Don't know |
---|---|---|---|
Used recently | 444.9482 | 744.5969 | 0.9696143 |
Never used | 153.8008 | 257.3773 | 0.3351568 |
sum(contributionTable)
[1] 1602.028
Discrepancy between observed and expected counts, as before:
\( \chi^2 \) test statistic
\[ x^2_0=\sum_{\textrm{all cells}} \frac{\textrm{(O - E)}^2}{\textrm{E}} \]
\[ \frac{[469-1199.58]^2}{1199.58}=444.95 \]
\[ x^2_0 = 444.95+ 744.60+ 0.97+ 153.80+ 257.38+ 0.34=1602.04 \]
\[ df=(3-1) \times (2-1) = 2 \times 1 = 2 \]
qchisq(0.95, 2)
[1] 5.991465
\[ Pr(\chi^2 \geq x^2_0) ~~\textrm{where} ~~ \chi^2 \sim \textrm{Chi-square}(df) \]
pchisq(1602, 2, lower.tail = FALSE)
[1] 0
drugTest$p.value
[1] 0
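The reported value of 0 is floating-point underflow rather than a literally zero probability. One way to see the magnitude, sketched here with the `log.p` argument of `pchisq`, is to work on the log scale:

```r
x2_0 <- 1602.028  # test statistic from sum(contributionTable)

# Natural log of the upper-tail probability, avoiding underflow
logP <- pchisq(x2_0, df = 2, lower.tail = FALSE, log.p = TRUE)

logP            # natural-log scale
logP / log(10)  # log10 scale: the p-value is roughly 10 to this power
```

For 2 degrees of freedom the upper-tail probability is exactly \( e^{-x/2} \), so the log p-value is simply minus half the test statistic.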
Let's return to voters and gender
So we seek to test \( H_0 \): gender and voting intention are independent
Gender | Democrat | Independent | Republican |
---|---|---|---|
Female | 762 | 327 | 468 |
Male | 484 | 239 | 477 |
So how do we calculate expected counts assuming \( H_0 \) is true?
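Under independence, the probability of each cell is the product of its row and column probabilities, so the expected count is the sample size times that product. A minimal sketch, with illustrative names `pGender`, `pParty` and `expectedCounts`:

```r
# Observed counts: rows are gender, columns are party
voterCounts <- rbind(Female = c(762, 327, 468),
                     Male   = c(484, 239, 477))
colnames(voterCounts) <- c("Democrat", "Independent", "Republican")

n <- sum(voterCounts)

# Estimated marginal probabilities
pGender <- rowSums(voterCounts) / n
pParty  <- colSums(voterCounts) / n

# Under H0 (independence): P(gender, party) = P(gender) x P(party)
expectedCounts <- n * outer(pGender, pParty)
round(expectedCounts, 2)
```

This is algebraically the same as (row total × column total) / grand total.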
voterTable <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(voterTable) <- list(gender = c("F", "M"),
party = c("Democrat","Independent", "Republican"))
voterTest <- chisq.test(voterTable)
knitr::kable(voterTable)
gender | Democrat | Independent | Republican |
---|---|---|---|
F | 762 | 327 | 468 |
M | 484 | 239 | 477 |
voterTest
Pearson's Chi-squared test
data: voterTable
X-squared = 30.07, df = 2, p-value = 2.954e-07
Observed and expected values
knitr::kable(voterTest$expected)
gender | Democrat | Independent | Republican |
---|---|---|---|
F | 703.6714 | 319.6453 | 533.6834 |
M | 542.3286 | 246.3547 | 411.3166 |
knitr::kable(voterTest$observed)
gender | Democrat | Independent | Republican |
---|---|---|---|
F | 762 | 327 | 468 |
M | 484 | 239 | 477 |
Their contributions to the test statistic
contributionTable <- (voterTest$observed - voterTest$expected)^2/voterTest$expected
knitr::kable(contributionTable)
gender | Democrat | Independent | Republican |
---|---|---|---|
F | 4.834967 | 0.1692254 | 8.084012 |
M | 6.273369 | 0.2195700 | 10.489006 |
sum(contributionTable)
[1] 30.07015
pchisq(sum(contributionTable), 2, lower.tail = FALSE)
[1] 2.953589e-07
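The Generalised Linear Model route mentioned at the outset can be sketched for the same table. This is only an illustration under stated assumptions: a Poisson log-linear model with main effects for gender and party (no interaction) is used as the independence model, and its residual deviance is used as a likelihood-ratio alternative to the Pearson statistic. The names `voterDF`, `indepFit` and `G2` are illustrative.

```r
voterTable <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(voterTable) <- list(gender = c("F", "M"),
                             party = c("Democrat", "Independent", "Republican"))

# Long format: one row per cell, with columns gender, party and Freq
voterDF <- as.data.frame(voterTable)

# Independence model: main effects only, no gender:party interaction
indepFit <- glm(Freq ~ gender + party, family = poisson, data = voterDF)

# Residual deviance as a likelihood-ratio test of independence
G2 <- deviance(indepFit)
pG2 <- pchisq(G2, df = df.residual(indepFit), lower.tail = FALSE)

G2
pG2
```

The fitted values of this model coincide with the expected counts used by `chisq.test`, and the deviance is close to, but not identical to, the Pearson statistic.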
knitr::kable(voterTest$stdres)
gender | Democrat | Independent | Republican |
---|---|---|---|
F | 4.502053 | 0.6994517 | -5.315945 |
M | -4.502053 | -0.6994517 | 5.315945 |
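The standardised residuals rescale each \( O-E \) by an estimate of its standard deviation under \( H_0 \), so they can be read roughly like z-scores; values beyond about ±2 are often taken as a sign that a cell departs from independence (a rule of thumb, not a formal test). A sketch reproducing `stdres` by hand, with illustrative variable names:

```r
voterTable <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(voterTable) <- list(gender = c("F", "M"),
                             party = c("Democrat", "Independent", "Republican"))
voterTest <- chisq.test(voterTable)

O <- voterTest$observed
E <- voterTest$expected
n <- sum(O)

# Marginal proportions
rowProp <- rowSums(O) / n
colProp <- colSums(O) / n

# (O - E) / sqrt(E x (1 - row proportion) x (1 - column proportion))
stdresManual <- (O - E) / sqrt(E * outer(1 - rowProp, 1 - colProp))
round(stdresManual, 3)
```

Here the Democrat and Republican columns carry the large residuals, with more women than expected favouring Democrats and more men than expected favouring Republicans.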
As ever, there are assumptions
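One commonly quoted condition, stated here as an assumption rather than taken from the source, is that the Chi-square approximation works best when expected counts are not too small (often taken as at least 5). The sketch below checks this for the SALSUS example and, as one possible alternative, uses the `simulate.p.value` option of `chisq.test` to obtain a Monte Carlo p-value.

```r
salus  <- c(21249, 23, 68, 204, 204, 113, 45, 23, 45, 272)
census <- c(97.26, 0.1, 0.3, 1, 0.5, 0.4, 0.1, 0.02, 0.02, 0.3) / 100

# Expected counts under the null hypothesis
expected <- sum(salus) * census

# Which cells fall below the rule-of-thumb threshold of 5?
smallCells <- expected < 5
sum(smallCells)

# Monte Carlo p-value, which does not rely on the Chi-square approximation
set.seed(1)  # for reproducibility of the simulation
simTest <- chisq.test(salus, p = census, simulate.p.value = TRUE, B = 2000)
simTest$p.value
```

With `B` simulated tables the smallest p-value that can be reported is 1/(B + 1), so the simulated p-value is bounded away from zero even when the evidence is overwhelming.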
We've covered:
Upcoming: