We will be using the \(\chi^2\) distribution. The test statistic is calculated from the observed counts \(O_i\) and the expected counts \(E_i\): \[ \chi^2= \sum_{i=1}^n\frac{\left(O_i-E_i\right)^2}{E_i} \] With \(n\) categories it has \(n-1\) degrees of freedom. Remember to always use the actual counts, not the probabilities, in the calculation!
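As a minimal sketch of the formula above, the statistic can be wrapped in a small R function. The name `chi_sq_stat` and the toy counts are made up for illustration; neither is part of base R or of the problems that follow.

```r
# Hypothetical helper (not part of base R): Pearson chi-squared statistic
# from observed counts O_i and expected counts E_i.
chi_sq_stat <- function(observed, expected) {
  stopifnot(length(observed) == length(expected), all(expected > 0))
  sum((observed - expected)^2 / expected)
}

# Toy counts, invented purely to exercise the function.
toy_observed <- c(18, 22, 20)
toy_expected <- c(20, 20, 20)

chi_sq_stat(toy_observed, toy_expected)   # (4 + 4 + 0) / 20 = 0.4
length(toy_observed) - 1                  # degrees of freedom, n - 1 = 2
```

A perfect fit, where every observed count equals its expected count, gives a statistic of 0.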
| Age | Under 25 | 25-44 | 45-64 | Over 64 |
|---|---|---|---|---|
| Drivers | 41 | 21 | 14 | 24 |
If all ages have the same crash rate, we would expect (because of the age distribution of licensed drivers) the given categories to have 16%, 44%, 27%, 13% of the subjects, respectively. At the 0.05 significance level, test the claim that the distribution of crashes conforms to the distribution of ages.
To investigate this, we are given the observed counts in the table and the expected values as percentages, based on the number of licensed drivers in each group. I’ll recreate this table in R with both observed and expected columns. The observed crashes total 100, so the percentages are also the expected counts.
data = data.frame('observed'=c(41,21,14,24),'expected' = c(16,44,27,13))
data
## observed expected
## 1 41 16
## 2 21 44
## 3 14 27
## 4 24 13
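For a sample whose total is not 100, the expected counts would be the total multiplied by each hypothesized proportion. A minimal sketch of that conversion, with variable names chosen just for this example:

```r
observed <- c(41, 21, 14, 24)      # crash counts from the table
p <- c(16, 44, 27, 13) / 100       # hypothesized proportions per age group

n <- sum(observed)                 # total number of crashes
expected_counts <- n * p           # expected count per age group
expected_counts
```

With these data `n` is 100, so `expected_counts` matches the `expected` column above.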
Next we’ll need to run the test. The function requires that I pass the expected probabilities, so the percentages are divided by 100.
chisq.test(data$observed, p=data$expected/100)
##
## Chi-squared test for given probabilities
##
## data: data$observed
## X-squared = 66.652, df = 3, p-value = 2.223e-14
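The printout can also be read programmatically. The sketch below stores the result and compares its p-value with the 0.05 significance level from the problem; the names `gof` and `alpha` are chosen here for illustration.

```r
observed <- c(41, 21, 14, 24)
p <- c(0.16, 0.44, 0.27, 0.13)
alpha <- 0.05                       # significance level from the problem

gof <- chisq.test(observed, p = p)  # goodness-of-fit test
gof$statistic                       # the X-squared value from the printout
gof$p.value                         # the p-value from the printout
gof$p.value < alpha                 # TRUE means reject the null hypothesis
```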
That doesn’t seem very satisfying to me so I’d like to do it ‘by hand’ too.
chi2 = sum((data$observed - data$expected)^2/data$expected)
chi2
## [1] 66.65218
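Breaking the sum into its four terms shows which age group drives the result. This is only a sketch; the `contrib` vector and its labels are introduced here for illustration.

```r
observed <- c(41, 21, 14, 24)
expected <- c(16, 44, 27, 13)

# Contribution of each age group to the chi-squared statistic.
contrib <- (observed - expected)^2 / expected
names(contrib) <- c("Under 25", "25-44", "45-64", "Over 64")

round(contrib, 3)
names(which.max(contrib))   # group with the largest contribution
sum(contrib)                # adds back up to the full statistic
```

The under-25 group contributes \(25^2/16 \approx 39.06\), more than half of the total.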
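To get the probability based on the \(\chi^2\) statistic, we use the chisq family of functions. Put a p in front (pchisq) to get a cumulative probability and a q in front (qchisq) to get a critical value. The \(\chi^2\) test is always one-sided: only positive values are possible and a perfect fit returns 0. The p-value is the area to the right of the statistic, so the cumulative probability is subtracted from 1.
1 - pchisq(chi2,3)
## [1] 2.220446e-14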
qchisq(.95,3)
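## [1] 7.814728
The test statistic of 66.65 is far above the critical value of 7.81, so at the 0.05 level we reject the claim that crashes are distributed like the ages of licensed drivers.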
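The same decision can be sketched in code. Here the upper-tail probability is requested directly with the `lower.tail = FALSE` argument of `pchisq`, which avoids the loss of precision from subtracting a number very close to 1; the variable names are chosen for this example.

```r
observed <- c(41, 21, 14, 24)
expected <- c(16, 44, 27, 13)
chi2 <- sum((observed - expected)^2 / expected)
df <- length(observed) - 1          # 4 categories, so 3 degrees of freedom
alpha <- 0.05

crit <- qchisq(1 - alpha, df)       # critical value, about 7.81
chi2 > crit                         # TRUE: statistic is in the rejection region

p_upper <- pchisq(chi2, df, lower.tail = FALSE)  # area to the right of chi2
p_upper < alpha                     # TRUE: same conclusion from the p-value
```

Both comparisons always agree, since the critical value is the point where the upper-tail area equals `alpha`.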
| Number of Nights | Pre-retirement | Post-retirement | Total |
|---|---|---|---|
| 4−7 | 240 | 163 | 403 |
| 8−13 | 79 | 74 | 153 |
| 14−21 | 37 | 51 | 88 |
| 22 or more | 23 | 40 | 63 |
| Total | 379 | 328 | 707 |
This problem is different from the previous one; it is sometimes called a test for independence. Here we are not asking whether the data follow an expected distribution, but rather whether the categorical variable ‘retirement’ has any effect. The null hypothesis says that retirement has no effect on the number of nights. The alternative says that retirement does have an effect.
data2 = data.frame('pre' = c(240,79,37,23),'post'=c(163,74,51,40))
data2
## pre post
## 1 240 163
## 2 79 74
## 3 37 51
## 4 23 40
I don’t type in any of the totals because R can do that much faster than I can! The test in R is not hard either.
chiresults = chisq.test(data2)
print(chiresults)
##
## Pearson's Chi-squared test
##
## data: data2
## X-squared = 18.105, df = 3, p-value = 0.0004184
We can also extract some other info from the test. Expected values are,
chiresults$expected
## pre post
## [1,] 216.03536 186.96464
## [2,] 82.01839 70.98161
## [3,] 47.17397 40.82603
## [4,] 33.77228 29.22772
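These expected values come from the table margins: each cell is its row total times its column total, divided by the grand total. A sketch that rebuilds them and the test statistic by hand, with helper names chosen for this example:

```r
data2 <- data.frame(pre = c(240, 79, 37, 23), post = c(163, 74, 51, 40))

row_tot <- rowSums(data2)   # totals per nights category: 403 153 88 63
col_tot <- colSums(data2)   # totals for pre and post: 379 328
grand   <- sum(data2)       # grand total: 707

# Expected count in each cell = row total * column total / grand total.
expected2 <- outer(row_tot, col_tot) / grand
expected2

# Pearson statistic from the observed and expected counts.
stat2 <- sum((as.matrix(data2) - expected2)^2 / expected2)
stat2
```

For example, the first cell is \(403 \times 379 / 707 \approx 216.04\), and `stat2` should agree with the X-squared value in the printout.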
There are a few other things you can access from the test like this, but they are all on the printout so I won’t do them all. Here are the degrees of freedom.
chiresults$parameter
## df
## 3
One last note about tables with more than one column of counts: the degrees of freedom are \((r-1)(c-1)\), where \(r\) is the number of rows and \(c\) is the number of columns.
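As a quick check of that formula on the retirement table, the sketch below computes the degrees of freedom from the table dimensions and then the matching p-value and critical value. The name `df2` is chosen for this example, and 18.105 is the rounded statistic from the printout.

```r
data2 <- data.frame(pre = c(240, 79, 37, 23), post = c(163, 74, 51, 40))

df2 <- (nrow(data2) - 1) * (ncol(data2) - 1)   # (4 - 1) * (2 - 1) = 3
df2

# Upper-tail p-value for the statistic reported by chisq.test.
pchisq(18.105, df2, lower.tail = FALSE)

# Critical value at the 0.05 significance level.
qchisq(0.95, df2)
```

The totals row and column are left out of `data2`, so `nrow` and `ncol` count only the actual categories.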