Goodness of Fit and Test for Independence

We will be using the \(\chi^2\) distribution. We calculate the test statistic from the observed counts \(O_i\) and the expected counts \(E_i\): \[ \chi^2= \sum_{i=1}^n\frac{\left(O_i-E_i\right)^2}{E_i} \] For a goodness-of-fit test with \(n\) categories, it has \(n-1\) degrees of freedom. Remember to always use the actual counts, not the probabilities, in the calculation!
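
As a quick illustration of the formula, here is a minimal sketch in R using made-up data (a hypothetical die rolled 60 times; these counts are not from the problems below). The names `observed`, `probs`, `expected`, `chi2_die`, and `df_die` are illustrative only. The point is that each probability is multiplied by the sample size to get an expected count before anything goes into the sum.

```r
# Hypothetical data: 60 rolls of a die, counts for faces 1 through 6
observed <- c(8, 12, 9, 11, 10, 10)

# A fair die gives each face probability 1/6
probs <- rep(1/6, 6)

# Convert probabilities to expected counts before using the formula
expected <- sum(observed) * probs

# Chi-squared statistic and its degrees of freedom (number of categories minus 1)
chi2_die <- sum((observed - expected)^2 / expected)
df_die <- length(observed) - 1

chi2_die
df_die
```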

  1. Among drivers who have had a car crash in the last year, 100 were randomly selected and categorized by age, with the results listed in the table below.
Age        Under 25   25-44   45-64   Over 64
Drivers          41      21      14        24

If all ages have the same crash rate, we would expect (because of the age distribution of licensed drivers) the given categories to have 16%, 44%, 27%, 13% of the subjects, respectively. At the 0.05 significance level, test the claim that the distribution of crashes conforms to the distribution of ages.

To investigate this, we are given the observed counts in the table and the expected counts via percentages based on the number of drivers in each group. Because the sample size is 100, each percentage is also the expected count. I’ll recreate this table in R and include both observed and expected.

data = data.frame('observed'=c(41,21,14,24),'expected' = c(16,44,27,13))
data
##   observed expected
## 1       41       16
## 2       21       44
## 3       14       27
## 4       24       13

Next we’ll need to run the test. This function requires that I pass the expected probabilities, so I divide the expected counts by 100.

chisq.test(data$observed, p=data$expected/100)
## 
##  Chi-squared test for given probabilities
## 
## data:  data$observed
## X-squared = 66.652, df = 3, p-value = 2.223e-14

That doesn’t seem very satisfying to me so I’d like to do it ‘by hand’ too.

chi2 = sum((data$observed - data$expected)^2/data$expected)
chi2
## [1] 66.65218

To get the probability based on the \(\chi^2\) statistic, we use the chisq family of functions. Put a p in front (pchisq) to get a probability and a q in front (qchisq) to get a critical value. The \(\chi^2\) test is always one-sided: only positive values are possible, and a perfect fit returns 0. Since pchisq returns the lower-tail probability, we subtract it from 1 to get the p-value.

1 - pchisq(chi2,3)
## [1] 2.220446e-14
qchisq(.95,3)
## [1] 7.814728
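
To finish the test, the statistic is compared with the critical value (or, equivalently, the p-value with 0.05). Here is a minimal sketch of that last step, re-entering the counts so it runs on its own. The names `crit` and `p_value` are my own labels, not part of the original code.

```r
# Problem 1 counts: observed crashes and expected counts (percent of 100 drivers)
observed <- c(41, 21, 14, 24)
expected <- c(16, 44, 27, 13)

chi2 <- sum((observed - expected)^2 / expected)

# Critical value for a 0.05 significance level with 4 - 1 = 3 degrees of freedom
crit <- qchisq(0.95, df = 3)

# Upper-tail probability; lower.tail = FALSE avoids subtracting from 1
p_value <- pchisq(chi2, df = 3, lower.tail = FALSE)

chi2 > crit      # TRUE means reject the null hypothesis
p_value < 0.05   # the same decision, made with the p-value
```

Since 66.65 is far above 7.81, we reject the null hypothesis: the crash counts do not conform to the age distribution of licensed drivers.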

  2. It has been suggested that the highest priority of retirees is travel. Thus, a study was conducted to investigate the differences in the length of stay of a trip for pre- and post-retirees. A sample of 707 travelers was asked how long they stayed on a typical trip. The observed results of the study are found below.
Number of Nights   Pre-retirement   Post-retirement   Total
4-7                           240               163     403
8-13                           79                74     153
14-21                          37                51      88
22 or more                     23                40      63
Total                         379               328     707

This problem is different from the previous one; it is sometimes called a test for independence. Here we are not asking whether the data follow an expected distribution. Rather, we are looking to see whether the categorical variable ‘retirement’ has any effect. The null hypothesis says that retirement has no effect on the number of nights. The alternative says that retirement does have an effect.

data2 = data.frame('pre' = c(240,79,37,23),'post'=c(163,74,51,40))
data2
##   pre post
## 1 240  163
## 2  79   74
## 3  37   51
## 4  23   40

I don’t type in any of the totals because R can compute them much faster than I can! The test in R is not hard either.

chiresults = chisq.test(data2)
print(chiresults)
## 
##  Pearson's Chi-squared test
## 
## data:  data2
## X-squared = 18.105, df = 3, p-value = 0.0004184

We can also extract some other information from the test. The expected values are:

chiresults$expected
##            pre      post
## [1,] 216.03536 186.96464
## [2,]  82.01839  70.98161
## [3,]  47.17397  40.82603
## [4,]  33.77228  29.22772
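
For the curious, each of those expected counts is (row total × column total) / grand total. Here is a sketch of doing that by hand. `observed2`, `expected2`, and `chi2_indep` are names I’ve introduced for this example, and the counts are re-entered so the block runs on its own.

```r
# Problem 2 counts: rows are the night categories, columns are pre/post retirement
observed2 <- cbind(pre  = c(240, 79, 37, 23),
                   post = c(163, 74, 51, 40))

# Expected count in each cell = row total * column total / grand total
expected2 <- outer(rowSums(observed2), colSums(observed2)) / sum(observed2)
expected2

# Same sum as in the first problem, taken over every cell of the table
chi2_indep <- sum((observed2 - expected2)^2 / expected2)
chi2_indep
```

The result should agree with the `chiresults$expected` table and the X-squared value printed by `chisq.test`.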

There are a few other things you can access from the test like this, but they are all on the printout so I won’t do them all. Here are the degrees of freedom.

chiresults$parameter
## df 
##  3

One last note about tables with multiple rows and columns: the degrees of freedom are \((r-1)(c-1)\), where \(r\) is the number of rows and \(c\) is the number of columns. Here that is \((4-1)(2-1)=3\).
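
The degrees-of-freedom rule can be checked directly from the shape of the table. This is a minimal sketch; `tab`, `df_indep`, `stat`, and `p_indep` are illustrative names, and the counts are re-entered so it runs on its own.

```r
# Problem 2 table without totals: 4 rows (night categories), 2 columns (pre, post)
tab <- cbind(pre  = c(240, 79, 37, 23),
             post = c(163, 74, 51, 40))

# Degrees of freedom from the table's shape: (r - 1)(c - 1)
df_indep <- (nrow(tab) - 1) * (ncol(tab) - 1)
df_indep

# Upper-tail p-value for the statistic computed by chisq.test
stat <- unname(chisq.test(tab)$statistic)
p_indep <- pchisq(stat, df = df_indep, lower.tail = FALSE)
p_indep
```

The problem statement does not give a significance level. Assuming the same 0.05 level as in the first problem, this p-value would lead us to reject the null hypothesis that retirement has no effect on the length of stay.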