Categorical Data

Author

Andrew Dalby

Hair and Eye Colour

This is a set of 227 observations of hair and eye colour taken from Statistical Methods in Biology, 2nd Edition, N.J.T Bailey, Hodder and Stoughton, London, 1981.

Eyes <- c(rep("blue",99),rep("grey/green",97),rep("brown",31))
Hair <- c(rep("red/fair", 65),rep("brown",26),rep("black",8),
          rep("red/fair", 32),rep("brown",41),rep("black",24),
          rep("red/fair", 5),rep("brown",16),rep("black",10))

HE <- data.frame(Eyes,Hair)
CT <- xtabs(~Eyes+Hair, data=HE)
addmargins(CT)
            Hair
Eyes         black brown red/fair Sum
  blue           8    26       65  99
  brown         10    16        5  31
  grey/green    24    41       32  97
  Sum           42    83      102 227
knitr::kable(addmargins(CT))

You can also perform the chi-squared test on the data and calculate the expected cell counts.

cst <- chisq.test(CT)
knitr::kable(addmargins(cst$expected))
                 black     brown   red/fair  Sum
blue         18.317181  36.19824   44.48458   99
brown         5.735683  11.33480   13.92952   31
grey/green   17.947137  35.46696   43.58590   97
Sum          42.000000  83.00000  102.00000  227
cst

    Pearson's Chi-squared test

data:  CT
X-squared = 34.945, df = 4, p-value = 4.768e-07
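
The expected counts and the test statistic can also be reproduced by hand, which makes clear what chisq.test is doing. The following is a minimal base-R sketch, assuming the observed counts are typed in directly from the table above; the variable names (observed, expected, x_squared) are illustrative and do not come from the original analysis.

```r
# Observed counts: rows = eye colour, columns = hair colour (from the table above)
observed <- matrix(c( 8, 26, 65,
                     10, 16,  5,
                     24, 41, 32),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(Eyes = c("blue", "brown", "grey/green"),
                                   Hair = c("black", "brown", "red/fair")))

# Expected count for each cell under independence:
# row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
x_squared <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)
```

Printing x_squared, df and p_value should agree with the output of chisq.test shown above.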

Gender of Offspring

This is a set of 116 observations on the gender of the offspring of six stallions, taken from Statistical Methods in Biology, 2nd Edition, N.J.T Bailey, Hodder and Stoughton, London, 1981.

Stallion <- c(rep(1,22),rep(2,18),rep(3,15),rep(4,21),rep(5,14),rep(6,26))
Gender <- c(rep("male",13),rep("female",9),
            rep("male",8),rep("female",10),
            rep("male",7),rep("female",8),
            rep("male",15),rep("female",6),
            rep("male",9),rep("female",5),
            rep("male",19),rep("female",7)
            )
offspring <- data.frame(Stallion,Gender)
con <- xtabs(~Gender+Stallion, data=offspring)
knitr::kable(addmargins(con))
         1   2   3   4   5   6  Sum
female   9  10   8   6   5   7   45
male    13   8   7  15   9  19   71
Sum     22  18  15  21  14  26  116
cst1 <- chisq.test(con)
knitr::kable(addmargins(cst1$expected))
                1          2          3          4          5         6  Sum
female   8.534483   6.982759   5.818966   8.146552   5.431034  10.08621   45
male    13.465517  11.017241   9.181035  12.853448   8.568965  15.91379   71
Sum     22.000000  18.000000  15.000000  21.000000  14.000000  26.00000  116
cst1

    Pearson's Chi-squared test

data:  con
X-squared = 6.03, df = 5, p-value = 0.3033

Inoculation and Cholera

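The 2x2 table is a special case because it has only a single degree of freedom. The test statistic is computed from discrete counts but is compared against the continuous chi-squared distribution, and with one degree of freedom this approximation tends to overstate significance. A correction, called the Yates continuity correction, is therefore applied when calculating the chi-squared statistic: 0.5 is subtracted from each absolute difference between observed and expected counts before squaring.

In epidemiology, where the 2x2 table cross-classifies exposure against outcome, we often also want to calculate the odds ratio or the risk ratio (relative risk).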

Once more the data are taken from Statistical Methods in Biology, 2nd Edition, N.J.T Bailey, Hodder and Stoughton, London, 1981. This time the data are for cholera inoculations.

vaccine <- c(rep("inoculated",100),rep("uninoculated",100))
disease <- c(rep("infected",11),rep("uninfected",89),rep("infected",21),rep("uninfected",79))
cholera <- data.frame(vaccine,disease)
con1 <- xtabs(~vaccine+disease, data=cholera)
knitr::kable(addmargins(con1))
              infected  uninfected  Sum
inoculated          11          89  100
uninoculated        21          79  100
Sum                 32         168  200

When we calculate the chi-squared statistic for a 2x2 table in R, chisq.test automatically applies the Yates continuity correction (its correct argument defaults to TRUE).

cst2 <- chisq.test(con1)
cst2

    Pearson's Chi-squared test with Yates' continuity correction

data:  con1
X-squared = 3.0134, df = 1, p-value = 0.08258
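
To see the effect of the correction, the corrected and uncorrected statistics can be computed side by side. This is a minimal base-R sketch, assuming the counts are entered directly from the cholera table; the variable names are illustrative only.

```r
# Cholera counts: rows = vaccine status, columns = disease status
observed <- matrix(c(11, 89,
                     21, 79),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(vaccine = c("inoculated", "uninoculated"),
                                   disease = c("infected", "uninfected")))

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
abs_diff <- abs(observed - expected)

# Uncorrected Pearson statistic
x_sq_plain <- sum(abs_diff^2 / expected)

# Yates correction: shrink each |O - E| by 0.5,
# but never by more than the smallest |O - E| in the table
yates <- min(0.5, abs_diff)
x_sq_yates <- sum((abs_diff - yates)^2 / expected)

# p-values on one degree of freedom
p_plain <- pchisq(x_sq_plain, df = 1, lower.tail = FALSE)
p_yates <- pchisq(x_sq_yates, df = 1, lower.tail = FALSE)
```

The corrected statistic is always the smaller of the two, so its p-value is the larger. The uncorrected test is also available directly as chisq.test(con1, correct = FALSE).
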
library(DescTools)
OddsRatio(con1, conf.level=0.05)
odds ratio     lwr.ci     upr.ci 
 0.4649545  0.4533515  0.4768545 

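As an alternative to the odds ratio, we can calculate the relative risk. Note that conf.level=0.05, as used in both calls here, requests a 5% confidence interval, which is why the limits sit so close to the estimates; the conventional 95% interval requires conf.level=0.95.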

RelRisk(con1, conf.level=0.05)
rel. risk    lwr.ci    upr.ci 
0.5238095 0.5126189 0.5352367 
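
Both estimates can be reproduced from the four cell counts. This is a minimal base-R sketch that also computes 95% intervals, assuming a simple Wald interval on the log scale. The helper wald_ci is hypothetical (it is not a DescTools function), and the interval method that RelRisk uses by default need not be the Wald method, so its limits may differ slightly.

```r
# Cell counts from the cholera table above
inoc_inf   <- 11; inoc_uninf   <- 89   # inoculated:   infected, uninfected
uninoc_inf <- 21; uninoc_uninf <- 79   # uninoculated: infected, uninfected

# Hypothetical helper: Wald confidence interval computed on the log scale
wald_ci <- function(estimate, se_log, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  c(estimate = estimate,
    lwr.ci   = exp(log(estimate) - z * se_log),
    upr.ci   = exp(log(estimate) + z * se_log))
}

# Odds ratio: odds of infection if inoculated / odds if uninoculated
odds_ratio <- (inoc_inf * uninoc_uninf) / (inoc_uninf * uninoc_inf)
se_log_or  <- sqrt(1 / inoc_inf + 1 / inoc_uninf +
                   1 / uninoc_inf + 1 / uninoc_uninf)

# Relative risk: risk of infection if inoculated / risk if uninoculated
n_inoc    <- inoc_inf + inoc_uninf
n_uninoc  <- uninoc_inf + uninoc_uninf
rel_risk  <- (inoc_inf / n_inoc) / (uninoc_inf / n_uninoc)
se_log_rr <- sqrt(1 / inoc_inf - 1 / n_inoc +
                  1 / uninoc_inf - 1 / n_uninoc)

or_ci <- wald_ci(odds_ratio, se_log_or, level = 0.95)
rr_ci <- wald_ci(rel_risk, se_log_rr, level = 0.95)
```

At the 95% level the Wald interval for the odds ratio runs from about 0.21 to about 1.02. It includes 1, which is consistent with the non-significant chi-squared test above.
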

A Rare Disease

When the cell counts in a 2x2 table are small, the chi-squared approximation becomes unreliable and Fisher's Exact Test can be used instead.

Once more the data are taken from Statistical Methods in Biology, 2nd Edition, N.J.T Bailey, Hodder and Stoughton, London, 1981. These data are for drug treatments of a rare disease.

drug <- c(rep("A",5),rep("B",4))
status <- c(rep("died",4),rep("recovered",1),rep("died",0),rep("recovered",4))
rare_disease <- data.frame(drug,status)
con2 <- xtabs(~drug+status, data=rare_disease)
knitr::kable(addmargins(con2))
     died  recovered  Sum
A       4          1    5
B       0          4    4
Sum     4          5    9
fisher.test(con2)

    Fisher's Exact Test for Count Data

data:  con2
p-value = 0.04762
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.7794899       Inf
sample estimates:
odds ratio 
       Inf 

In this case one of the cell counts is 0, so the estimated odds ratio is infinite (Inf) and the confidence interval has no finite upper limit. Only the lower limit of the interval is informative.
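
The p-value reported by fisher.test can be reproduced from the hypergeometric distribution: with the row and column totals held fixed, the table is determined by a single cell. This is a minimal base-R sketch, assuming the totals are entered from the table above; the variable names are illustrative, and the small tolerance factor is intended to mirror the one fisher.test uses when comparing probabilities.

```r
# Margins of the rare-disease table above
n_died      <- 4   # column total: died
n_recovered <- 5   # column total: recovered
n_drug_a    <- 5   # row total: drug A
observed_a_died <- 4   # observed cell: patients on drug A who died

# Possible values of the (A, died) cell given the fixed margins
possible <- max(0, n_drug_a - n_recovered):min(n_drug_a, n_died)

# Probability of each possible table under the null hypothesis
probs <- dhyper(possible, m = n_died, n = n_recovered, k = n_drug_a)
p_obs <- dhyper(observed_a_died, m = n_died, n = n_recovered, k = n_drug_a)

# Two-sided p-value: total probability of all tables that are
# no more probable than the observed one
p_value <- sum(probs[probs <= p_obs * (1 + 1e-7)])
```

Here only two of the five possible tables qualify (the observed one and the opposite extreme), giving p_value = 6/126, which rounds to the 0.04762 shown above.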