Categorical Data

Author

Andrew Dalby

Hair and Eye Colour

This is a set of 227 observations of hair and eye colour, taken from Statistical Methods in Biology, 2nd Edition, N.J.T. Bailey, Hodder and Stoughton, London, 1981.

Eyes <- c(rep("blue",99),rep("grey/green",97),rep("brown",31))
Hair <- c(rep("red/fair", 65),rep("brown",26),rep("black",8),
          rep("red/fair", 32),rep("brown",41),rep("black",24),
          rep("red/fair", 5),rep("brown",16),rep("black",10))

HE <- data.frame(Eyes,Hair)
CT <- xtabs(~Eyes+Hair, data=HE)
addmargins(CT)

            Hair
Eyes         black brown red/fair Sum
  blue           8    26       65  99
  brown         10    16        5  31
  grey/green    24    41       32  97
  Sum           42    83      102 227

knitr::kable(addmargins(CT))

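You can also perform the chi-squared test of independence on the table. The object returned by chisq.test stores the expected cell counts, that is, the counts we would expect if hair colour and eye colour were independent.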

cst <- chisq.test(CT)
knitr::kable(addmargins(cst$expected))

	black	brown	red/fair	Sum
blue	18.317181	36.19824	44.48458	99
brown	5.735683	11.33480	13.92952	31
grey/green	17.947137	35.46696	43.58590	97
Sum	42.000000	83.00000	102.00000	227
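
The expected counts above can also be reproduced by hand, since each one is the row total multiplied by the column total and divided by the grand total. The following is a minimal sketch of that calculation; the names observed and expected are illustrative and do not appear in the original analysis, and the counts are typed in directly from the table rather than rebuilt with rep().

```r
# Observed counts, entered in the same layout as the contingency table above
observed <- matrix(c( 8, 26, 65,
                     10, 16,  5,
                     24, 41, 32),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(Eyes = c("blue", "brown", "grey/green"),
                                   Hair = c("black", "brown", "red/fair")))

# Expected count for each cell = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 3)
```

For example, the blue/black cell is 99 * 42 / 227, which agrees with the first entry of cst$expected.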

cst


    Pearson's Chi-squared test

data:  CT
X-squared = 34.945, df = 4, p-value = 4.768e-07

Gender of Offspring

This is a set of 116 observations of the gender of the offspring of six stallions, taken from Statistical Methods in Biology, 2nd Edition, N.J.T. Bailey, Hodder and Stoughton, London, 1981.

Stallion <- c(rep(1,22),rep(2,18),rep(3,15),rep(4,21),rep(5,14),rep(6,26))
Gender <- c(rep("male",13),rep("female",9),
            rep("male",8),rep("female",10),
            rep("male",7),rep("female",8),
            rep("male",15),rep("female",6),
            rep("male",9),rep("female",5),
            rep("male",19),rep("female",7)
            )
offspring <- data.frame(Stallion,Gender)
con <- xtabs(~Gender+Stallion, data=offspring)
knitr::kable(addmargins(con))

	1	2	3	4	5	6	Sum
female	9	10	8	6	5	7	45
male	13	8	7	15	9	19	71
Sum	22	18	15	21	14	26	116

cst1 <- chisq.test(con)
knitr::kable(addmargins(cst1$expected))

	1	2	3	4	5	6	Sum
female	8.534483	6.982759	5.818966	8.146552	5.431034	10.08621	45
male	13.465517	11.017241	9.181035	12.853448	8.568965	15.91379	71
Sum	22.000000	18.000000	15.000000	21.000000	14.000000	26.00000	116

cst1


    Pearson's Chi-squared test

data:  con
X-squared = 6.03, df = 5, p-value = 0.3033
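
The statistic, degrees of freedom and p-value reported above can be checked by hand. This is a sketch of the standard Pearson calculation, the sum of (observed - expected)^2 / expected over all cells, with (rows - 1) * (columns - 1) degrees of freedom. The variable names are illustrative and the counts are typed in from the table above.

```r
# Observed offspring counts: rows are gender, columns are stallions 1 to 6
observed <- matrix(c( 9, 10, 8,  6, 5,  7,
                     13,  8, 7, 15, 9, 19),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("female", "male"),
                                   Stallion = as.character(1:6)))

# Expected counts under independence of gender and stallion
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-squared statistic, degrees of freedom and upper-tail p-value
x_squared <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

c(X_squared = x_squared, df = df, p_value = p_value)
```

With a p-value of about 0.30 there is no evidence here that the proportion of male offspring differs between stallions.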

Inoculation and Cholera

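The 2x2 table is a special case because it has only a single degree of freedom. The chi-squared distribution is continuous whereas the cell counts are discrete, and with one degree of freedom the approximation tends to overstate the significance of the result. A correction is therefore applied to the calculation of the chi-squared statistic: 0.5 is subtracted from the absolute difference between each observed and expected count before it is squared. This is called the Yates continuity correction.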

For the epidemiological case, which is a specific variant of the 2x2 table, we often also want to calculate the risk ratio.

Once more the data are taken from Statistical Methods in Biology, 2nd Edition, N.J.T. Bailey, Hodder and Stoughton, London, 1981. This time the data are for cholera inoculations.

vaccine <- c(rep("inoculated",100),rep("uninoculated",100))
disease <- c(rep("infected",11),rep("uninfected",89),rep("infected",21),rep("uninfected",79))
cholera <- data.frame(vaccine,disease)
con1 <- xtabs(~vaccine+disease, data=cholera)
knitr::kable(addmargins(con1))

	infected	uninfected	Sum
inoculated	11	89	100
uninoculated	21	79	100
Sum	32	168	200

When we calculate the chi-squared value for a 2x2 table in R, chisq.test automatically applies the Yates continuity correction.

cst2 <- chisq.test(con1)
cst2


    Pearson's Chi-squared test with Yates' continuity correction

data:  con1
X-squared = 3.0134, df = 1, p-value = 0.08258
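
To see what the continuity correction does to this table, the corrected and uncorrected statistics can be computed side by side. The sketch below types in the counts from the table above and uses illustrative variable names; the uncorrected value is what chisq.test returns when called with correct = FALSE.

```r
# Observed counts for the cholera table
observed <- matrix(c(11, 89,
                     21, 79),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(vaccine = c("inoculated", "uninoculated"),
                                   disease = c("infected", "uninfected")))

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Uncorrected Pearson statistic
x_uncorrected <- sum((observed - expected)^2 / expected)

# Yates correction: reduce each absolute deviation by 0.5 before squaring
x_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)

c(uncorrected = x_uncorrected, yates = x_yates)
```

The correction always makes the statistic smaller, so the corrected test is the more conservative of the two.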

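The odds ratio can be calculated with the OddsRatio function from the DescTools package. Note that conf.level=0.05 requests a 5% confidence interval, which is why the limits below lie so close to the estimate; the conventional 95% interval would need conf.level=0.95.

library(DescTools)
OddsRatio(con1, conf.level=0.05)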

odds ratio     lwr.ci     upr.ci 
 0.4649545  0.4533515  0.4768545
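
For comparison, a conventional 95% interval for the odds ratio can be sketched in base R without DescTools. This assumes the Wald interval on the log odds ratio scale, which appears to be consistent with the limits printed above but is an assumption about the method rather than something stated in the source. The cell names n11 to n22 are illustrative.

```r
# Cell counts from the cholera table
n11 <- 11  # inoculated, infected
n12 <- 89  # inoculated, uninfected
n21 <- 21  # uninoculated, infected
n22 <- 79  # uninoculated, uninfected

# Odds ratio: odds of infection when inoculated / odds when uninoculated
odds_ratio <- (n11 * n22) / (n12 * n21)

# Wald standard error of the log odds ratio
se_log_or <- sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)

# 95% interval: estimate +/- 1.96 standard errors on the log scale
z <- qnorm(0.975)
ci <- exp(log(odds_ratio) + c(-1, 1) * z * se_log_or)

round(c(odds_ratio = odds_ratio, lwr = ci[1], upr = ci[2]), 4)
```

The resulting interval should run from roughly 0.21 to 1.02. It includes 1, which is in line with the non-significant chi-squared test above.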

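As an alternative to the odds ratio we can calculate the relative risk, which here is the risk of infection in the inoculated group divided by the risk in the uninoculated group. As written, conf.level=0.05 gives a 5% confidence interval, hence the narrow limits; use conf.level=0.95 for the conventional 95% interval.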

RelRisk(con1, conf.level=0.05)

rel. risk    lwr.ci    upr.ci 
0.5238095 0.5126189 0.5352367
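
The relative risk itself is simple to verify by hand, and a 95% interval can be sketched in the same way. The interval below is the Wald interval on the log scale, chosen here for simplicity; it is an assumption and will not necessarily match the method RelRisk uses by default, so small differences from DescTools output are to be expected. Variable names are illustrative.

```r
# Infected counts and group sizes from the cholera table
infected_inoc <- 11
n_inoc <- 100
infected_uninoc <- 21
n_uninoc <- 100

# Relative risk: risk of infection if inoculated / risk if uninoculated
rel_risk <- (infected_inoc / n_inoc) / (infected_uninoc / n_uninoc)

# Wald standard error of the log relative risk
se_log_rr <- sqrt(1 / infected_inoc - 1 / n_inoc +
                  1 / infected_uninoc - 1 / n_uninoc)

# 95% interval on the log scale, transformed back
z <- qnorm(0.975)
ci <- exp(log(rel_risk) + c(-1, 1) * z * se_log_rr)

round(c(rel_risk = rel_risk, lwr = ci[1], upr = ci[2]), 4)
```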

A Rare Disease

When the cell counts in a 2x2 table are small, the chi-squared approximation becomes unreliable, and Fisher's exact test can be used instead.

Once more the data are taken from Statistical Methods in Biology, 2nd Edition, N.J.T. Bailey, Hodder and Stoughton, London, 1981. These data are for drug treatments of a rare disease.

drug <- c(rep("A",5),rep("B",4))
status <- c(rep("died",4),rep("recovered",1),rep("died",0),rep("recovered",4))
rare_disease <- data.frame(drug,status)
con2 <- xtabs(~drug+status, data=rare_disease)
knitr::kable(addmargins(con2))

	died	recovered	Sum
A	4	1	5
B	0	4	4
Sum	4	5	9

fisher.test(con2)


    Fisher's Exact Test for Count Data

data:  con2
p-value = 0.04762
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.7794899       Inf
sample estimates:
odds ratio 
       Inf

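In this case the odds ratio estimate is infinite because one of the cells is 0, so no finite upper confidence limit can be calculated; only the lower limit (0.78) is finite. The p-value does not depend on the odds ratio estimate and is unaffected.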
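
The p-value from Fisher's exact test can be reproduced from the hypergeometric distribution. With all row and column totals held fixed, the table is determined by the number of deaths on drug A, and the two-sided p-value is the total probability of every table that is no more likely than the one observed. The sketch below uses illustrative variable names and the margins from the table above.

```r
# Fixed margins: 5 patients on drug A, 4 on drug B, 4 deaths in total
n_A <- 5
n_B <- 4
n_died <- 4

# Probability of each possible number of deaths on drug A (0 to 4)
deaths_on_A <- 0:min(n_A, n_died)
probs <- dhyper(deaths_on_A, m = n_A, n = n_B, k = n_died)

# Probability of the observed table (all 4 deaths on drug A)
observed_prob <- probs[deaths_on_A == 4]

# Two-sided p-value: sum over tables no more likely than the observed one
# (the small tolerance guards against floating-point ties)
p_value <- sum(probs[probs <= observed_prob * (1 + 1e-7)])
p_value
```

Only two tables qualify, the observed one (probability 5/126) and the opposite extreme with no deaths on drug A (probability 1/126), giving 6/126, which matches the 0.04762 reported by fisher.test.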