A two-way contingency table is a cross-classification of observations by the levels of two discrete variables.
There are 3 important points we can obtain from the contingency tables: - Compare relative frequency of occurrence of some characteristics of the two groups - Are two characteristics independent? - Is one characteristics a cause for another?
Data: - BMI data set - Smoke data set
Objective:
The goal of this study is to go over
- Distribution of one categorical variable
- Joint and conditional distributions for two categorical variables
- Side by side and stacked bar plots
- Creating a contingency table
- Test for Independence
Load Libraries
library(gmodels)
library(RColorBrewer)
#PREPARING WORK SPAcE
# Clear the workspace:
rm(list = ls())
df<-read.csv('BMI.csv')
head(df)
## age Sex BMI DGEstimate DgDifference
## 1 43 2 normal 32 -11
## 2 63 1 obese 32 -31
## 3 50 1 normal 47 -3
## 4 51 2 underweight 48 -3
## 5 73 2 normal 45 -28
## 6 58 1 underweight 41 -17
#Reordering Factors
# R orders factors alphabetically,therefore we need to re order them based on their level.
df$BMI = factor(df$BMI, levels=c('underweight', 'normal', 'overweight', 'obese'))
df$Sex = factor(df$Sex, levels=c('1', '2'), labels=c('Male', 'Female'))
head(df)
## age Sex BMI DGEstimate DgDifference
## 1 43 Female normal 32 -11
## 2 63 Male obese 32 -31
## 3 50 Male normal 47 -3
## 4 51 Female underweight 48 -3
## 5 73 Female normal 45 -28
## 6 58 Male underweight 41 -17
Frequency table for the Categorical Variable (BMI)
freqBMI <- table(df$BMI)
relfreqBMI <- table(df$BMI)/nrow(df)
cbind(freqBMI, relfreqBMI)
## freqBMI relfreqBMI
## underweight 127 0.254
## normal 233 0.466
## overweight 109 0.218
## obese 31 0.062
Frequency table for the Categorical Variable (Sex)
freqSex <- table(df$Sex)
relfreqSex <- table(df$Sex)/nrow(df)
cbind(freqSex, relfreqSex)
## freqSex relfreqSex
## Male 215 0.43
## Female 285 0.57
joint =CrossTable(df$BMI, df$Sex, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 500
##
##
## | df$Sex
## df$BMI | Male | Female | Row Total |
## -------------|-----------|-----------|-----------|
## underweight | 50 | 77 | 127 |
## | 0.394 | 0.606 | 0.254 |
## | 0.233 | 0.270 | |
## | 0.100 | 0.154 | |
## -------------|-----------|-----------|-----------|
## normal | 110 | 123 | 233 |
## | 0.472 | 0.528 | 0.466 |
## | 0.512 | 0.432 | |
## | 0.220 | 0.246 | |
## -------------|-----------|-----------|-----------|
## overweight | 43 | 66 | 109 |
## | 0.394 | 0.606 | 0.218 |
## | 0.200 | 0.232 | |
## | 0.086 | 0.132 | |
## -------------|-----------|-----------|-----------|
## obese | 12 | 19 | 31 |
## | 0.387 | 0.613 | 0.062 |
## | 0.056 | 0.067 | |
## | 0.024 | 0.038 | |
## -------------|-----------|-----------|-----------|
## Column Total | 215 | 285 | 500 |
## | 0.430 | 0.570 | |
## -------------|-----------|-----------|-----------|
##
##
#Contingency table,
joint$t
## y
## x Male Female
## underweight 50 77
## normal 110 123
## overweight 43 66
## obese 12 19
Bar graph for Counts
#Side by side
joint_counts=joint$t
coul <- brewer.pal(5, "Set2")
barplot(joint_counts, beside=TRUE, col= coul, ylab= 'Frequency', xlab= 'Sex')
legend('topright', c('Underweight', 'Normal','Overwiehgt', 'Obese'), pch=15, col=coul)
#Stacked
barplot(joint_counts, col= coul, ylab= 'Frequency', xlab= 'Sex')
legend('center', c('Underweight', 'Normal','Overwiehgt', 'Obese'), pch=15, col=coul)
#Conditional distribution of sex given BMI,
joint$prop.row
## y
## x Male Female
## underweight 0.3937008 0.6062992
## normal 0.4721030 0.5278970
## overweight 0.3944954 0.6055046
## obese 0.3870968 0.6129032
#Conditional distribution of BMI given Sex,
joint$prop.col
## y
## x Male Female
## underweight 0.23255814 0.27017544
## normal 0.51162791 0.43157895
## overweight 0.20000000 0.23157895
## obese 0.05581395 0.06666667
#Contingency table displaying relative frequencies.
joint$prop.tbl
## y
## x Male Female
## underweight 0.100 0.154
## normal 0.220 0.246
## overweight 0.086 0.132
## obese 0.024 0.038
Exploring Contingency Table
tab <- with(df,table(BMI,Sex))
tab
## Sex
## BMI Male Female
## underweight 50 77
## normal 110 123
## overweight 43 66
## obese 12 19
Creating a Contingency Table
# Using matrix function to create 2x2 contingency table
df1<-matrix(c(123,94,68,104),2,2)
dimnames(df1)= list(sex=c('Male','Female'), Smoke=c('Yes', 'No'))
df1
## Smoke
## sex Yes No
## Male 123 68
## Female 94 104
#Converting into a table
df1 <- as.table(df1)
str(df1)
## 'table' num [1:2, 1:2] 123 94 68 104
## - attr(*, "dimnames")=List of 2
## ..$ sex : chr [1:2] "Male" "Female"
## ..$ Smoke: chr [1:2] "Yes" "No"
Chi-Squares Test of Independence
The hypotheis test of independence can be tested using the general method of goodness of fit test.
H0: the independence model is true HA: the saturated model is true
Note that the independence model is a sub model of the saturated model.
#Let's use the smoke example for the independence test
### Pearson's Chi-squared test with Yates' continuity correction
result<-chisq.test(df1)
result
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: df1
## X-squared = 10.612, df = 1, p-value = 0.001123
We can conclude that there is a strong evidence for rejecting the independence model because p value < 0.05.
###Let's look at the obseved, expected values and the residuals
result$observed
## Smoke
## sex Yes No
## Male 123 68
## Female 94 104
result$expected
## Smoke
## sex Yes No
## Male 106.5476 84.45244
## Female 110.4524 87.54756
result$residuals
## Smoke
## sex Yes No
## Male 1.593891 -1.790294
## Female -1.565463 1.758362
###Pearson's Chi-squared test withOUT Yates' continuity correction
result<-chisq.test(df1, correct=FALSE)
result
##
## Pearson's Chi-squared test
##
## data: df1
## X-squared = 11.288, df = 1, p-value = 0.00078
We can also conclude without Yateโs continuoty correction there is a strong evidence for rejecting the independence model because p value < 0.05.
result$observed
## Smoke
## sex Yes No
## Male 123 68
## Female 94 104
result$expected
## Smoke
## sex Yes No
## Male 106.5476 84.45244
## Female 110.4524 87.54756
result$residuals
## Smoke
## sex Yes No
## Male 1.593891 -1.790294
## Female -1.565463 1.758362
References: 1. Colorado state Lesson Notes.(Generalized Liner models) 2. https://online.stat.psu.edu/stat504/lesson/3/3.1/3.1.1