Categorical Data Analysis 1

Introduction

A two-way contingency table is a cross-classification of observations by the levels of two discrete variables.

There are 3 important points we can obtain from the contingency tables: - Compare relative frequency of occurrence of some characteristics of the two groups - Are two characteristics independent? - Is one characteristics a cause for another?

Data: - BMI data set - Smoke data set

Objective:

The goal of this study is to go over

- Distribution of one categorical variable
- Joint and conditional distributions for two categorical variables
- Side by side and stacked bar plots
- Creating a contingency table
- Test for Independence

Load Libraries

library(gmodels)
library(RColorBrewer)

#PREPARING WORK SPAcE
# Clear the workspace: 
rm(list = ls())

df<-read.csv('BMI.csv')
head(df)

##   age Sex         BMI DGEstimate DgDifference
## 1  43   2      normal         32          -11
## 2  63   1       obese         32          -31
## 3  50   1      normal         47           -3
## 4  51   2 underweight         48           -3
## 5  73   2      normal         45          -28
## 6  58   1 underweight         41          -17

#Reordering Factors

# R orders factors alphabetically,therefore we need to re order them based on their level.
df$BMI = factor(df$BMI, levels=c('underweight', 'normal', 'overweight', 'obese'))
df$Sex = factor(df$Sex, levels=c('1', '2'), labels=c('Male', 'Female'))

head(df)

##   age    Sex         BMI DGEstimate DgDifference
## 1  43 Female      normal         32          -11
## 2  63   Male       obese         32          -31
## 3  50   Male      normal         47           -3
## 4  51 Female underweight         48           -3
## 5  73 Female      normal         45          -28
## 6  58   Male underweight         41          -17

Frequency table for the Categorical Variable (BMI)

freqBMI <- table(df$BMI)
relfreqBMI <- table(df$BMI)/nrow(df)

cbind(freqBMI, relfreqBMI)

##             freqBMI relfreqBMI
## underweight     127      0.254
## normal          233      0.466
## overweight      109      0.218
## obese            31      0.062

Frequency table for the Categorical Variable (Sex)

freqSex <- table(df$Sex)
relfreqSex <- table(df$Sex)/nrow(df)

cbind(freqSex, relfreqSex)

##        freqSex relfreqSex
## Male       215       0.43
## Female     285       0.57

joint =CrossTable(df$BMI, df$Sex, prop.chisq=FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  500 
## 
##  
##              | df$Sex 
##       df$BMI |      Male |    Female | Row Total | 
## -------------|-----------|-----------|-----------|
##  underweight |        50 |        77 |       127 | 
##              |     0.394 |     0.606 |     0.254 | 
##              |     0.233 |     0.270 |           | 
##              |     0.100 |     0.154 |           | 
## -------------|-----------|-----------|-----------|
##       normal |       110 |       123 |       233 | 
##              |     0.472 |     0.528 |     0.466 | 
##              |     0.512 |     0.432 |           | 
##              |     0.220 |     0.246 |           | 
## -------------|-----------|-----------|-----------|
##   overweight |        43 |        66 |       109 | 
##              |     0.394 |     0.606 |     0.218 | 
##              |     0.200 |     0.232 |           | 
##              |     0.086 |     0.132 |           | 
## -------------|-----------|-----------|-----------|
##        obese |        12 |        19 |        31 | 
##              |     0.387 |     0.613 |     0.062 | 
##              |     0.056 |     0.067 |           | 
##              |     0.024 |     0.038 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       215 |       285 |       500 | 
##              |     0.430 |     0.570 |           | 
## -------------|-----------|-----------|-----------|
## 
##

#Contingency table, 
joint$t

##              y
## x             Male Female
##   underweight   50     77
##   normal       110    123
##   overweight    43     66
##   obese         12     19

Bar graph for Counts

#Side by side
joint_counts=joint$t

coul <- brewer.pal(5, "Set2")

barplot(joint_counts, beside=TRUE, col= coul, ylab= 'Frequency', xlab= 'Sex')

legend('topright', c('Underweight', 'Normal','Overwiehgt', 'Obese'), pch=15, col=coul)

#Stacked
barplot(joint_counts, col= coul, ylab= 'Frequency', xlab= 'Sex')

legend('center', c('Underweight', 'Normal','Overwiehgt', 'Obese'), pch=15, col=coul)

#Conditional distribution of sex given BMI,
joint$prop.row

##              y
## x                  Male    Female
##   underweight 0.3937008 0.6062992
##   normal      0.4721030 0.5278970
##   overweight  0.3944954 0.6055046
##   obese       0.3870968 0.6129032

#Conditional distribution of BMI given Sex,
joint$prop.col

##              y
## x                   Male     Female
##   underweight 0.23255814 0.27017544
##   normal      0.51162791 0.43157895
##   overweight  0.20000000 0.23157895
##   obese       0.05581395 0.06666667

#Contingency table displaying relative frequencies.
joint$prop.tbl

##              y
## x              Male Female
##   underweight 0.100  0.154
##   normal      0.220  0.246
##   overweight  0.086  0.132
##   obese       0.024  0.038

Exploring Contingency Table

tab <- with(df,table(BMI,Sex))
tab

##              Sex
## BMI           Male Female
##   underweight   50     77
##   normal       110    123
##   overweight    43     66
##   obese         12     19

Creating a Contingency Table

# Using matrix function to create 2x2 contingency table
df1<-matrix(c(123,94,68,104),2,2)
dimnames(df1)= list(sex=c('Male','Female'), Smoke=c('Yes', 'No'))

df1

##         Smoke
## sex      Yes  No
##   Male   123  68
##   Female  94 104

#Converting into a table
df1 <- as.table(df1)
str(df1)

##  'table' num [1:2, 1:2] 123 94 68 104
##  - attr(*, "dimnames")=List of 2
##   ..$ sex  : chr [1:2] "Male" "Female"
##   ..$ Smoke: chr [1:2] "Yes" "No"

Chi-Squares Test of Independence

The hypotheis test of independence can be tested using the general method of goodness of fit test.

H0: the independence model is true HA: the saturated model is true

Note that the independence model is a sub model of the saturated model.

#Let's use the smoke example for the independence test

### Pearson's Chi-squared test  with Yates' continuity correction
result<-chisq.test(df1)
result

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  df1
## X-squared = 10.612, df = 1, p-value = 0.001123

We can conclude that there is a strong evidence for rejecting the independence model because p value < 0.05.

###Let's look at the obseved, expected values and the residuals
result$observed

##         Smoke
## sex      Yes  No
##   Male   123  68
##   Female  94 104

result$expected

##         Smoke
## sex           Yes       No
##   Male   106.5476 84.45244
##   Female 110.4524 87.54756

result$residuals

##         Smoke
## sex            Yes        No
##   Male    1.593891 -1.790294
##   Female -1.565463  1.758362

###Pearson's Chi-squared test  withOUT Yates' continuity correction
result<-chisq.test(df1, correct=FALSE)
result

## 
##  Pearson's Chi-squared test
## 
## data:  df1
## X-squared = 11.288, df = 1, p-value = 0.00078

We can also conclude without Yate’s continuoty correction there is a strong evidence for rejecting the independence model because p value < 0.05.

result$observed

##         Smoke
## sex      Yes  No
##   Male   123  68
##   Female  94 104

result$expected

##         Smoke
## sex           Yes       No
##   Male   106.5476 84.45244
##   Female 110.4524 87.54756

result$residuals

##         Smoke
## sex            Yes        No
##   Male    1.593891 -1.790294
##   Female -1.565463  1.758362

References:
1. Colorado state Lesson Notes.(Generalized Liner models)
2. https://online.stat.psu.edu/stat504/lesson/3/3.1/3.1.1

***********************

Categorical Data Analysis 1

Mustafa Arslan

10/22/2021