This article is part of the basic statistical analysis topic of the FETP
training, Thailand.
It aims to provide basic R code for comparing categorical data, for
readers who are not familiar with R.
The data set for this article is not provided.
Categorical data comparison
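For categorical data, we can compare the difference between proportions
or ratios with (Pearson’s) chi-square test or Fisher’s exact test. The
choice between the two methods depends on the sample size: the
chi-square test relies on a large-sample approximation, while Fisher’s
exact test is suited to small samples.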
Assumptions for the chi-square test:
1. The data should be a random sample.
2. The variables of interest are categorical.
3. All observations must be independent.
4. The expected count is 5 or more in at least 80% of the cells in
larger tables, and in every cell of a 2x2 table.
5. The sample size is large enough.
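The expected-count assumption can be checked before running the test. The following is a minimal sketch, assuming a made-up table of counts (`counts` and its `group` and `level` labels are hypothetical and not taken from the angina data). It computes the expected counts under independence by hand and compares them with the `expected` component returned by `chisq.test()`.

```r
# Hypothetical 2x3 table of counts (made up for illustration only)
counts <- matrix(c(20, 15,  5,
                   10, 25, 25),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(group = c("A", "B"),
                                 level = c("low", "mid", "high")))

# Expected count of a cell under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# The same values are returned by chisq.test() in its `expected` component
expected_builtin <- chisq.test(counts)$expected

# Share of cells whose expected count is at least 5
share_ok <- mean(expected >= 5)

print(expected)
print(share_ok)
```

If `share_ok` falls below 0.8, the chi-square approximation may be unreliable and an exact test is usually the safer choice.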
Hypothesis for the chi-square test: H0: the two proportions or ratios are the same; in other words, the difference between them is equal to zero.
Example: The angina dataset records demographic variables of 100 participants, such as age, sex, and smoking status, together with the outcome of whether or not they have angina.
## Start by loading the relevant libraries
library(readxl)
library(tidyverse)
angina <- read_xlsx("dataset_basic_2.xlsx",
sheet = "angina")
head(angina)
## # A tibble: 6 × 13
## id age sex ihdfami fat diabetes smoke alcohol weight height sbp
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 59 1 0 1 0 2 2 58.5 164. 185
## 2 2 55 1 0 0 0 1 1 73 174. 133
## 3 3 55 0 0 1 1 0 0 52 161 110
## 4 4 54 0 0 1 1 0 0 51 160 110
## 5 5 53 0 0 1 1 0 0 53 162 110
## 6 6 54 0 0 1 1 0 0 50 159 110
## # … with 2 more variables: dbp <dbl>, angina <dbl>
Is there any association between smoking and sex?
Let’s examine the data.
angina$smoke <- factor(angina$smoke,
labels = c("no","smoking low","smoking high"))
angina$sex <- factor(angina$sex,labels = c("female","male"))
prop.table(table(angina$smoke,angina$sex),2)
##
## female male
## no 0.4126984 0.0000000
## smoking low 0.4603175 0.3783784
## smoking high 0.1269841 0.6216216
The table shows that men tend to smoke more than women (100% vs 59%), but is this a true difference or just the result of chance? We can answer this question by setting the null hypothesis “The proportions of smokers among men and women are equal” and then using the chi-square test to test it.
chisq.test(x = angina$sex,
y = angina$smoke)
##
## Pearson's Chi-squared test
##
## data: angina$sex and angina$smoke
## X-squared = 34.031, df = 2, p-value = 4.076e-08
The test shows a p-value of less than 0.01, so we can reject the null hypothesis and conclude that the proportions of smokers among men and women are statistically different.
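The `X-squared`, `df` and `p-value` reported by `chisq.test()` can be reproduced by hand, which may help in reading the output. The sketch below assumes a made-up 2x2 table (`observed` is hypothetical, not the angina data) and turns off the continuity correction that `chisq.test()` applies to 2x2 tables by default, so that the hand calculation matches the plain Pearson formula.

```r
# Hypothetical 2x2 table of observed counts (made up for illustration only)
observed <- matrix(c(30, 20,
                     10, 40),
                   nrow = 2, byrow = TRUE)

# Expected counts under the null hypothesis of no association
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson's statistic: sum over cells of (observed - expected)^2 / expected
x_squared <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value: upper tail of the chi-square distribution
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

# Built-in test without continuity correction, for comparison
builtin <- chisq.test(observed, correct = FALSE)

print(c(by_hand = x_squared, builtin = unname(builtin$statistic)))
print(c(by_hand = p_value, builtin = builtin$p.value))
```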
Fisher’s exact test
Another useful statistical test for categorical data is Fisher’s exact
test, which comes into play when the sample size is small. Fisher’s
exact test gives a more conservative p-value.
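To illustrate what "exact" means here, the following sketch assumes a small made-up 2x2 table (`small` and its labels are hypothetical, not the ECMO data). It first shows that the expected counts are too small for the chi-square approximation, then reproduces the two-sided p-value of `fisher.test()` from the hypergeometric distribution: it sums the probabilities of all tables with the same margins that are no more likely than the observed one.

```r
# Hypothetical small 2x2 table (made up for illustration only)
small <- matrix(c(3, 1,
                  1, 3),
                nrow = 2, byrow = TRUE,
                dimnames = list(treatment = c("A", "B"),
                                outcome = c("yes", "no")))

# Expected counts are all below 5, so the chi-square approximation is doubtful
expected <- outer(rowSums(small), colSums(small)) / sum(small)
any_small <- any(expected < 5)

# Built-in exact test
exact <- fisher.test(small)

# By hand: with the margins fixed, the top-left cell is hypergeometric
m <- sum(small[, 1])       # column total "yes"
n <- sum(small[, 2])       # column total "no"
k <- sum(small[1, ])       # row total "A"
x_obs <- small[1, 1]       # observed top-left cell
support <- max(0, k - n):min(k, m)
probs <- dhyper(support, m, n, k)
p_obs <- dhyper(x_obs, m, n, k)

# Two-sided p-value: total probability of tables no more likely than the
# observed one (the small tolerance guards against floating-point ties)
p_by_hand <- sum(probs[probs <= p_obs * (1 + 1e-7)])

print(c(by_hand = p_by_hand, builtin = exact$p.value))
```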
Example
ecmo <- read_xlsx("dataset_basic_2.xlsx",
sheet = "ECMO")
head(ecmo)
## # A tibble: 6 × 6
## subject treatment result ...4 ...5 `Extracorporeal membrane oxygenation (E…`
## <dbl> <chr> <chr> <lgl> <lgl> <lgl>
## 1 1 CMT die NA NA NA
## 2 2 ECMO live NA NA NA
## 3 3 ECMO live NA NA NA
## 4 4 ECMO live NA NA NA
## 5 5 ECMO live NA NA NA
## 6 6 CMT live NA NA NA
The dataset records the treatment applied and the result of the
treatment for 39 participants.
Suppose we want to know whether there is an association between using
ECMO and death. We can set a null hypothesis such as “The proportion of
deaths among those treated with ECMO is equal to that among those
treated with CMT” and then test it. Because this dataset has a small
sample size, we will use Fisher’s exact test instead of the chi-square
test.
fisher.test(x = ecmo$treatment,
y = ecmo$result)
##
## Fisher's Exact Test for Count Data
##
## data: ecmo$treatment and ecmo$result
## p-value = 0.01102
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.366318 944.080411
## sample estimates:
## odds ratio
## 16.78571
The p-value from the test is 0.011, so at the 0.05 significance level we can reject the null hypothesis and conclude that the proportion of deaths among those treated with ECMO is statistically different from that among those treated with CMT. In other words, we can say “There is an association between using ECMO and death.”
McNemar’s test
McNemar’s test is another statistical test, used for dependent (paired)
categorical variables. The assumptions of McNemar’s test are:
1. One nominal dependent variable with two categories (i.e. a
dichotomous variable such as sick or not sick) and one independent
variable with two connected groups.
2. The categories of the dependent variable must be mutually exclusive.
3. Random sample.
postsurvey <- read_xlsx("dataset_basic_2.xlsx",
sheet = "PostSurvey")
head(postsurvey)
## # A tibble: 6 × 26
## ID gender age classification happy sleep_Tues sleep_Sat hair_color
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 1 Female 19 Sophomore 89 8 13 brown
## 2 2 Female 19 Sophomore 90 7 10 black
## 3 3 Female 18 Freshman 60 9 9 brown
## 4 4 Female 19 Sophomore 100 8.5 8.5 black
## 5 5 Female 18 Freshman 85 8 10 black
## 6 6 Female 19 Sophomore 90 6 10 brown
## # … with 18 more variables: exclusive <dbl>, greek <chr>, smoke <chr>,
## # talking_min <dbl>, texts_sent <dbl>, live_campus <chr>, roomates <dbl>,
## # austin <dbl>, commute <chr>, UT_sport <chr>, major <chr>,
## # hw_hours_HS <dbl>, hw_hours_college <dbl>, post_happy <dbl>,
## # post_exclusive <dbl>, post_smoke <chr>, post_talking_min <dbl>,
## # post_text_sent <dbl>
The dataset records the behaviors of students at the beginning of the semester compared with the end. Is there a difference in the proportion of smokers between the beginning and the end of the semester? Let’s examine our data.
postsurvey <- postsurvey %>%
mutate(smoke = case_when(smoke %in% c("yes","only socially")~ "yes",
TRUE ~ "no"),
post_smoke = case_when(post_smoke %in% c("yes","only socially")~ "yes",
TRUE ~ "no"))
##Grouping "yes" first
round(prop.table(table(x = postsurvey$smoke,
y = postsurvey$post_smoke),1),2)
## y
## x no yes
## no 0.98 0.02
## yes 0.11 0.89
We already know that these are dependent (paired) observations, so we should use McNemar’s test.
mcnemar.test(x = postsurvey$smoke,
y = postsurvey$post_smoke)
##
## McNemar's Chi-squared test with continuity correction
##
## data: postsurvey$smoke and postsurvey$post_smoke
## McNemar's chi-squared = 0.16667, df = 1, p-value = 0.6831
The p-value from the test above (0.68) shows that we fail to reject the null hypothesis, meaning there is no evidence that the proportion of smokers at the beginning of the semester differs from that at the end.
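The statistic reported by `mcnemar.test()` depends only on the discordant pairs, that is, the participants whose answer changed between the two measurements. The sketch below assumes a made-up paired table (`paired` and its counts are hypothetical, not the survey data) and reproduces the continuity-corrected statistic and p-value by hand.

```r
# Hypothetical paired 2x2 table: rows = before, columns = after
# (counts made up for illustration only)
paired <- matrix(c(30, 10,
                    4, 56),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(before = c("no", "yes"),
                                 after = c("no", "yes")))

# Discordant pairs: answers that changed between the two time points
n_no_yes <- paired["no", "yes"]   # "no" before, "yes" after
n_yes_no <- paired["yes", "no"]   # "yes" before, "no" after

# McNemar's statistic with continuity correction:
# (|b - c| - 1)^2 / (b + c), compared with a chi-square on 1 df
stat <- (abs(n_no_yes - n_yes_no) - 1)^2 / (n_no_yes + n_yes_no)
p_value <- pchisq(stat, df = 1, lower.tail = FALSE)

# Built-in test (continuity correction is on by default)
builtin <- mcnemar.test(paired)

print(c(by_hand = stat, builtin = unname(builtin$statistic)))
print(c(by_hand = p_value, builtin = builtin$p.value))
```

The concordant cells (30 and 56 in this made-up table) do not enter the calculation, which is why the test can be non-significant even when most participants report the same answer at both time points.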