Basic statistical analysis with R : categorical data comparison

This is a part of the basic statistical analysis topic of FETP training, Thailand.
This article aims to provide basic R code for comparing categorical data, for those who are not familiar with R.
The data set for this article is not provided.
Categorical data comparison
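For categorical data we can compare the difference between proportions or ratios with (Pearson’s) chi-square test and Fisher’s exact test. The choice between these two methods depends mainly on the sample size: the chi-square test relies on a large-sample approximation, whereas Fisher’s exact test is suited to small samples.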
Assumptions for the chi-square test
1. The data should be a random sample.
2. The variables of interest are categorical.
3. All observations must be independent.
4. The expected count is 5 or more in at least 80% of cells in larger tables, with no cell having an expected count below 1; in a 2x2 table, every cell should have an expected count of 5 or more.
5. The sample size is large enough.
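
The expected-count assumption can be checked before running the test. The sketch below uses a hypothetical table of counts (the names `tab`, `share_ok` and `min_expected` are made up for illustration and are not part of the training data set). It computes the expected counts under independence and applies the usual rule of thumb.

```r
# Hypothetical 3 x 2 table of counts (illustration only, not the article's data)
tab <- matrix(c(12,  3,
                 9,  4,
                 2, 10),
              nrow = 3, byrow = TRUE,
              dimnames = list(exposure = c("none", "low", "high"),
                              outcome  = c("no", "yes")))

# Expected count of each cell under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
expected

# Share of cells with an expected count of at least 5, and the smallest one
share_ok     <- mean(expected >= 5)
min_expected <- min(expected)

# Rule of thumb: at least 80% of cells >= 5 and no cell below 1
use_chisq <- share_ok >= 0.8 && min_expected >= 1
use_chisq
```

The same expected counts are returned by `chisq.test(tab)$expected`, so the check can also be made after fitting the test.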

Hypothesis for the chi-square test: H0: the two proportions or ratios are the same; in other words, the difference between them is equal to zero.

Example
The angina dataset records demographic variables of 100 participants, such as age, sex and smoking status, together with the outcome of whether or not they have angina.

## Start by loading the relevant libraries
library(readxl) 
library(tidyverse)

angina <- read_xlsx("dataset_basic_2.xlsx",
                    sheet = "angina")
            
head(angina)

## # A tibble: 6 × 13
##      id   age   sex ihdfami   fat diabetes smoke alcohol weight height   sbp
##   <dbl> <dbl> <dbl>   <dbl> <dbl>    <dbl> <dbl>   <dbl>  <dbl>  <dbl> <dbl>
## 1     1    59     1       0     1        0     2       2   58.5   164.   185
## 2     2    55     1       0     0        0     1       1   73     174.   133
## 3     3    55     0       0     1        1     0       0   52     161    110
## 4     4    54     0       0     1        1     0       0   51     160    110
## 5     5    53     0       0     1        1     0       0   53     162    110
## 6     6    54     0       0     1        1     0       0   50     159    110
## # … with 2 more variables: dbp <dbl>, angina <dbl>

Is there any association between smoking and sex?
Let us examine the data.

angina$smoke <- factor(angina$smoke,
                       labels = c("no","smoking low","smoking high"))
angina$sex <- factor(angina$sex,labels = c("female","male"))

prop.table(table(angina$smoke,angina$sex),2)

##               
##                   female      male
##   no           0.4126984 0.0000000
##   smoking low  0.4603175 0.3783784
##   smoking high 0.1269841 0.6216216

The table shows that men tend to smoke more than women (100% vs 59%), but is this a true difference or just due to chance? We can answer this question by setting the null hypothesis “The proportions of smoking in men and women are equal” and then using the chi-square test to test this null hypothesis.

chisq.test(x = angina$sex,
           y = angina$smoke)

## 
##  Pearson's Chi-squared test
## 
## data:  angina$sex and angina$smoke
## X-squared = 34.031, df = 2, p-value = 4.076e-08

The test shows a p-value of less than 0.01, so we can reject this null hypothesis and conclude that the proportions of smoking in men and women are statistically different.
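
To see where the statistic comes from, the sketch below recomputes it by hand from a table of counts. The counts are not given in the article. They are back-calculated from the column proportions printed above, assuming 63 women and 37 men, so treat them as an assumption. The object names are also made up for illustration.

```r
# Counts back-calculated from the column proportions above,
# ASSUMING 63 women and 37 men (100 participants in total)
smoke_by_sex <- matrix(c(26,  0,
                         29, 14,
                          8, 23),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(smoke = c("no", "smoking low", "smoking high"),
                                       sex   = c("female", "male")))

# Column proportions, for comparison with the prop.table() output above
round(prop.table(smoke_by_sex, 2), 7)

# Pearson statistic by hand: sum over cells of (observed - expected)^2 / expected
expected  <- outer(rowSums(smoke_by_sex), colSums(smoke_by_sex)) / sum(smoke_by_sex)
x_squared <- sum((smoke_by_sex - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df      <- (nrow(smoke_by_sex) - 1) * (ncol(smoke_by_sex) - 1)
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

c(x_squared = x_squared, df = df, p_value = p_value)

# chisq.test() also accepts a table of counts directly
res <- chisq.test(smoke_by_sex)
res$statistic
```

Under this assumption the hand calculation gives a statistic of about 34.03 with 2 degrees of freedom, which agrees with the output of `chisq.test()` shown above.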

Fisher’s exact test
Another useful statistical test for categorical data is Fisher’s exact test, which is used when the sample size is small. Fisher’s exact test tends to give a more conservative p-value.
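
As a minimal sketch of the difference between the two tests, the code below runs both on a very small hypothetical 2x2 table (the counts and names are made up for illustration). With only 8 observations every expected count is below 5, so the chi-square approximation is not reliable.

```r
# Hypothetical 2 x 2 table with only 8 observations (illustration only)
small_tab <- matrix(c(3, 1,
                      1, 3),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(group   = c("A", "B"),
                                    outcome = c("yes", "no")))

# chisq.test() would warn that the approximation may be incorrect;
# the warning is suppressed here so that the expected counts can be inspected
chisq_res <- suppressWarnings(chisq.test(small_tab, correct = FALSE))
chisq_res$expected

# Fisher's exact test uses the exact hypergeometric distribution instead
fisher_res <- fisher.test(small_tab)

# p-values side by side
c(chi_square = chisq_res$p.value, fisher = fisher_res$p.value)
```

For this table the Fisher p-value is larger than the uncorrected chi-square p-value, which illustrates the more conservative behavior described above.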

Example

ecmo <- read_xlsx("dataset_basic_2.xlsx",
                  sheet = "ECMO")
head(ecmo)

## # A tibble: 6 × 6
##   subject treatment result ...4  ...5  `Extracorporeal membrane oxygenation (E…`
##     <dbl> <chr>     <chr>  <lgl> <lgl> <lgl>                                    
## 1       1 CMT       die    NA    NA    NA                                       
## 2       2 ECMO      live   NA    NA    NA                                       
## 3       3 ECMO      live   NA    NA    NA                                       
## 4       4 ECMO      live   NA    NA    NA                                       
## 5       5 ECMO      live   NA    NA    NA                                       
## 6       6 CMT       live   NA    NA    NA

The dataset records the treatment applied and the result of the treatment for 39 participants.
If we want to know whether there is an association between using ECMO and death, we can set a null hypothesis such as “The proportion of deaths among those treated with ECMO is equal to that among those treated with CMT” and then test this null hypothesis. Because this dataset has a small sample size, we will use Fisher’s exact test instead of the chi-square test.

fisher.test(x = ecmo$treatment,
            y = ecmo$result)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  ecmo$treatment and ecmo$result
## p-value = 0.01102
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##    1.366318 944.080411
## sample estimates:
## odds ratio 
##   16.78571

The p-value from the test is 0.011, so we can reject the null hypothesis and conclude that the proportion of deaths among those treated with ECMO is statistically different from that among those treated with CMT. In other words, we can say “There is an association between using ECMO and death.”
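
The pieces of the output can also be pulled out of the result object, which helps when reporting the odds ratio and its confidence interval. The sketch below uses hypothetical counts (not the ECMO data, which is not provided), and the object names are made up. Note that `fisher.test()` reports a conditional maximum likelihood estimate of the odds ratio, which usually differs a little from the simple cross-product ratio.

```r
# Hypothetical counts (illustration only, not the ECMO data)
trt_tab <- matrix(c(5,  5,
                    2, 20),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(treatment = c("CMT", "ECMO"),
                                  result    = c("die", "live")))

res <- fisher.test(trt_tab)

res$p.value    # two-sided p-value
res$estimate   # conditional maximum likelihood estimate of the odds ratio
res$conf.int   # 95% confidence interval for the odds ratio

# Sample (cross-product) odds ratio for comparison
sample_or <- (trt_tab[1, 1] * trt_tab[2, 2]) / (trt_tab[1, 2] * trt_tab[2, 1])
sample_or
```

A very wide confidence interval, like the one in the output above (about 1.4 to 944), is typical of small samples: the data are compatible with anything from a weak to a very strong association.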

McNemar’s test
McNemar’s test is another statistical test for categorical variables, used when the observations are dependent (paired). The assumptions of McNemar’s test are
1. One nominal dependent variable with two categories (i.e. a dichotomous variable such as sick or not sick) and one independent variable with two connected (paired) groups.
2. The categories of the dependent variable must be mutually exclusive.
3. The sample is a random sample.

postsurvey <- read_xlsx("dataset_basic_2.xlsx",
                        sheet = "PostSurvey")
head(postsurvey)

## # A tibble: 6 × 26
##      ID gender   age classification happy sleep_Tues sleep_Sat hair_color
##   <dbl> <chr>  <dbl> <chr>          <dbl>      <dbl>     <dbl> <chr>     
## 1     1 Female    19 Sophomore         89        8        13   brown     
## 2     2 Female    19 Sophomore         90        7        10   black     
## 3     3 Female    18 Freshman          60        9         9   brown     
## 4     4 Female    19 Sophomore        100        8.5       8.5 black     
## 5     5 Female    18 Freshman          85        8        10   black     
## 6     6 Female    19 Sophomore         90        6        10   brown     
## # … with 18 more variables: exclusive <dbl>, greek <chr>, smoke <chr>,
## #   talking_min <dbl>, texts_sent <dbl>, live_campus <chr>, roomates <dbl>,
## #   austin <dbl>, commute <chr>, UT_sport <chr>, major <chr>,
## #   hw_hours_HS <dbl>, hw_hours_college <dbl>, post_happy <dbl>,
## #   post_exclusive <dbl>, post_smoke <chr>, post_talking_min <dbl>,
## #   post_text_sent <dbl>

The dataset records data about the behaviors of students at the beginning of the semester compared with the end. Is there a difference in the proportion of smoking at the beginning compared with the end of the semester? Let us examine our data.

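## Recode "yes" and "only socially" as "yes"; everything else
## (including missing values) becomes "no"
postsurvey <- postsurvey %>%
  mutate(smoke = case_when(smoke %in% c("yes", "only socially") ~ "yes",
                           TRUE ~ "no"),
         post_smoke = case_when(post_smoke %in% c("yes", "only socially") ~ "yes",
                                TRUE ~ "no"))

## Row proportions: smoking at the beginning (x) by smoking at the end (y)
round(prop.table(table(x = postsurvey$smoke,
                       y = postsurvey$post_smoke), 1), 2)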

##      y
## x       no  yes
##   no  0.98 0.02
##   yes 0.11 0.89

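We already know that these observations are paired (dependent), because the same students answered at the beginning and at the end of the semester, so we should use McNemar’s test.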

mcnemar.test(x = postsurvey$smoke,
             y = postsurvey$post_smoke)

## 
##  McNemar's Chi-squared test with continuity correction
## 
## data:  postsurvey$smoke and postsurvey$post_smoke
## McNemar's chi-squared = 0.16667, df = 1, p-value = 0.6831

The p-value from the test above shows that we fail to reject the null hypothesis, meaning that there is no evidence that the proportion of smoking at the beginning of the semester differs from that at the end.
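
McNemar’s test only uses the discordant pairs, that is, the students whose answer changed. The sketch below shows the calculation by hand on hypothetical paired counts. The real cell counts are not shown in the article; these were chosen only so that the row proportions round to those printed above, and the object names are made up for illustration.

```r
# Hypothetical paired counts (illustration only):
# rows = beginning of semester, columns = end of semester
paired_tab <- matrix(c(98,  2,
                        4, 32),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(start = c("no", "yes"),
                                     end   = c("no", "yes")))

# Discordant pairs: students whose answer changed
n_started <- paired_tab["no", "yes"]   # "no" at the start, "yes" at the end
n_stopped <- paired_tab["yes", "no"]   # "yes" at the start, "no" at the end

# McNemar statistic with continuity correction, 1 degree of freedom
stat    <- (abs(n_started - n_stopped) - 1)^2 / (n_started + n_stopped)
p_value <- pchisq(stat, df = 1, lower.tail = FALSE)

c(statistic = stat, p_value = p_value)

# mcnemar.test() also accepts the 2 x 2 table of paired counts
res <- mcnemar.test(paired_tab)
res$statistic
```

The concordant cells (98 and 32 here) do not enter the statistic at all, which is why a large sample can still give a weak test when few students change their answer.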

Jirapanakorn Sutham

2022-07-13