Introduction

Frequency tabulation is a common statistical method for summarizing data in a manageable form without substantial loss of information. It is applied to categorical variables, where the aim is simply to see how many of the subjects in the sample fall in a given category (or what proportion or percentage they represent). When percentages are presented, they should be given to 0 decimal places (i.e. as whole numbers). On this page we consider the following tabulations: one-way frequency tables, two-way contingency tables (with row and column percentages), and crosstabulations with p-values for the association between two categorical variables.

Install required packages


The following packages are required and should be loaded first. If they are not already installed, install them with the install.packages() function, e.g. install.packages("janitor").

library(janitor) # contains tabyl() function for tabulation
library(dplyr) # operations on data frames e.g. %>%
library(arsenal) # tableby() function
library(kableExtra) # display table formatting
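The installation step can also be automated. The sketch below is one possible approach rather than part of the original workflow: the helper name missing_pkgs is hypothetical, and the interactive() guard is an assumption added so that nothing is installed when the script runs unattended.

```r
# Packages used on this page
required_pkgs <- c("janitor", "dplyr", "arsenal", "kableExtra")

# Hypothetical helper: return the packages that are not yet installed
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(required_pkgs)

# Install only what is missing, and only in an interactive session
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```

After this runs, the library() calls above should load without error, provided the installation succeeded.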

Data set


A description of the data set and the variable name labels can be found here. The code below downloads the data set and then displays the first 6 observations and a few columns of this data set.

setwd("D:/stemresearch/R/analysis/descriptive-statistics")
nhanesdata <- readRDS(file = url("http://drmathematics.com/learning/datasets/nhanesdata.RDS"))

kbl(nhanesdata[1:6, c(1, 2, 11:15)], 
    caption = "Showing first 6 observations and a few variables.") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Showing first 6 observations and a few variables.

| id | region | highbp | sex | race | age | agegroup |
|---|---|---|---|---|---|---|
| 1400 | South | No | Male | White | 54 | 50-59 |
| 1401 | South | No | Female | White | 41 | 40-49 |
| 1402 | South | No | Female | Other | 21 | 20-29 |
| 1404 | South | Yes | Female | White | 63 | 60-69 |
| 1405 | South | No | Female | White | 64 | 60-69 |
| 1406 | South | Yes | Female | White | 63 | 60-69 |

One-way frequency table


A one-way frequency table is a tabulation that examines only one categorical variable at a time. It displays categorical data as frequency counts and/or relative frequencies (relative frequencies are converted to percentages by multiplying them by 100).

table1 = tabyl(nhanesdata, region) %>% 
    adorn_totals("row") %>%
    adorn_pct_formatting(digits = 0)
names(table1) = c("Region", "Frequency", "Percent")

kbl(table1, 
    caption = "Table 1: Distribution of participants by region.") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Table 1: Distribution of participants by region.

| Region | Frequency | Percent |
|---|---|---|
| North East | 2086 | 20% |
| North West | 2773 | 27% |
| South | 2853 | 28% |
| West | 2625 | 25% |
| Total | 10337 | 100% |

In the one-way frequency table above, the Region column lists the four categories of the variable, the Frequency column gives the frequency counts (i.e. the number of participants observed in each region), and the Percent column gives the percentages (i.e. the count in a category divided by the total frequency, multiplied by 100). For example, the 27% for the North West region is found as follows:

\[ \text{Percent} = \dfrac{2773}{10337} \times 100\% = 26.825965\% \approx 27\% \]
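The same arithmetic can be checked in base R without janitor. The sketch below is illustrative only: it rebuilds the region variable from the counts printed in Table 1 instead of reading nhanesdata, and the object names (region_counts, freq, pct, one_way) are assumptions made for the example.

```r
# Counts copied from Table 1, used to rebuild the raw variable
region_counts <- c("North East" = 2086, "North West" = 2773,
                   "South" = 2853, "West" = 2625)
region <- factor(rep(names(region_counts), times = region_counts),
                 levels = names(region_counts))

freq <- table(region)                  # frequency counts
pct  <- round(100 * prop.table(freq))  # percentages to 0 decimal places

one_way <- data.frame(Region    = names(freq),
                      Frequency = as.vector(freq),
                      Percent   = paste0(as.vector(pct), "%"))
print(one_way)
```

Here prop.table() divides each count by the total, which is the same calculation as in the equation above.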

Two-way contingency tables (crosstabulation)


A two-way contingency table (also called a two-way table or just a contingency table) displays data from two categorical variables, with one variable on the rows and the other on the columns. The code below displays the counts of respondents in the data set by region (North East, North West, South and West) and health status (Poor, Fair, Average, Good and Excellent).

table2 = tabyl(nhanesdata, region, health) %>%
    adorn_totals(c("row", "col"))
names(table2)[1] = "Region"

kbl(table2, align = c("l", rep("r", times = ncol(table2)-1)),
    caption = "Table 2: Distribution of participants by region and health status.") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Table 2: Distribution of participants by region and health status.

| Region | Poor | Fair | Average | Good | Excellent | Total |
|---|---:|---:|---:|---:|---:|---:|
| North East | 77 | 257 | 632 | 558 | 562 | 2086 |
| North West | 167 | 419 | 736 | 721 | 730 | 2773 |
| South | 317 | 532 | 807 | 651 | 546 | 2853 |
| West | 168 | 462 | 765 | 661 | 569 | 2625 |
| Total | 729 | 1670 | 2940 | 2591 | 2407 | 10337 |
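To see how the row and column totals arise, the table can be rebuilt in base R. This is a minimal sketch that assumes the cell counts are typed in from Table 2 instead of being tabulated from nhanesdata; the names counts and with_totals are illustrative.

```r
# Cell counts copied from Table 2 (rows = region, columns = health status)
counts <- as.table(matrix(
  c( 77, 257, 632, 558, 562,
    167, 419, 736, 721, 730,
    317, 532, 807, 651, 546,
    168, 462, 765, 661, 569),
  nrow = 4, byrow = TRUE,
  dimnames = list(Region = c("North East", "North West", "South", "West"),
                  Health = c("Poor", "Fair", "Average", "Good", "Excellent"))
))

# addmargins() appends a "Sum" row and a "Sum" column
with_totals <- addmargins(counts)
print(with_totals)
```

The "Sum" margins play the same role as the Total row and column that adorn_totals(c("row", "col")) adds in the janitor code.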

Adding row percentages to a two-way contingency table


Row percentages are computed by dividing the count in a cell by the total count for the row in which that cell falls. A row percentage shows, among the subjects in a given row, the percentage that fall in each column. For example, the 4% in the first cell of Table 3 is calculated as follows:

\[ \text{Percent} = \dfrac{77}{2086} \times 100\% = 3.691275\% \approx 4\% \]

We can thus say that, based on this sample, an estimated 4% of the people from the North East region have poor health.

table3 = tabyl(nhanesdata, region, health) %>%
    adorn_totals(c("row", "col")) %>%
    adorn_percentages("row") %>% 
    adorn_pct_formatting(digits = 0) %>%
    adorn_ns(position = "front") %>%
    adorn_title("combined")
names(table3)[1] = c("Region")

kbl(table3, align = c("l", rep("r", times = ncol(table3)-1)),
    caption = "Table 3: Distribution of participants by region and health status (row percentages).") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Table 3: Distribution of participants by region and health status (row percentages).

| Region | Poor | Fair | Average | Good | Excellent | Total |
|---|---:|---:|---:|---:|---:|---:|
| North East | 77 (4%) | 257 (12%) | 632 (30%) | 558 (27%) | 562 (27%) | 2086 (100%) |
| North West | 167 (6%) | 419 (15%) | 736 (27%) | 721 (26%) | 730 (26%) | 2773 (100%) |
| South | 317 (11%) | 532 (19%) | 807 (28%) | 651 (23%) | 546 (19%) | 2853 (100%) |
| West | 168 (6%) | 462 (18%) | 765 (29%) | 661 (25%) | 569 (22%) | 2625 (100%) |
| Total | 729 (7%) | 1670 (16%) | 2940 (28%) | 2591 (25%) | 2407 (23%) | 10337 (100%) |

Adding column percentages to a two-way contingency table


Column percentages, on the other hand, are computed by dividing the count in a cell by the total count for the column in which that cell falls. A column percentage shows, among the subjects in a given column, the percentage that fall in each row. For example, the 11% in the first cell of Table 4 is calculated as follows:

\[ \text{Percent} = \dfrac{77}{729} \times 100\% = 10.56241\% \approx 11\% \]

We can thus say that, based on this sample, an estimated 11% of the people with poor health come from the North East region.

table4 = tabyl(nhanesdata, region, health) %>%
    adorn_totals(c("row", "col")) %>%
    adorn_percentages("col") %>% 
    adorn_pct_formatting(digits = 0) %>%
    adorn_ns(position = "front") %>%
    adorn_title("combined")
names(table4)[1] = c("Region")

kbl(table4, align = c("l", rep("r", times = ncol(table4)-1)),
    caption = "Table 4: Distribution of participants by region and health status (column percentages).") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Table 4: Distribution of participants by region and health status (column percentages).

| Region | Poor | Fair | Average | Good | Excellent | Total |
|---|---:|---:|---:|---:|---:|---:|
| North East | 77 (11%) | 257 (15%) | 632 (21%) | 558 (22%) | 562 (23%) | 2086 (20%) |
| North West | 167 (23%) | 419 (25%) | 736 (25%) | 721 (28%) | 730 (30%) | 2773 (27%) |
| South | 317 (43%) | 532 (32%) | 807 (27%) | 651 (25%) | 546 (23%) | 2853 (28%) |
| West | 168 (23%) | 462 (28%) | 765 (26%) | 661 (26%) | 569 (24%) | 2625 (25%) |
| Total | 729 (100%) | 1670 (100%) | 2940 (100%) | 2591 (100%) | 2407 (100%) | 10337 (100%) |
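The difference between row and column percentages comes down to which total is used as the denominator. The base R sketch below illustrates this with prop.table(), whose margin argument selects rows (1) or columns (2). The counts are typed in from the tables above and the object names are illustrative, so this is a cross-check on the arithmetic, not the method used to build Tables 3 and 4.

```r
# Cell counts for region (rows) by health status (columns)
counts <- matrix(
  c( 77, 257, 632, 558, 562,
    167, 419, 736, 721, 730,
    317, 532, 807, 651, 546,
    168, 462, 765, 661, 569),
  nrow = 4, byrow = TRUE,
  dimnames = list(Region = c("North East", "North West", "South", "West"),
                  Health = c("Poor", "Fair", "Average", "Good", "Excellent"))
)

# margin = 1: divide each cell by its row total (row percentages)
row_pct <- round(100 * prop.table(counts, margin = 1))

# margin = 2: divide each cell by its column total (column percentages)
col_pct <- round(100 * prop.table(counts, margin = 2))

print(row_pct)
print(col_pct)
```

The same cell (North East, Poor) gives 4% as a row percentage and 11% as a column percentage, which is why the two tables answer different questions.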

Calculating p values for two categorical variables


Tests for categorical data are used to determine whether an association (or relationship) between two categorical variables in a sample is likely to reflect a real association between these variables in the population from which the representative sample was drawn. Two statistical techniques commonly used to test the association between categorical variables are Pearson's chi-squared test and Fisher's exact test.

Begin by labeling a few variables of interest in the nhanesdata data frame. This step just ensures that the output table has easy-to-understand labels. If no labels are assigned, the variable names are used as labels in the output table. Remember that variable names and variable labels are different things: a variable label is a description that gives more information about a variable in the data set.

labels(nhanesdata)  = c(sex = 'Sex', race = "Race", 
                        region = "Region", health = "Health status",
                        age = "Age in years", cholesterol = "Serum cholesterol", 
                        bmi = "Body mass index")

The code below generates a crosstabulation of sex by three other categorical variables: race, region and health status. It also displays the p-value used to judge the statistical significance of the association between sex and each of the three variables. Interpretation of statistical significance is discussed in the section Hypothesis testing and p-values.

tab5 = tableby(sex ~ race + region + health, 
               data = nhanesdata)
summary(tab5, digits.p = 3, digits.pct = 1, 
        title = 'Table 5: Crosstabulation of sex with race, region and health status of respondents.', 
        pfootnote = TRUE)
Table 5: Crosstabulation of sex with race, region and health status of respondents.

| | Male (N=4909) | Female (N=5428) | Total (N=10337) | p value |
|---|---:|---:|---:|---:|
| Race | | | | 0.328¹ |
| White | 4306 (87.7%) | 4745 (87.4%) | 9051 (87.6%) | |
| Black | 500 (10.2%) | 586 (10.8%) | 1086 (10.5%) | |
| Other | 103 (2.1%) | 97 (1.8%) | 200 (1.9%) | |
| Region | | | | 0.604¹ |
| North East | 1013 (20.6%) | 1073 (19.8%) | 2086 (20.2%) | |
| North West | 1310 (26.7%) | 1463 (27.0%) | 2773 (26.8%) | |
| South | 1332 (27.1%) | 1521 (28.0%) | 2853 (27.6%) | |
| West | 1254 (25.5%) | 1371 (25.3%) | 2625 (25.4%) | |
| Health status | | | | < 0.001¹ |
| Poor | 382 (7.8%) | 347 (6.4%) | 729 (7.1%) | |
| Fair | 722 (14.7%) | 948 (17.5%) | 1670 (16.2%) | |
| Average | 1340 (27.3%) | 1600 (29.5%) | 2940 (28.4%) | |
| Good | 1213 (24.7%) | 1378 (25.4%) | 2591 (25.1%) | |
| Excellent | 1252 (25.5%) | 1155 (21.3%) | 2407 (23.3%) | |

  1. Pearson’s Chi-squared test
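As a cross-check on the Race p-value in Table 5, the same test can be run in base R on the printed counts. This is only a sketch: the matrix is typed in from Table 5 and not computed from nhanesdata, and the object names race_by_sex and chi are illustrative. It should give a p-value of roughly 0.328.

```r
# Race (rows) by sex (columns), counts copied from Table 5
race_by_sex <- matrix(
  c(4306, 4745,
     500,  586,
     103,   97),
  nrow = 3, byrow = TRUE,
  dimnames = list(Race = c("White", "Black", "Other"),
                  Sex  = c("Male", "Female"))
)

# Pearson's chi-squared test of independence
# (no continuity correction is applied because the table is larger than 2 x 2)
chi <- chisq.test(race_by_sex)

print(chi$expected)          # counts expected if race and sex were independent
print(round(chi$p.value, 3)) # compare with the Race row of Table 5
```

The test compares the observed counts with the expected counts shown by chi$expected; the small differences here are consistent with the large p-value.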

Specifying a test


By default, the Pearson chi-square test (option chisq) is applied to categorical variables. You can request Fisher’s exact test instead by wrapping a variable in fe(), as shown in the code below.

tab6 = tableby(sex ~ fe(race) + region + health, 
               data = nhanesdata)
summary(tab6, digits.p = 3, digits.pct = 1, 
        title = 'Table 6: Crosstabulation of sex with race, region and health status of respondents.', 
        pfootnote = TRUE)
Table 6: Crosstabulation of sex with race, region and health status of respondents.

| | Male (N=4909) | Female (N=5428) | Total (N=10337) | p value |
|---|---:|---:|---:|---:|
| Race | | | | 0.327¹ |
| White | 4306 (87.7%) | 4745 (87.4%) | 9051 (87.6%) | |
| Black | 500 (10.2%) | 586 (10.8%) | 1086 (10.5%) | |
| Other | 103 (2.1%) | 97 (1.8%) | 200 (1.9%) | |
| Region | | | | 0.604² |
| North East | 1013 (20.6%) | 1073 (19.8%) | 2086 (20.2%) | |
| North West | 1310 (26.7%) | 1463 (27.0%) | 2773 (26.8%) | |
| South | 1332 (27.1%) | 1521 (28.0%) | 2853 (27.6%) | |
| West | 1254 (25.5%) | 1371 (25.3%) | 2625 (25.4%) | |
| Health status | | | | < 0.001² |
| Poor | 382 (7.8%) | 347 (6.4%) | 729 (7.1%) | |
| Fair | 722 (14.7%) | 948 (17.5%) | 1670 (16.2%) | |
| Average | 1340 (27.3%) | 1600 (29.5%) | 2940 (28.4%) | |
| Good | 1213 (24.7%) | 1378 (25.4%) | 2591 (25.1%) | |
| Excellent | 1252 (25.5%) | 1155 (21.3%) | 2407 (23.3%) | |

  1. Fisher’s Exact Test for Count Data
  2. Pearson’s Chi-squared test

Note that Fisher’s exact test sometimes fails to compute. For example, requesting it for region with fe(region) or for health with fe(health) results in an error.
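When the exact calculation fails, one possible workaround in base R is to estimate the Fisher p-value by Monte Carlo simulation, using the simulate.p.value and B arguments of fisher.test(). The sketch below is an assumption-laden illustration and not part of the tableby() workflow above: the region-by-sex counts are typed in from Table 5, the seed and number of replicates are arbitrary choices, and the simulated p-value will vary slightly with both.

```r
# Region (rows) by sex (columns), counts copied from Table 5
region_by_sex <- matrix(
  c(1013, 1073,
    1310, 1463,
    1332, 1521,
    1254, 1371),
  nrow = 4, byrow = TRUE,
  dimnames = list(Region = c("North East", "North West", "South", "West"),
                  Sex    = c("Male", "Female"))
)

set.seed(1)  # arbitrary seed so the simulation is reproducible

# Monte Carlo estimate of the Fisher exact p-value, based on B random tables
fe_sim <- fisher.test(region_by_sex, simulate.p.value = TRUE, B = 2000)
print(fe_sim$p.value)
```

With a sample this large, the simulated p-value would be expected to lie close to the chi-squared p-value reported for Region, so the conclusion about the association should not change.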


STEM Research
https://stemresearchs.com