Tabulation of frequencies in R

Introduction

Frequency tabulation is a common statistical method for summarizing data in a manageable form without substantial loss of information. It is applied to categorical variables, where the aim is simply to see how many of the subjects in the sample fall into a given category (or what proportion or percentage they represent). When percentages are presented in these tables, whole numbers (i.e. 0 decimal places) are usually sufficient. On this page we consider the following tabulations.

One way frequency tables
Two way contingency tables (crosstabulations)

Install required packages

The following packages are required and should therefore be loaded first. If they are not already installed, begin by installing them using the install.packages() function, e.g. install.packages("janitor").

library(janitor) # contains tabyl() function for tabulation
library(dplyr) # operations on data frames e.g. %>%
library(arsenal) # tableby() function
library(kableExtra) # display table formatting

Data set

A description of the data set and the variable name labels can be found here. The code below downloads the data set and then displays the first 6 observations and a few columns of this data set.

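# Optional: set the working directory (adjust this path to your own machine)
setwd("D:/stemresearch/R/analysis/descriptive-statistics")
# Download the data set and read it into a data frame
nhanesdata <- readRDS(file = url("http://drmathematics.com/learning/datasets/nhanesdata.RDS"))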

kbl(nhanesdata[1:6, c(1, 2, 11:15)], 
    caption = "Showing first 6 observations and a few variables.") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")

Showing first 6 observations and a few variables.
id	region	highbp	sex	race	age	agegroup
1400	South	No	Male	White	54	50-59
1401	South	No	Female	White	41	40-49
1402	South	No	Female	Other	21	20-29
1404	South	Yes	Female	White	63	60-69
1405	South	No	Female	White	64	60-69
1406	South	Yes	Female	White	63	60-69

One way frequency table

A one-way frequency table is a tabulation that examines only one categorical variable at a time. It displays the data as frequency counts and/or relative frequencies (relative frequencies are converted to percentages by multiplying them by 100%).

table1 = tabyl(nhanesdata, region) %>% 
    adorn_totals("row") %>%
    adorn_pct_formatting(digits = 0)
names(table1) = c("Region", "Frequency", "Percent")

kbl(table1, 
    caption = "Table 1: Distribution of participants by region.") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")

Table 1: Distribution of participants by region.
Region	Frequency	Percent
North East	2086	20%
North West	2773	27%
South	2853	28%
West	2625	25%
Total	10337	100%

In the one-way frequency table above, the Region column lists the four categories of the variable, the Frequency column gives the observed count in each region, and the Percent column gives the corresponding percentage (i.e. the count in a category divided by the total count, \(\times\) 100). For example, the 27% for the North West region is found as follows:

\[ \text{Percent} = \dfrac{2773}{10337} \times 100\% = 26.825965\% \approx 27\% \]
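The same arithmetic can be checked in base R. The short sketch below is only an illustration: it types in the counts from Table 1 by hand rather than reading them from nhanesdata, and the object names (counts, rel_freq, percent) are chosen here for the example.

```r
# Frequency counts copied by hand from Table 1
counts <- c("North East" = 2086, "North West" = 2773,
            "South" = 2853, "West" = 2625)

# Relative frequency: each count divided by the total count
rel_freq <- counts / sum(counts)   # same result as prop.table(counts)

# Convert to percentages and round to whole numbers
percent <- round(100 * rel_freq)
print(percent)
```

Working from the raw relative frequencies and rounding only at the end avoids compounding rounding errors.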

Two way contingency tables (crosstabulation)

A two-way contingency table (also called a two-way table or simply a contingency table) displays data from two categorical variables, where one variable appears on the rows and the other on the columns. The code below displays the counts of respondents in the data set by region (North East, North West, South and West) and health status (Poor, Fair, Average, Good and Excellent).

table2 = tabyl(nhanesdata, region, health) %>%
    adorn_totals(c("row", "col"))
names(table2)[1] = "Region"

kbl(table2, align = c("l", rep("r", times = ncol(table2)-1)),
    caption = "Table 2: Distribution of participants by region and health status.") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")

Table 2: Distribution of participants by region and health status.
Region	Poor	Fair	Average	Good	Excellent	Total
North East	77	257	632	558	562	2086
North West	167	419	736	721	730	2773
South	317	532	807	651	546	2853
West	168	462	765	661	569	2625
Total	729	1670	2940	2591	2407	10337
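As a cross-check on the totals, the counts in Table 2 can be entered by hand and the margins added with base R's addmargins(). This is a sketch that does not use nhanesdata or janitor; the object names are chosen for the example, and base R labels the margins "Sum" rather than "Total".

```r
# Cell counts copied by hand from Table 2 (rows = region, columns = health)
counts <- as.table(matrix(
  c( 77, 257, 632, 558, 562,
    167, 419, 736, 721, 730,
    317, 532, 807, 651, 546,
    168, 462, 765, 661, 569),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Region = c("North East", "North West", "South", "West"),
    Health = c("Poor", "Fair", "Average", "Good", "Excellent")
  )
))

# Append a "Sum" row and a "Sum" column holding the marginal totals
with_totals <- addmargins(counts)
print(with_totals)
```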

Adding row percentages to a two way contingency table

Row percentages are computed by dividing the count in an individual cell by the total count for the row in which that cell falls. A row percentage therefore shows how the subjects in a given row are distributed across the columns. For example, the 4% in the first cell of Table 3 is calculated as follows.

\[ \text{Percent} = \dfrac{77}{2086} \times 100\% = 3.691275\% \approx 4\% \]

We can thus say that, based on this sample, an estimated 4% of the people from the North East region have poor health.

table3 = tabyl(nhanesdata, region, health) %>%
    adorn_totals(c("row", "col")) %>%
    adorn_percentages("row") %>% 
    adorn_pct_formatting(digits = 0) %>%
    adorn_ns(position = "front") %>%
    adorn_title("combined")
names(table3)[1] = c("Region")

kbl(table3, align = c("l", rep("r", times = ncol(table3)-1)),
    caption = "Table 3: Distribution of participants by region and health status (row percentages).") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")

Table 3: Distribution of participants by region and health status (row percentages).
Region	Poor	Fair	Average	Good	Excellent	Total
North East	77 (4%)	257 (12%)	632 (30%)	558 (27%)	562 (27%)	2086 (100%)
North West	167 (6%)	419 (15%)	736 (27%)	721 (26%)	730 (26%)	2773 (100%)
South	317 (11%)	532 (19%)	807 (28%)	651 (23%)	546 (19%)	2853 (100%)
West	168 (6%)	462 (18%)	765 (29%)	661 (25%)	569 (22%)	2625 (100%)
Total	729 (7%)	1670 (16%)	2940 (28%)	2591 (25%)	2407 (23%)	10337 (100%)
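The row percentages can also be verified in base R with prop.table(), using margin = 1 to divide each cell by its row total. The sketch below re-enters the counts from Table 2 by hand; it is an independent check, not the janitor workflow used above, and the object names are illustrative.

```r
# Cell counts copied by hand from Table 2 (rows = region, columns = health)
counts <- matrix(
  c( 77, 257, 632, 558, 562,
    167, 419, 736, 721, 730,
    317, 532, 807, 651, 546,
    168, 462, 765, 661, 569),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Region = c("North East", "North West", "South", "West"),
    Health = c("Poor", "Fair", "Average", "Good", "Excellent")
  )
)

# margin = 1: each cell is divided by the total of its own row
row_prop <- prop.table(counts, margin = 1)

# Whole-number row percentages
row_pct <- round(100 * row_prop)
print(row_pct)
```

Each row of row_prop sums to 1, which is why the Total column in Table 3 shows 100% for every region.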

Adding column percentages to a two way contingency table

Column percentages, on the other hand, are computed by dividing the count in an individual cell by the total count for the column in which that cell falls. A column percentage therefore shows how the subjects in a given column are distributed across the rows. For example, the 11% in the first cell of Table 4 is calculated as follows.

\[ \text{Percent} = \dfrac{77}{729} \times 100\% = 10.56241\% \approx 11\% \]

We can thus say that, based on this sample, an estimated 11% of the people with poor health come from the North East region.

table4 = tabyl(nhanesdata, region, health) %>%
    adorn_totals(c("row", "col")) %>%
    adorn_percentages("col") %>% 
    adorn_pct_formatting(digits = 0) %>%
    adorn_ns(position = "front") %>%
    adorn_title("combined")
names(table4)[1] = c("Region")

kbl(table4, align = c("l", rep("r", times = ncol(table4)-1)),
    caption = "Table 4: Distribution of participants by region and health status (column percentages).") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")

Table 4: Distribution of participants by region and health status (column percentages).
Region	Poor	Fair	Average	Good	Excellent	Total
North East	77 (11%)	257 (15%)	632 (21%)	558 (22%)	562 (23%)	2086 (20%)
North West	167 (23%)	419 (25%)	736 (25%)	721 (28%)	730 (30%)	2773 (27%)
South	317 (43%)	532 (32%)	807 (27%)	651 (25%)	546 (23%)	2853 (28%)
West	168 (23%)	462 (28%)	765 (26%)	661 (26%)	569 (24%)	2625 (25%)
Total	729 (100%)	1670 (100%)	2940 (100%)	2591 (100%)	2407 (100%)	10337 (100%)
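The column percentages can be checked in the same way with prop.table(), this time with margin = 2 so that each cell is divided by its column total. As before, this sketch re-enters the counts from Table 2 by hand and uses illustrative object names.

```r
# Cell counts copied by hand from Table 2 (rows = region, columns = health)
counts <- matrix(
  c( 77, 257, 632, 558, 562,
    167, 419, 736, 721, 730,
    317, 532, 807, 651, 546,
    168, 462, 765, 661, 569),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Region = c("North East", "North West", "South", "West"),
    Health = c("Poor", "Fair", "Average", "Good", "Excellent")
  )
)

# margin = 2: each cell is divided by the total of its own column
col_prop <- prop.table(counts, margin = 2)

# Whole-number column percentages
col_pct <- round(100 * col_prop)
print(col_pct)
```

The only difference from the row-percentage calculation is the margin argument, so it is worth stating in the table caption which of the two is being reported.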

Calculating p values for two categorical variables

Tests for categorical data are used to determine whether an association (or relationship) between two categorical variables observed in a sample is likely to reflect a real association between those variables in the population from which the sample was drawn. Two statistical techniques commonly used to test the association between categorical variables are:

Chi-square test of independence
Fisher’s exact test

Begin by labeling a few variables of interest in the nhanesdata data frame. This step simply ensures that the output table has labels that are easy to understand. If no labels are assigned, the variable names are used as labels in the output table. Remember that variable names and variable labels are different things: a variable label is a description that gives more information about a variable in the data set.

# Assign descriptive labels (uses the labels<- replacement function from arsenal)
labels(nhanesdata) = c(sex = "Sex", race = "Race", 
                       region = "Region", health = "Health status",
                       age = "Age in years", cholesterol = "Serum cholesterol", 
                       bmi = "Body mass index")

The code below generates a crosstabulation of sex by three other categorical variables: race, region and health status. It also displays the p-value used to decide whether the association between sex and each of the three other variables is statistically significant. Interpretation of statistical significance is discussed in the section Hypothesis testing and p-values.

tab5 = tableby(sex ~ race + region + health, 
               data = nhanesdata)
summary(tab5, digits.p = 3, digits.pct = 1, 
        title = 'Table 5: Crosstabulation of sex with race, region and health status of respondents.', 
        pfootnote = TRUE)

Table 5: Crosstabulation of sex with race, region and health status of respondents.
	Male (N=4909)	Female (N=5428)	Total (N=10337)	p value
Race				0.328¹
White	4306 (87.7%)	4745 (87.4%)	9051 (87.6%)
Black	500 (10.2%)	586 (10.8%)	1086 (10.5%)
Other	103 (2.1%)	97 (1.8%)	200 (1.9%)
Region				0.604¹
North East	1013 (20.6%)	1073 (19.8%)	2086 (20.2%)
North West	1310 (26.7%)	1463 (27.0%)	2773 (26.8%)
South	1332 (27.1%)	1521 (28.0%)	2853 (27.6%)
West	1254 (25.5%)	1371 (25.3%)	2625 (25.4%)
Health status				< 0.001¹
Poor	382 (7.8%)	347 (6.4%)	729 (7.1%)
Fair	722 (14.7%)	948 (17.5%)	1670 (16.2%)
Average	1340 (27.3%)	1600 (29.5%)	2940 (28.4%)
Good	1213 (24.7%)	1378 (25.4%)	2591 (25.1%)
Excellent	1252 (25.5%)	1155 (21.3%)	2407 (23.3%)

¹ Pearson’s Chi-squared test
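To see where such a p-value comes from, the Race rows of Table 5 can be passed to base R's chisq.test(). This is a sketch under the assumption that the counts are typed in by hand from the table rather than taken from nhanesdata; the object names are illustrative.

```r
# Counts copied by hand from the Race rows of Table 5
race_by_sex <- matrix(
  c(4306, 4745,
     500,  586,
     103,   97),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Race = c("White", "Black", "Other"),
    Sex  = c("Male", "Female")
  )
)

# Pearson's chi-squared test of independence
# (the continuity correction applies only to 2 x 2 tables, so none is used here)
result <- chisq.test(race_by_sex)
print(result)

# p-value rounded to 3 decimal places, for comparison with Table 5
round(result$p.value, 3)
```

The test has (3 - 1) × (2 - 1) = 2 degrees of freedom, one fewer than the number of rows multiplied by one fewer than the number of columns.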

Specifying a test

By default, the Pearson chi-square test (option chisq) is applied to categorical variables. You can instead request Fisher’s exact test for a variable by wrapping it in fe(), as shown for race in the code below.

tab6 = tableby(sex ~ fe(race) + region + health, 
               data = nhanesdata)
summary(tab6, digits.p = 3, digits.pct = 1, 
        title = 'Table 6: Crosstabulation of sex with race, region and health status of respondents.', 
        pfootnote = TRUE)

Table 6: Crosstabulation of sex with race, region and health status of respondents.
	Male (N=4909)	Female (N=5428)	Total (N=10337)	p value
Race				0.327¹
White	4306 (87.7%)	4745 (87.4%)	9051 (87.6%)
Black	500 (10.2%)	586 (10.8%)	1086 (10.5%)
Other	103 (2.1%)	97 (1.8%)	200 (1.9%)
Region				0.604²
North East	1013 (20.6%)	1073 (19.8%)	2086 (20.2%)
North West	1310 (26.7%)	1463 (27.0%)	2773 (26.8%)
South	1332 (27.1%)	1521 (28.0%)	2853 (27.6%)
West	1254 (25.5%)	1371 (25.3%)	2625 (25.4%)
Health status				< 0.001²
Poor	382 (7.8%)	347 (6.4%)	729 (7.1%)
Fair	722 (14.7%)	948 (17.5%)	1670 (16.2%)
Average	1340 (27.3%)	1600 (29.5%)	2940 (28.4%)
Good	1213 (24.7%)	1378 (25.4%)	2591 (25.1%)
Excellent	1252 (25.5%)	1155 (21.3%)	2407 (23.3%)

¹ Fisher’s Exact Test for Count Data
² Pearson’s Chi-squared test

It is important to note that Fisher’s exact test cannot always be computed, particularly for larger tables with many observations. For example, requesting it for region or health, i.e. specifying fe(region) or fe(health), results in an error with this data set.
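One possible workaround, which is not part of the analysis above, is to approximate the p-value of Fisher's test by Monte Carlo simulation using the simulate.p.value argument of base R's fisher.test(). The sketch below assumes the Region counts are typed in by hand from Table 5; the seed and the number of replicates B are arbitrary choices for illustration, and the result is a simulated approximation rather than an exact p-value.

```r
# Counts copied by hand from the Region rows of Table 5
region_by_sex <- matrix(
  c(1013, 1073,
    1310, 1463,
    1332, 1521,
    1254, 1371),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Region = c("North East", "North West", "South", "West"),
    Sex    = c("Male", "Female")
  )
)

# Fix the random seed so the simulated p-value is reproducible
set.seed(123)

# Monte Carlo approximation to Fisher's test with B simulated tables
result <- fisher.test(region_by_sex, simulate.p.value = TRUE, B = 10000)
result$p.value
```

Because the p-value is simulated, it varies slightly from run to run unless the seed is fixed, and a larger B gives a more stable estimate at the cost of computation time.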

STEM Research
https://stemresearchs.com