Frequency tabulation is a common statistical method for summarizing data in a manageable form without substantial loss of information. The technique applies to categorical variables, where the aim is simply to see how many subjects in the sample fall into each category (or what proportion or percentage they make up). When percentages are presented, they should be rounded to 0 decimal places (i.e. whole numbers). This page covers the following tabulations:
One way frequency tables
Two way contingency tables (crosstabulations)
The following packages are required and should be loaded first. If they are not already installed, install them with the install.packages() function, e.g. install.packages("janitor").
library(janitor) # contains tabyl() function for tabulation
library(dplyr) # operations on data frames e.g. %>%
library(arsenal) # tableby() function
library(kableExtra) # display table formatting
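As a small convenience, the install step can be limited to packages that are actually missing. The sketch below is only an illustration: `missing_packages()` is a hypothetical helper name, not a function from any of the packages above, and the `interactive()` guard is an assumption so that nothing is installed when the script runs unattended.

```r
# Hypothetical helper (not part of any package): return the entries of `pkgs`
# that are not found in the current R library.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

needed <- c("janitor", "dplyr", "arsenal", "kableExtra")
to_install <- missing_packages(needed)
print(to_install)

# Install only what is missing, and only in an interactive session.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```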
A description of the data set and the variable name labels can be found here. The code below downloads the data set and then displays the first 6 observations and a few columns of this data set.
setwd("D:/stemresearch/R/analysis/descriptive-statistics") # author's local path; change to your own working directory
nhanesdata <- readRDS(file = url("http://drmathematics.com/learning/datasets/nhanesdata.RDS"))
kbl(nhanesdata[1:6, c(1, 2, 11:15)],
caption = "Showing first 6 observations and a few variables.") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
id | region | highbp | sex | race | age | agegroup |
---|---|---|---|---|---|---|
1400 | South | No | Male | White | 54 | 50-59 |
1401 | South | No | Female | White | 41 | 40-49 |
1402 | South | No | Female | Other | 21 | 20-29 |
1404 | South | Yes | Female | White | 63 | 60-69 |
1405 | South | No | Female | White | 64 | 60-69 |
1406 | South | Yes | Female | White | 63 | 60-69 |
A one-way frequency table tabulates one categorical variable at a time. It displays the data as frequency counts and/or relative frequencies (relative frequencies are converted to percentages by multiplying them by 100%).
table1 = tabyl(nhanesdata, region) %>%
adorn_totals("row") %>%
adorn_pct_formatting(digits = 0)
names(table1) = c("Region", "Frequency", "Percent")
kbl(table1,
caption = "Table 1: Distribution of participants by region.") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Region | Frequency | Percent |
---|---|---|
North East | 2086 | 20% |
North West | 2773 | 27% |
South | 2853 | 28% |
West | 2625 | 25% |
Total | 10337 | 100% |
In the one-way frequency table above, the first column, Region, lists the four categories of the variable; the Frequency column gives the frequency counts (i.e. the observed frequencies in each of the four regions); and the Percent column gives the percentages (i.e. the count in a category divided by the total frequency \(\times\) 100). For example, the 27% for the North West region is found as follows:
\[ \text{Percent} = \dfrac{2773}{10337} \times 100\% = 26.825965\% \approx 27\% \]
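The same arithmetic can be checked for all four regions in base R. This is a minimal sketch that types in the counts from Table 1 by hand rather than reading the data set; the object names `counts`, `percent` and `table1_check` are chosen here for illustration only.

```r
# Frequency counts copied from Table 1
counts <- c("North East" = 2086, "North West" = 2773, "South" = 2853, "West" = 2625)

# Percent = count in a category / total frequency * 100
percent <- 100 * counts / sum(counts)

table1_check <- data.frame(
  Region    = names(counts),
  Frequency = unname(counts),
  Percent   = paste0(round(percent), "%")
)
print(table1_check)
```

With the raw data, the counts themselves would come from something like `table(nhanesdata$region)`, and `prop.table()` on that result gives the relative frequencies.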
A two-way contingency table (also called a two-way table or just a contingency table) displays data from two categorical variables, with one variable on the rows and the other on the columns. The code below displays the counts of respondents in the data set by region (North East, North West, South and West) and health status (Poor, Fair, Average, Good and Excellent).
table2 = tabyl(nhanesdata, region, health) %>%
adorn_totals(c("row", "col"))
names(table2)[1] = "Region"
kbl(table2, align = c("l", rep("r", times = ncol(table2)-1)),
caption = "Table 2: Distribution of participants by region and health status.") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Region | Poor | Fair | Average | Good | Excellent | Total |
---|---|---|---|---|---|---|
North East | 77 | 257 | 632 | 558 | 562 | 2086 |
North West | 167 | 419 | 736 | 721 | 730 | 2773 |
South | 317 | 532 | 807 | 651 | 546 | 2853 |
West | 168 | 462 | 765 | 661 | 569 | 2625 |
Total | 729 | 1670 | 2940 | 2591 | 2407 | 10337 |
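The row and column totals in Table 2 can be reproduced in base R with `addmargins()`. The sketch below enters the cell counts from Table 2 by hand; the object name `health_by_region` is an assumption made for this example. Note that `addmargins()` labels the totals `Sum` rather than `Total`.

```r
# Cell counts copied from Table 2 (rows = region, columns = health status)
health_by_region <- matrix(
  c( 77, 257, 632, 558, 562,
    167, 419, 736, 721, 730,
    317, 532, 807, 651, 546,
    168, 462, 765, 661, 569),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Region = c("North East", "North West", "South", "West"),
    Health = c("Poor", "Fair", "Average", "Good", "Excellent")
  )
)

# Append a totals row and a totals column
with_totals <- addmargins(health_by_region)
print(with_totals)
```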
Row percentages are computed by dividing the count in an individual cell by the total count for the row in which that cell falls. A row percentage shows, among the subjects in a given row, the share that falls in each column. For example, the 4% in the first cell of Table 3 is calculated as follows. \[ \text{Percent} = \dfrac{77}{2086} \times 100\% = 3.691275\% \approx 4\% \]
We can thus say that, based on this sample, an estimated 4% of the people from North East region have poor health.
table3 = tabyl(nhanesdata, region, health) %>%
adorn_totals(c("row", "col")) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(digits = 0) %>%
adorn_ns(position = "front") %>%
adorn_title("combined")
names(table3)[1] = c("Region")
kbl(table3, align = c("l", rep("r", times = ncol(table3)-1)),
caption = "Table 3: Distribution of participants by region and health status (row percentages).") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Region | Poor | Fair | Average | Good | Excellent | Total |
---|---|---|---|---|---|---|
North East | 77 (4%) | 257 (12%) | 632 (30%) | 558 (27%) | 562 (27%) | 2086 (100%) |
North West | 167 (6%) | 419 (15%) | 736 (27%) | 721 (26%) | 730 (26%) | 2773 (100%) |
South | 317 (11%) | 532 (19%) | 807 (28%) | 651 (23%) | 546 (19%) | 2853 (100%) |
West | 168 (6%) | 462 (18%) | 765 (29%) | 661 (25%) | 569 (22%) | 2625 (100%) |
Total | 729 (7%) | 1670 (16%) | 2940 (28%) | 2591 (25%) | 2407 (23%) | 10337 (100%) |
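The row percentages in Table 3 can be cross-checked in base R with `prop.table()` and `margin = 1`, which divides each cell by its row total. This is a sketch using the counts from Table 2 typed in by hand; the object names are illustrative.

```r
# Cell counts copied from Table 2 (rows = region, columns = health status)
health_by_region <- matrix(
  c( 77, 257, 632, 558, 562,
    167, 419, 736, 721, 730,
    317, 532, 807, 651, 546,
    168, 462, 765, 661, 569),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Region = c("North East", "North West", "South", "West"),
    Health = c("Poor", "Fair", "Average", "Good", "Excellent")
  )
)

# margin = 1: each cell divided by its row total, so every row sums to 100%
row_pct <- round(100 * prop.table(health_by_region, margin = 1))
print(row_pct)
```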
Column percentages, on the other hand, are computed by dividing the count in an individual cell by the total count for the column in which that cell falls. A column percentage shows, among the subjects in a given column, the share that falls in each row. For example, the 11% in the first cell of Table 4 is calculated as follows. \[ \text{Percent} = \dfrac{77}{729} \times 100\% = 10.56241\% \approx 11\% \]
We can thus say that, based on this sample, an estimated 11% of the people with poor health come from North East region.
table4 = tabyl(nhanesdata, region, health) %>%
adorn_totals(c("row", "col")) %>%
adorn_percentages("col") %>%
adorn_pct_formatting(digits = 0) %>%
adorn_ns(position = "front") %>%
adorn_title("combined")
names(table4)[1] = c("Region")
kbl(table4, align = c("l", rep("r", times = ncol(table4)-1)),
caption = "Table 4: Distribution of participants by region and health status (column percentages).") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Region | Poor | Fair | Average | Good | Excellent | Total |
---|---|---|---|---|---|---|
North East | 77 (11%) | 257 (15%) | 632 (21%) | 558 (22%) | 562 (23%) | 2086 (20%) |
North West | 167 (23%) | 419 (25%) | 736 (25%) | 721 (28%) | 730 (30%) | 2773 (27%) |
South | 317 (43%) | 532 (32%) | 807 (27%) | 651 (25%) | 546 (23%) | 2853 (28%) |
West | 168 (23%) | 462 (28%) | 765 (26%) | 661 (26%) | 569 (24%) | 2625 (25%) |
Total | 729 (100%) | 1670 (100%) | 2940 (100%) | 2591 (100%) | 2407 (100%) | 10337 (100%) |
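The column percentages in Table 4 can be cross-checked the same way with `margin = 2`, which divides each cell by its column total. As before, this is a sketch with the Table 2 counts entered by hand and illustrative object names.

```r
# Cell counts copied from Table 2 (rows = region, columns = health status)
health_by_region <- matrix(
  c( 77, 257, 632, 558, 562,
    167, 419, 736, 721, 730,
    317, 532, 807, 651, 546,
    168, 462, 765, 661, 569),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Region = c("North East", "North West", "South", "West"),
    Health = c("Poor", "Fair", "Average", "Good", "Excellent")
  )
)

# margin = 2: each cell divided by its column total, so every column sums to 100%
col_pct <- round(100 * prop.table(health_by_region, margin = 2))
print(col_pct)
```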
Tests for categorical data are used to determine whether an association (or relationship) between two categorical variables in a sample is likely to reflect a real association between these two variables in the population from which this representative sample was drawn. Two statistical techniques that are commonly used to test the association between categorical data are:
Chi-square test of independence
Fisher’s exact test
Begin by labeling a few variables of interest in the nhanesdata data frame. This step simply ensures that the output table has easy-to-understand labels. If no variable labels are assigned, the variable names are used as labels in the output table. Remember that variable names and variable labels are different things: variable labels are descriptions that give more information about all or some of the variables in the data set.
labels(nhanesdata) = c(sex = 'Sex', race = "Race",
region = "Region", health = "Health status",
age = "Age in years", cholesterol = "Serum cholesterol",
bmi = "Body mass index")
The code below generates a crosstabulation of sex by three other categorical variables, namely race, region and health status. It also displays the p-value used to decide whether the association between sex and each of the three variables is statistically significant. Interpretation of statistical significance is discussed in the section Hypothesis testing and p-values.
tab5 = tableby(sex ~ race + region + health,
data = nhanesdata)
summary(tab5, digits.p = 3, digits.pct = 1,
title = 'Table 5: Crosstabulation of sex with race, region and health status of respondents.',
pfootnote = TRUE)
| | Male (N=4909) | Female (N=5428) | Total (N=10337) | p value |
|---|---|---|---|---|
| Race | | | | 0.328¹ |
| White | 4306 (87.7%) | 4745 (87.4%) | 9051 (87.6%) | |
| Black | 500 (10.2%) | 586 (10.8%) | 1086 (10.5%) | |
| Other | 103 (2.1%) | 97 (1.8%) | 200 (1.9%) | |
| Region | | | | 0.604¹ |
| North East | 1013 (20.6%) | 1073 (19.8%) | 2086 (20.2%) | |
| North West | 1310 (26.7%) | 1463 (27.0%) | 2773 (26.8%) | |
| South | 1332 (27.1%) | 1521 (28.0%) | 2853 (27.6%) | |
| West | 1254 (25.5%) | 1371 (25.3%) | 2625 (25.4%) | |
| Health status | | | | < 0.001¹ |
| Poor | 382 (7.8%) | 347 (6.4%) | 729 (7.1%) | |
| Fair | 722 (14.7%) | 948 (17.5%) | 1670 (16.2%) | |
| Average | 1340 (27.3%) | 1600 (29.5%) | 2940 (28.4%) | |
| Good | 1213 (24.7%) | 1378 (25.4%) | 2591 (25.1%) | |
| Excellent | 1252 (25.5%) | 1155 (21.3%) | 2407 (23.3%) | |
¹ Pearson chi-square test
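The p value for health status in Table 5 can be cross-checked with base R's `chisq.test()`. The sketch below types in the sex by health status counts from Table 5 by hand; the object names `sex_by_health` and `chi` are illustrative, and the result is expected to agree with the table only up to the rounding used there.

```r
# Counts copied from Table 5 (rows = sex, columns = health status)
sex_by_health <- matrix(
  c(382, 722, 1340, 1213, 1252,
    347, 948, 1600, 1378, 1155),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Sex    = c("Male", "Female"),
    Health = c("Poor", "Fair", "Average", "Good", "Excellent")
  )
)

# Pearson chi-square test of independence, df = (2 - 1) * (5 - 1) = 4
chi <- chisq.test(sex_by_health)
print(chi)

# Expected counts under independence (row total * column total / grand total)
print(round(chi$expected, 1))
```

A common rule of thumb is that the chi-square approximation is reasonable when the expected counts are not too small, which `chi$expected` lets you inspect.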
By default, the Pearson chi-square test (option chisq) is applied to categorical variables. You can request Fisher's exact test instead by wrapping a variable in fe(), as shown in the code below.
tab6 = tableby(sex ~ fe(race) + region + health,
data = nhanesdata)
summary(tab6, digits.p = 3, digits.pct = 1,
title = 'Table 6: Crosstabulation of sex with race, region and health status of respondents.',
pfootnote = TRUE)
| | Male (N=4909) | Female (N=5428) | Total (N=10337) | p value |
|---|---|---|---|---|
| Race | | | | 0.327¹ |
| White | 4306 (87.7%) | 4745 (87.4%) | 9051 (87.6%) | |
| Black | 500 (10.2%) | 586 (10.8%) | 1086 (10.5%) | |
| Other | 103 (2.1%) | 97 (1.8%) | 200 (1.9%) | |
| Region | | | | 0.604² |
| North East | 1013 (20.6%) | 1073 (19.8%) | 2086 (20.2%) | |
| North West | 1310 (26.7%) | 1463 (27.0%) | 2773 (26.8%) | |
| South | 1332 (27.1%) | 1521 (28.0%) | 2853 (27.6%) | |
| West | 1254 (25.5%) | 1371 (25.3%) | 2625 (25.4%) | |
| Health status | | | | < 0.001² |
| Poor | 382 (7.8%) | 347 (6.4%) | 729 (7.1%) | |
| Fair | 722 (14.7%) | 948 (17.5%) | 1670 (16.2%) | |
| Average | 1340 (27.3%) | 1600 (29.5%) | 2940 (28.4%) | |
| Good | 1213 (24.7%) | 1378 (25.4%) | 2591 (25.1%) | |
| Excellent | 1252 (25.5%) | 1155 (21.3%) | 2407 (23.3%) | |
¹ Fisher's exact test
² Pearson chi-square test
It is important to note that Fisher's exact test sometimes fails to converge. For example, requesting it for region or health with fe(region) or fe(health) results in an error.
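One possible workaround, not used on this page, is the Monte Carlo option of base R's `fisher.test()`, which estimates the p value by simulation instead of computing it exactly. The sketch below applies it to the sex by health status counts from Table 5, typed in by hand; the seed, the number of replicates `B` and the object names are arbitrary choices made for this illustration.

```r
# Counts copied from Table 5 (rows = sex, columns = health status)
sex_by_health <- matrix(
  c(382, 722, 1340, 1213, 1252,
    347, 948, 1600, 1378, 1155),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Sex    = c("Male", "Female"),
    Health = c("Poor", "Fair", "Average", "Good", "Excellent")
  )
)

set.seed(2024)  # arbitrary seed, only to make the simulation repeatable

# simulate.p.value = TRUE replaces the exact computation with B random tables
fe_sim <- fisher.test(sex_by_health, simulate.p.value = TRUE, B = 10000)
print(fe_sim$p.value)
```

Because the p value is simulated, it cannot be smaller than 1 / (B + 1), so a larger `B` is needed if very small p values must be resolved.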
STEM Research
https://stemresearchs.com