This data dive will focus on the documentation associated with a CDC dataset that provides weekly data of COVID-19 cases and deaths reported in the United States from March 7, 2020 through November 18, 2023.
The dataset was downloaded from the CDC Data Catalog, which provides documentation describing this dataset.
Let’s examine some ways in which the provided documentation helps clarify this dataset, as well as some ways in which the documentation could be improved.
To get started, let’s load a few R packages to assist with our analysis.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
Next, let’s read in the dataset from CSV.
covid <- read_delim("./COVID_weekly_cases_deaths_region5.csv", delim = ",")
## Rows: 37867 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): end_of_week, jurisdiction, age_group, sex, race_ethnicity_combined
## dbl (4): case_count_suppressed, death_count_suppressed, case_crude_rate_supp...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let’s view the complete list of 9 column names in this dataset.
colnames(covid)
## [1] "end_of_week"
## [2] "jurisdiction"
## [3] "age_group"
## [4] "sex"
## [5] "race_ethnicity_combined"
## [6] "case_count_suppressed"
## [7] "death_count_suppressed"
## [8] "case_crude_rate_suppressed_per_100k"
## [9] "death_crude_rate_suppressed_per_100k"
Let’s view some example data by selecting a random sample of 15 rows from the dataset. (The CDC also provides an online data preview for this dataset.)
sample_n(covid, 15)
## # A tibble: 15 × 9
##    end_of_week jurisdiction age_group     sex     race_ethnicity_combined
##    <chr>       <chr>        <chr>         <chr>   <chr>
##  1 11/12/22    Region 5     30 - 39 Years Overall Black, NH
##  2 10/30/21    Region 5     Overall       Male    Overall
##  3 4/15/23     Region 5     50 - 64 Years Overall White, NH
##  4 4/22/23     Region 5     40 - 49 Years Overall AI/AN, NH
##  5 12/25/21    Region 5     18 - 29 Years Female  White, NH
##  6 9/17/22     Region 5     75+ Years     Overall Asian/PI, NH
##  7 7/1/23      Region 5     Overall       Female  Hispanic
##  8 1/15/22     Region 5     30 - 39 Years Overall Hispanic
##  9 10/30/21    Region 5     16 - 17 Years Male    Overall
## 10 6/17/23     Region 5     5 - 11 Years  Female  Asian/PI, NH
## 11 11/11/23    Region 5     65 - 74 Years Overall Hispanic
## 12 11/11/23    Region 5     65 - 74 Years Male    Asian/PI, NH
## 13 11/11/23    Region 5     75+ Years     Male    Overall
## 14 3/27/21     Region 5     75+ Years     Female  Black, NH
## 15 2/25/23     Region 5     Overall       Overall Hispanic
## # ℹ 4 more variables: case_count_suppressed <dbl>,
## #   death_count_suppressed <dbl>, case_crude_rate_suppressed_per_100k <dbl>,
## #   death_crude_rate_suppressed_per_100k <dbl>
Notice that the end_of_week column has been read in as character data (<chr>) rather than as dates. Let’s convert that column to a date format to help with our analysis.
covid$end_of_week <- as.Date(covid$end_of_week, format="%m/%d/%y")
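As a quick sanity check on this conversion, the sketch below parses a few date strings in the same m/d/yy style and confirms that none of them turned into NA. The raw_dates vector is a hypothetical stand-in for covid$end_of_week, built from dates mentioned in this data dive. The weekday check assumes that every reporting week should end on the same day of the week, which the CDC documentation may or may not guarantee.

```r
# Hypothetical stand-in for covid$end_of_week (m/d/yy strings, as in the CSV)
raw_dates <- c("3/7/20", "10/30/21", "11/18/23")

# Same conversion as applied to the dataset
parsed <- as.Date(raw_dates, format = "%m/%d/%y")

# Strings that do not match the format are returned as NA, so count them
n_failed <- sum(is.na(parsed))

# ISO weekday number (1 = Monday ... 7 = Sunday) for each parsed date
iso_weekday <- format(parsed, "%u")

print(range(parsed))
print(n_failed)
print(unique(iso_weekday))
```

Running the same three checks on the real column (replacing raw_dates with the original character values) would flag any malformed dates before they silently become NA.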
Next, let’s utilize the documentation for this dataset to better understand some of the columns and values seen in the data.
The original dataset includes weekly COVID-19 data from across the United States and the US territories. One might expect the dataset to have a column for state or territory (with values such as: Alabama, Alaska, … Puerto Rico, US Virgin Islands).
Instead, the dataset has a column named jurisdiction, which takes either the value US or one of 10 numbered regions: Region 1, Region 2, Region 3, … Region 10.
While the value US implies summary data at the national level, it is not immediately clear how the 10 regions are defined. Unfortunately, the CDC documentation for the dataset does not describe these 10 regions.
However, the documentation does link to a CDC COVID Data Tracker, which happens to include a small inset map of the United States divided into 10 numbered regions (annotated screenshot below).
Expanding the footnotes section at the end of the COVID Data Tracker page reveals a reference to the US Department of Health and Human Services (HHS) having 10 regions, and the footnote provides a link to an HHS page with a map and description of the 10 regions (screenshot excerpt below).
At last, it is clear which states and territories belong to the 10 regions used as values for jurisdiction in the COVID-19 dataset. However, getting to this information took a somewhat haphazard path through two other web pages.
RECOMMENDATION: The CDC documentation for the dataset should provide a direct link to the HHS page describing the 10 regions used as values for jurisdiction.
⚠️ Note: The original dataset contains over 400,000 rows of data for these 10 regions, plus national summary data. However, for this series of data dives, the dataset was reduced to only include data for Region 5, which consists of the states of Illinois, Indiana, Michigan, Minnesota, Ohio, and Wisconsin. This reduced dataset has approximately 38,000 rows.
The dataset has a column named race_ethnicity_combined to categorize COVID-19 patients according to their race and ethnicity.
The CDC documentation does not explain how the race/ethnicity groups were determined. However, the values used in this dataset can be deciphered to provide hints:
AI/AN, NH = American Indian/Alaska Native, Non-Hispanic
Asian/PI, NH = Asian/Pacific Islander, Non-Hispanic
Black, NH = Black/African American, Non-Hispanic
Hispanic = Hispanic
White, NH = White, Non-Hispanic
Overall = all persons of any race/ethnicity
The US federal government has standards as to how data for race and ethnicity are collected, categorized, and used, in order to promote comparability of data among federal data systems. In general, federal agencies follow the “Revisions to the Standards for the Classification of Federal Data on Race and Ethnicity” issued in 1997 by the federal Office of Management and Budget (OMB).
Under these standards, race and ethnicity are collected as separate variables.
For ethnicity, a person can select “Hispanic or Latino” or “Not Hispanic or Latino” (or potentially “Don’t know” or “Declined to answer”).
For race, a person can select one or more races (and/or “Don’t know” or “Declined to answer”).
In this CDC dataset, if a person identified as Hispanic, they were classified as Hispanic for their race/ethnicity group regardless of which race(s) they identified with.
The other race/ethnicity groups represent persons who identified as Non-Hispanic and with a single race.
The CDC documentation does explain that persons who had missing race/ethnicity information or who identified as Non-Hispanic and multiple races are not listed as a separate demographic group. However, their cases and deaths are included as part of the Overall totals.
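The classification rules described above can be sketched as a small function. This is only one reading of the text, not the CDC’s actual procedure: the function name classify_race_ethnicity, the input labels ("Hispanic", "Non-Hispanic", and short race codes such as "Black"), and the choice to treat a Hispanic person with missing race information as Hispanic are all assumptions made for illustration.

```r
# Hypothetical helper: combine one person's ethnicity and race(s) into a group.
# Returns NA when the person would not appear in any separate group.
classify_race_ethnicity <- function(ethnicity, races) {
  # Missing ethnicity: no group can be assigned
  if (is.na(ethnicity)) return(NA_character_)
  # Hispanic ethnicity takes precedence over whichever race(s) were reported
  if (ethnicity == "Hispanic") return("Hispanic")
  # Non-Hispanic with missing race information: no separate group
  if (length(races) == 0 || anyNA(races)) return(NA_character_)
  # Non-Hispanic with multiple races: no separate group
  if (length(races) > 1) return(NA_character_)
  # Non-Hispanic with exactly one race, e.g. "Black" becomes "Black, NH"
  paste0(races, ", NH")
}

print(classify_race_ethnicity("Hispanic", c("White", "Black")))
print(classify_race_ethnicity("Non-Hispanic", "Black"))
print(classify_race_ethnicity("Non-Hispanic", c("White", "Asian/PI")))
print(classify_race_ethnicity(NA, "White"))
```

Persons for whom this sketch returns NA would still be counted in the Overall totals, which is why the individual race/ethnicity groups need not sum to Overall.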
RECOMMENDATION: The CDC documentation for the dataset should provide an explanation of how race and ethnicity are combined to classify persons into groups for data purposes.
The dataset provides weekly data for cases of COVID-19 and deaths from COVID-19. Each row lists aggregated data for a specific demographic group (such as: Hispanic females ages 18-29) for a specific week (such as: week ending 3/7/2020).
The dataset has columns for case count and death count, as well as case rate and death rate.
case_count_suppressed
death_count_suppressed
case_crude_rate_suppressed_per_100k
death_crude_rate_suppressed_per_100k
The counts represent the number of cases (or deaths) that were reported in the given week. The CDC documentation indicates that rates are the number of cases (or deaths) that occurred per 100,000 population.
Thus, the weekly rate is based on the weekly count but takes population size into account. This allows for valid comparison of incidence between different groups that may vary in population size.
\[ \text{Rate} = \frac{\text{Count}}{\text{Population Size}} \times 100{,}000 \]
For example:
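As a minimal sketch with made-up numbers (not values from the dataset), suppose two demographic groups each report 250 cases in a week, but one group has a population of 1,250,000 and the other a population of 500,000. The function name crude_rate_per_100k and both populations are hypothetical.

```r
# Crude rate per 100,000 population, following the formula above
crude_rate_per_100k <- function(count, population) {
  count / population * 100000
}

# Hypothetical groups with the same weekly case count but different sizes
rate_group_a <- crude_rate_per_100k(count = 250, population = 1250000)
rate_group_b <- crude_rate_per_100k(count = 250, population = 500000)

print(rate_group_a)
print(rate_group_b)
```

The counts are identical, yet the rates work out to 20 and 50 cases per 100,000, which shows why rates rather than counts are the fair basis for comparing groups of different sizes.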
What the CDC documentation does not explain is how the population size was determined for the demographic groups. One would presume that US Census population data was utilized. (The most recent decennial census was conducted in 2020, and its data released in April 2021.) One might also wonder whether the CDC accounted for changes in population counts over time (i.e., births, deaths, migration) since this COVID data was collected over a period of time from March 2020 until November 2023.
A potential clue (that may or may not apply to our dataset) comes from another footnote in the aforementioned CDC COVID Data Tracker page. The footnote explains that the CDC calculated rates for the COVID Data Tracker using a set of July 1, 2021 “Blended Base” population estimates from the US Census Bureau. The footnote links to a US Census Bureau paper describing the methodology it uses to update population estimates annually (every July 1) based on births, deaths, and migration that have occurred since the last decennial census (e.g., the 2020 Census).
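One way to probe this question is to invert the rate formula: dividing a count by its rate and multiplying by 100,000 gives the population size that the rate implies. The sketch below does this on a small made-up data frame (the toy values are hypothetical, chosen so the arithmetic is easy to follow). If the implied population for a group stayed constant across weeks in the real data, that would be consistent with a single fixed population estimate, although rounding in the published rates would make the result approximate.

```r
# Hypothetical rows for one demographic group across three weeks
toy <- data.frame(
  end_of_week = as.Date(c("2021-01-02", "2022-01-01", "2023-01-07")),
  case_count_suppressed = c(1200, 3000, 450),
  case_crude_rate_suppressed_per_100k = c(24, 60, 9)
)

# Invert Rate = Count / Population * 100,000 to recover the implied population
toy$implied_population <- toy$case_count_suppressed /
  toy$case_crude_rate_suppressed_per_100k * 100000

# Number of distinct implied populations (1 suggests a fixed denominator)
n_distinct_populations <- length(unique(round(toy$implied_population)))

print(toy)
print(n_distinct_populations)
```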
RECOMMENDATION: While the CDC documentation for the dataset provides some clarity on what the rates represent, the documentation could be improved by clarifying how the population sizes for groups were determined for the rate calculations.
The column names for the counts and rates of cases and deaths include the descriptor suppressed.
case_count_suppressed
death_count_suppressed
case_crude_rate_suppressed_per_100k
death_crude_rate_suppressed_per_100k
The CDC documentation explains that if the weekly cumulative count for cases (or deaths) for a specific demographic group was 5 or fewer, the count was suppressed in order to protect the confidentiality of patients. These suppressed counts show up in the dataset as an explicit missing value (NA).
Because the rates are calculated from the counts, if the count for a specific demographic group in a given week was suppressed, then the corresponding rate for that demographic group will also be suppressed for that week. In other words, within a row, an NA value for a count of cases or deaths results in an NA value for the corresponding rate.
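The suppression rule and its effect on the rates can be sketched as follows. The helper suppress_small_counts, the raw counts, and the population of 500,000 are hypothetical. The sketch also assumes that a count of zero is treated like any other count of 5 or fewer, which the CDC documentation may or may not intend.

```r
# Suppress counts at or below a threshold (5, per the rule described above)
suppress_small_counts <- function(counts, threshold = 5) {
  counts[!is.na(counts) & counts <= threshold] <- NA
  counts
}

# Hypothetical weekly counts and population for one demographic group
raw_counts <- c(0, 3, 5, 6, 120)
population <- 500000

suppressed_counts <- suppress_small_counts(raw_counts)

# Arithmetic on NA yields NA, so suppression carries over to the rate
rates_per_100k <- suppressed_counts / population * 100000

# Entries where exactly one of count and rate is missing (expected: none)
n_mismatched <- sum(xor(is.na(suppressed_counts), is.na(rates_per_100k)))

print(suppressed_counts)
print(rates_per_100k)
print(n_mismatched)
```

The same xor() comparison could be applied to each count column and its rate column in the real dataset to confirm that a missing count and a missing rate always occur together.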
Thus, there are four possible scenarios for each row in the dataset:
The row has data for both cases (##) and deaths (##).
The row has case data (##) but suppressed death data (NA).
The row has death data (##) but suppressed case data (NA).
The row has suppressed data for both cases (NA) and deaths (NA).
Using these scenarios, let’s determine the data completeness for case and death data across the rows.
# Total number of rows in dataset
total_rows <- nrow(covid)
# Number of rows with complete data for both cases and deaths
complete_rows <- covid |>
  filter(
    !is.na(case_count_suppressed) & !is.na(case_crude_rate_suppressed_per_100k) &
      !is.na(death_count_suppressed) & !is.na(death_crude_rate_suppressed_per_100k)
  ) |>
  nrow()
# Number of rows with case data but suppressed death data (death count <= 5)
case_only_rows <- covid |>
  filter(
    !is.na(case_count_suppressed) & !is.na(case_crude_rate_suppressed_per_100k) &
      is.na(death_count_suppressed) & is.na(death_crude_rate_suppressed_per_100k)
  ) |>
  nrow()
# Number of rows with death data but suppressed case data (case count <= 5)
death_only_rows <- covid |>
  filter(
    is.na(case_count_suppressed) & is.na(case_crude_rate_suppressed_per_100k) &
      !is.na(death_count_suppressed) & !is.na(death_crude_rate_suppressed_per_100k)
  ) |>
  nrow()
# Number of rows with suppressed data for both cases and deaths (counts <= 5)
suppressed_rows <- covid |>
  filter(
    is.na(case_count_suppressed) & is.na(case_crude_rate_suppressed_per_100k) &
      is.na(death_count_suppressed) & is.na(death_crude_rate_suppressed_per_100k)
  ) |>
  nrow()
# Create dataframe with row counts for different scenarios of data completeness
# Note: using leading spaces in strings to order scenarios in plot
data_completeness <- c("Both Case and Death Data", " Case Data but Deaths Suppressed", " Death Data but Cases Suppressed", " Both Cases and Deaths Suppressed")
num_rows <- c(complete_rows, case_only_rows, death_only_rows, suppressed_rows)
data_comp_df <- data.frame(data_completeness, num_rows)
# Plot bar chart to show data completeness
data_comp_df |>
  ggplot() +
  geom_bar(mapping = aes(x = data_completeness, y = num_rows), stat = "identity") +
  geom_text(
    aes(
      x = data_completeness, y = num_rows,
      label = str_c(num_rows, " (", round(num_rows / total_rows * 100, digits = 1), "%)")
    ),
    hjust = -0.2
  ) +
  coord_flip() +
  ylim(0, 36000) +
  labs(title = "Data Completeness Within Rows", x = "", y = "Number of Rows") +
  theme_classic()
As it turns out, only about a quarter of the rows (~24%) have complete data for both cases and deaths. Most of the rows (~71%) have only case data, with deaths suppressed. A small proportion of rows (about 5%) have both cases and deaths suppressed.
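The same tally can also be produced in a single pass by labelling each row with its scenario and then tabulating the labels. The sketch below uses a small made-up data frame (toy is hypothetical) and looks only at the two count columns, on the assumption that a suppressed count always comes with a suppressed rate.

```r
# Hypothetical rows covering all four scenarios
toy <- data.frame(
  case_count_suppressed  = c(120, 48, NA, 15, NA, 300),
  death_count_suppressed = c(9, NA, NA, NA, 7, 12)
)

has_cases  <- !is.na(toy$case_count_suppressed)
has_deaths <- !is.na(toy$death_count_suppressed)

# Label each row with exactly one of the four scenarios
toy$completeness <- ifelse(
  has_cases & has_deaths, "Both Case and Death Data",
  ifelse(
    has_cases, "Case Data but Deaths Suppressed",
    ifelse(has_deaths, "Death Data but Cases Suppressed",
           "Both Cases and Deaths Suppressed")
  )
)

# Row counts and percentages per scenario
completeness_counts <- table(toy$completeness)
completeness_pct <- round(prop.table(completeness_counts) * 100, digits = 1)

print(completeness_counts)
print(completeness_pct)
```

Because every row receives exactly one label, the counts are guaranteed to sum to the total number of rows, which is a useful cross-check on four separately filtered totals.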
As explained previously, if the weekly cumulative count for cases (or deaths) for a specific demographic group was 5 or fewer, the count was suppressed in order to protect the confidentiality of patients. These suppressed counts show up in the dataset as an explicit missing value (NA).
As we saw earlier, most rows (~71%) have only case data with deaths suppressed (NA). A small proportion of rows (~5%) have both cases and deaths suppressed.
Interestingly, there is another form of missing data in this dataset: implicit missing rows.
Each row in the dataset presents weekly data for a specific demographic group based on a combination of age group, sex at birth, and race/ethnicity group.
Age Group: 0-4 Years, 5-11 Years, 12-15 Years, 16-17 Years, 18-29 Years, 30-39 Years, 40-49 Years, 50-64 Years, 65-74 Years, 75+ Years, or Overall (all ages combined)
Sex: Female, Male, or Overall (all sexes combined)
Race/Ethnicity Group: AI/AN, NH; Asian/PI, NH; Black, NH; Hispanic; White, NH; or Overall (all race/ethnicity combined)
For each week in the dataset, there should be a separate row with weekly COVID-19 data for each possible demographic group combination. For example, one possible demographic group combination would be: [Age Group = 16-17 Years, Sex = Male, Race/Ethnicity Group = Hispanic].
Based on the number of possible values for the three demographic variables, the number of different demographic groups combinations in this dataset is:
\[ 11 \times 3 \times 6 = 198 \]
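This count can be double-checked by enumerating every combination explicitly. In the sketch below, the age group labels are written with spaces around the hyphen, as they appear in the sample rows shown earlier; the labels that do not appear in that sample (such as "0 - 4 Years") are assumed to follow the same pattern.

```r
# Possible values for the three demographic variables
age_groups <- c(
  "0 - 4 Years", "5 - 11 Years", "12 - 15 Years", "16 - 17 Years",
  "18 - 29 Years", "30 - 39 Years", "40 - 49 Years", "50 - 64 Years",
  "65 - 74 Years", "75+ Years", "Overall"
)
sexes <- c("Female", "Male", "Overall")
race_ethnicity_groups <- c(
  "AI/AN, NH", "Asian/PI, NH", "Black, NH", "Hispanic", "White, NH", "Overall"
)

# One row per possible demographic group combination
all_groups <- expand.grid(
  age_group = age_groups,
  sex = sexes,
  race_ethnicity_combined = race_ethnicity_groups,
  stringsAsFactors = FALSE
)

n_groups <- nrow(all_groups)
print(n_groups)
print(head(all_groups, 3))
```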
As we saw previously, even if a specific demographic group has suppressed data (NA) for both cases and deaths in a given week, it is still included as a row within the dataset, albeit a row with missing numeric data.
So we would expect the dataset to have 198 rows of COVID-19 data for each week: one row for each possible demographic group combination.
Let’s verify if this is true or not.
# Count actual number of rows for each week
num_weekly_rows <- covid |>
  group_by(end_of_week) |>
  summarise(
    total_rows = n()
  )
# Plot histogram to show distribution of number of rows per week
num_weekly_rows |>
  ggplot() +
  geom_histogram(mapping = aes(x = total_rows), binwidth = 1, color = 'white') +
  geom_vline(xintercept = 198, color = 'orange') +
  labs(
    title = "COVID-19 Weekly Data (March 7, 2020 - November 18, 2023)",
    subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
    x = "Number of Rows of Data per Week", y = "Count of Weeks"
  ) +
  annotate("text", x = 187, y = 110, label = "Expected # Rows = 198", color = 'orange') +
  theme_classic() +
  theme(plot.subtitle = element_text(colour = "darkgray"))
# Number of weeks in dataset
total_weeks <- nrow(num_weekly_rows)
# Number of weeks with all 198 rows for every demographic combination
weeks_198_rows <- nrow(filter(num_weekly_rows, total_rows == 198))
# Percent of weeks with all 198 rows for every demographic combination
percent_weeks_198_rows <- round(weeks_198_rows / total_weeks * 100, digits = 1)
cat("Weeks with 198 Rows =", percent_weeks_198_rows, "%\n")
## Weeks with 198 Rows = 57.2 %
As the histogram shows, many of the weeks in the dataset do not have the complete set of 198 rows for every demographic group combination.
What does it mean if a particular demographic combination is not included in the dataset for a given week?
Let’s look at an example for a specific week.
The dataset includes a row for [Age Group = 16-17 Years, Sex = Male, Race/Ethnicity Group = Hispanic] which has NA for both case and death data. Presumably, this represents suppressed data where the counts for cases and deaths were both 5 or fewer for this particular demographic group.
However, the dataset does not include a row for [Age Group = 16-17 Years, Sex = Male, Race/Ethnicity Group = AI/AN, NH]. Does the absence of this specific demographic group imply that there were zero cases and zero deaths for that group in this given week?
The mystery here is whether the implicit missing rows have any meaning at all.
The CDC documentation for this dataset does not explain why a specific demographic group can be absent in a given week while another demographic group is included for the same week with NA values for its case and death data. However, we do know that about 43% of the weeks in the dataset have one or more missing rows for certain demographic groups.
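To see exactly which demographic groups are absent in which weeks, one option is to build the full grid of expected week-by-group combinations and compare it against the rows that actually exist. The sketch below does this in base R on a small made-up data frame with only two age groups and two sexes (toy and its values are hypothetical). It separates implicit missing rows (no row at all) from explicit missing values (a row is present but its count is NA).

```r
# Hypothetical data: week 2 has no row at all for 16 - 17 Years / Male
toy <- data.frame(
  end_of_week = as.Date(c(
    "2023-01-07", "2023-01-07", "2023-01-07", "2023-01-07",
    "2023-01-14", "2023-01-14", "2023-01-14"
  )),
  age_group = c(
    "16 - 17 Years", "16 - 17 Years", "75+ Years", "75+ Years",
    "16 - 17 Years", "75+ Years", "75+ Years"
  ),
  sex = c("Female", "Male", "Female", "Male", "Female", "Female", "Male"),
  case_count_suppressed = c(12, NA, 40, 35, 8, 22, NA)
)

# Every week crossed with every demographic combination seen in the data
expected <- expand.grid(
  end_of_week = unique(toy$end_of_week),
  age_group = unique(toy$age_group),
  sex = unique(toy$sex),
  stringsAsFactors = FALSE
)

# Flag rows that exist, then left-join the data onto the full grid
toy$present <- TRUE
keys <- c("end_of_week", "age_group", "sex")
merged <- merge(expected, toy, by = keys, all.x = TRUE)

# Implicit missing: the combination has no row in the data for that week
implicit_missing <- merged[is.na(merged$present), keys]

# Explicit missing: the row exists but its count is suppressed (NA)
explicit_missing <- merged[
  !is.na(merged$present) & is.na(merged$case_count_suppressed), keys
]

print(implicit_missing)
print(explicit_missing)
```

With the tidyverse already loaded, tidyr::complete() or dplyr::anti_join() would be natural alternatives for the same comparison. Either way, the listing only identifies which rows are absent; it cannot say whether an absent row means zero cases or something else.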
RECOMMENDATION: The CDC documentation for the dataset should explain what a missing row for a demographic group means, as compared to an included row that has missing/suppressed data.
This dataset will be explored further in subsequent data dives.