COVID-19 Weekly Cases and Deaths by Age, Race/Ethnicity, and Sex (Mar 2020 - Nov 2023)

This data dive will focus on the documentation associated with a CDC dataset that provides weekly data on COVID-19 cases and deaths reported in the United States from March 7, 2020 through November 18, 2023.

The dataset was downloaded from the CDC Data Catalog, which provides documentation describing this dataset.

Let’s examine some ways in which the provided documentation helps clarify this dataset, as well as some ways in which the documentation could be improved.

Preview Dataset

To get started, let’s load a few R packages to assist with our analysis.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)

Next, let’s read in the dataset from CSV.

covid <- read_delim("./COVID_weekly_cases_deaths_region5.csv", delim = ",")

## Rows: 37867 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): end_of_week, jurisdiction, age_group, sex, race_ethnicity_combined
## dbl (4): case_count_suppressed, death_count_suppressed, case_crude_rate_supp...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Let’s view the complete list of 9 column names in this dataset.

colnames(covid)

## [1] "end_of_week"                         
## [2] "jurisdiction"                        
## [3] "age_group"                           
## [4] "sex"                                 
## [5] "race_ethnicity_combined"             
## [6] "case_count_suppressed"               
## [7] "death_count_suppressed"              
## [8] "case_crude_rate_suppressed_per_100k" 
## [9] "death_crude_rate_suppressed_per_100k"

Let’s view some example data by selecting a random sample of 15 rows from the dataset. (The CDC also provides an online data preview for this dataset.)

# slice_sample() is the current dplyr replacement for the superseded sample_n()
slice_sample(covid, n = 15)

## # A tibble: 15 × 9
##    end_of_week jurisdiction age_group     sex     race_ethnicity_combined
##    <chr>       <chr>        <chr>         <chr>   <chr>                  
##  1 11/12/22    Region 5     30 - 39 Years Overall Black, NH              
##  2 10/30/21    Region 5     Overall       Male    Overall                
##  3 4/15/23     Region 5     50 - 64 Years Overall White, NH              
##  4 4/22/23     Region 5     40 - 49 Years Overall AI/AN, NH              
##  5 12/25/21    Region 5     18 - 29 Years Female  White, NH              
##  6 9/17/22     Region 5     75+ Years     Overall Asian/PI, NH           
##  7 7/1/23      Region 5     Overall       Female  Hispanic               
##  8 1/15/22     Region 5     30 - 39 Years Overall Hispanic               
##  9 10/30/21    Region 5     16 - 17 Years Male    Overall                
## 10 6/17/23     Region 5     5 - 11 Years  Female  Asian/PI, NH           
## 11 11/11/23    Region 5     65 - 74 Years Overall Hispanic               
## 12 11/11/23    Region 5     65 - 74 Years Male    Asian/PI, NH           
## 13 11/11/23    Region 5     75+ Years     Male    Overall                
## 14 3/27/21     Region 5     75+ Years     Female  Black, NH              
## 15 2/25/23     Region 5     Overall       Overall Hispanic               
## # ℹ 4 more variables: case_count_suppressed <dbl>,
## #   death_count_suppressed <dbl>, case_crude_rate_suppressed_per_100k <dbl>,
## #   death_crude_rate_suppressed_per_100k <dbl>

Notice that the end_of_week column was read in as character data rather than as dates. Let’s convert that column to a date format to help with our analysis.

covid$end_of_week <- as.Date(covid$end_of_week, format="%m/%d/%y")
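As a quick sanity check on the format string, here is a minimal sketch using a few date strings copied from the sample rows above. The vector name `example_dates` is made up for illustration. The format `%m/%d/%y` tells `as.Date()` to expect month, day, and a two-digit year, and any string that does not match comes back as NA.

```r
# Hypothetical example vector; the strings are copied from the sample rows above
example_dates <- c("11/12/22", "4/15/23", "3/27/21")

# %m = month, %d = day, %y = two-digit year
parsed_dates <- as.Date(example_dates, format = "%m/%d/%y")

# Strings that do not match the format are returned as NA, so count them
sum(is.na(parsed_dates))

# Print in ISO format to confirm that month and day were not swapped
format(parsed_dates, "%Y-%m-%d")
```

Running the same NA count on covid$end_of_week after the conversion would confirm that every row parsed.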

Next, let’s utilize the documentation for this dataset to better understand some of the columns and values seen in the data.

Jurisdiction

The original dataset includes weekly COVID-19 data from across the United States and the US territories. One might expect the dataset to have a column for state or territory (with values such as: Alabama, Alaska, … Puerto Rico, US Virgin Islands).

Instead, the dataset has a column named jurisdiction which includes values for US or a numbered region: Region 1, Region 2, Region 3, … Region 10.

While the value of US implies summary data for the national level, it is not immediately clear how the 10 regions are defined. Unfortunately, the CDC documentation for the dataset does not describe these 10 regions.

However, the documentation does link to a CDC COVID Data Tracker, which happens to include a small inset map of the United States divided into 10 numbered regions (annotated screenshot below).

Expanding the footnotes section at the end of the COVID Data Tracker page reveals a reference to the US Department of Health and Human Services (HHS) having 10 regions, and the footnote provides a link to an HHS page with a map and description of the 10 regions (screenshot excerpt below).

Finally, it’s clear which states and territories belong to the 10 regions used as values for jurisdiction in the COVID-19 dataset. However, getting to this information was somewhat haphazard.

RECOMMENDATION: The CDC documentation for the dataset should provide a direct link to the HHS page describing the 10 regions used as values for jurisdiction.

⚠️ Note: The original dataset contains over 400,000 rows of data for these 10 regions, plus national summary data. However, for this series of data dives, the dataset was reduced to only include data for Region 5, which consists of the states of Illinois, Indiana, Michigan, Minnesota, Ohio, and Wisconsin. This reduced dataset has approximately 38,000 rows.
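The reduction to Region 5 might have looked something like the sketch below. This is an assumption, since the original filtering step is not shown here. The toy data frame `covid_all` and its values are invented for illustration and are not real data.

```r
library(dplyr)

# Invented stand-in for the full CDC file (values are not real data)
covid_all <- data.frame(
  jurisdiction = c("US", "Region 1", "Region 5", "Region 5", "Region 10"),
  case_count_suppressed = c(1000, 80, 120, 95, 60)
)

# Keep only the rows for HHS Region 5
covid_region5 <- covid_all |>
  filter(jurisdiction == "Region 5")

# Confirm that a single jurisdiction value remains
unique(covid_region5$jurisdiction)
```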

Race/Ethnicity

The dataset has a column named race_ethnicity_combined to categorize COVID-19 patients according to their race and ethnicity.

The CDC documentation does not explain how the race/ethnicity groups were determined. However, the values used in this dataset can be deciphered to provide hints:

AI/AN, NH = American Indian/Alaska Native, Non-Hispanic
Asian/PI, NH = Asian/Pacific Islander, Non-Hispanic
Black, NH = Black/African American, Non-Hispanic
Hispanic = Hispanic
White, NH = White, Non-Hispanic
Overall = all persons of any race/ethnicity

The US federal government has standards as to how data for race and ethnicity are collected, categorized, and used, in order to promote comparability of data among federal data systems. In general, federal agencies follow the “Revisions to the Standards for the Classification of Federal Data on Race and Ethnicity” issued in 1997 by the federal Office of Management and Budget (OMB).

In general, race and ethnicity are collected as separate variables.

For ethnicity, a person can select “Hispanic or Latino” or “Not Hispanic or Latino” (or potentially “Don’t know” or “Declined to answer”).

For race, a person can select one or more races (and/or “Don’t know” or “Declined to answer”).

In this CDC dataset, if a person identified as Hispanic, they were classified as Hispanic for their race/ethnicity group regardless of which race(s) they identified with.

The other race/ethnicity groups represent persons who identified as Non-Hispanic and as one race group.

The CDC documentation does explain that persons who had missing race/ethnicity information or who identified as Non-Hispanic and multiple races are not listed as a separate demographic group. However, their cases and deaths are included as part of the Overall totals.
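The classification rules described above can be sketched as a small function. This is only an illustration under stated assumptions: the function name `combine_race_ethnicity()`, its arguments, and the spelled-out race labels are hypothetical, and the CDC's actual procedure is not documented.

```r
# Hypothetical helper; the CDC's actual classification procedure is not documented.
# hispanic: TRUE, FALSE, or NA (unknown ethnicity)
# races:    character vector of the race(s) the person selected
combine_race_ethnicity <- function(hispanic, races) {
  # Missing ethnicity: no separate group (counted only in Overall)
  if (is.na(hispanic)) return(NA_character_)

  # Hispanic persons are grouped as Hispanic regardless of race
  if (hispanic) return("Hispanic")

  # Non-Hispanic with zero, multiple, or missing races: no separate group
  if (length(races) != 1 || is.na(races)) return(NA_character_)

  # Non-Hispanic with exactly one race: map to the dataset's label
  labels <- c(
    "American Indian/Alaska Native" = "AI/AN, NH",
    "Asian/Pacific Islander"        = "Asian/PI, NH",
    "Black/African American"        = "Black, NH",
    "White"                         = "White, NH"
  )
  unname(labels[races])
}

combine_race_ethnicity(TRUE, c("White", "Black/African American"))
combine_race_ethnicity(FALSE, "White")
combine_race_ethnicity(FALSE, c("White", "Asian/Pacific Islander"))
```

In this sketch, a return value of NA stands for a person who would appear only in the Overall totals.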

RECOMMENDATION: The CDC documentation for the dataset should provide an explanation of how race and ethnicity are combined to classify persons into groups for data purposes.

Counts vs Rates

The dataset provides weekly data for cases of COVID-19 and deaths from COVID-19. Each row lists aggregated data for a specific demographic group (such as: Hispanic females ages 18-29) for a specific week (such as: week ending 3/7/2020).

The dataset has columns for case count and death count, as well as case rate and death rate.

case_count_suppressed
death_count_suppressed
case_crude_rate_suppressed_per_100k
death_crude_rate_suppressed_per_100k

The counts represent the number of cases (or deaths) that were reported in the given week. The CDC documentation indicates that rates are the number of cases (or deaths) that occurred per 100,000 population.

Thus, the weekly rate is based on the weekly count but takes population size into account. This allows for valid comparison of incidence between different groups that may vary in population size.

\[ \text{Rate} = \frac{\text{Count}}{\text{Population Size}} \times 100{,}000 \]

For example:

If there were 500 new cases of COVID-19 reported during the week for a demographic group with a population of 200,000, the weekly case rate for that group would be 250 cases per 100,000 population.
Whereas if there were 500 new cases of COVID-19 reported that week for a different demographic group with a population of only 50,000, the weekly case rate for this second group would be 1000 cases per 100,000 population, which is a rate 4 times higher than the first group.
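The two worked examples above can be reproduced with a one-line helper. The function name `crude_rate()` and its `per` argument are made up for this sketch; the arithmetic follows the rate formula given earlier.

```r
# Hypothetical helper implementing Rate = Count / Population Size x 100,000
crude_rate <- function(count, population, per = 100000) {
  count / population * per
}

# First group: 500 cases in a population of 200,000
rate_group_1 <- crude_rate(500, 200000)   # 250 per 100,000

# Second group: 500 cases in a population of 50,000
rate_group_2 <- crude_rate(500, 50000)    # 1000 per 100,000

# Ratio of the two rates
rate_group_2 / rate_group_1               # 4
```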

What the CDC documentation does not explain is how the population size was determined for the demographic groups. One would presume that US Census population data was utilized. (The most recent decennial census was conducted in 2020, and its first results were released in April 2021.) One might also wonder whether the CDC accounted for changes in population counts over time (i.e., births, deaths, migration), since this COVID data was collected from March 2020 until November 2023.

A potential clue (that may or may not apply to our dataset) comes from another footnote in the aforementioned CDC COVID Data Tracker page. The footnote explains that the CDC calculated rates for the COVID Data Tracker using a set of July 1, 2021 “Blended Base” population estimates from the US Census Bureau. The footnote links to a US Census Bureau paper describing the methodology it uses to update population estimates annually (every July 1) based on births, deaths, and migration that have occurred since the last decennial census (e.g., the 2020 Census).

RECOMMENDATION: While the CDC documentation for the dataset provides some clarity on what the rates represent, the documentation could be improved by clarifying how the population sizes for groups were determined for the rate calculations.

Suppressed Data

The column names for the counts and rates of cases and deaths include the descriptor suppressed.

case_count_suppressed
death_count_suppressed
case_crude_rate_suppressed_per_100k
death_crude_rate_suppressed_per_100k

The CDC documentation explains that if the weekly cumulative count for cases (or deaths) for a specific demographic group was 5 or fewer, the count was suppressed in order to protect the confidentiality of patients. These suppressed counts show up in the dataset as an explicit missing value (NA).

Because the rates are calculated from the counts, if the count for a specific demographic group in a given week was suppressed, then the corresponding rate for that demographic group will also be suppressed for that week. In other words, within a row, an NA value for a count of cases or deaths results in an NA value for the corresponding rate.

Thus, there are four possible scenarios for each row in the dataset:

A row could have complete data for both cases (##) and deaths (##).
A row could have case data (##) but suppressed death data (NA).
A row could have death data (##) but suppressed case data (NA).
A row could have suppressed data for both cases (NA) and deaths (NA).
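The four scenarios rest on the claim that a suppressed count always comes with a suppressed rate. A minimal sketch of how that could be checked follows. The helper `count_rate_mismatch()` and the toy values are invented for illustration; on the real data, the same helper would be applied to the columns of covid.

```r
# Invented toy rows (not real data), one per scenario listed above
toy <- data.frame(
  case_count_suppressed                = c(120, 45, NA, NA),
  case_crude_rate_suppressed_per_100k  = c(30.5, 11.2, NA, NA),
  death_count_suppressed               = c(8, NA, 6, NA),
  death_crude_rate_suppressed_per_100k = c(2.1, NA, 1.4, NA)
)

# Hypothetical helper: number of rows where exactly one of count/rate is NA
count_rate_mismatch <- function(count, rate) {
  sum(is.na(count) != is.na(rate))
}

case_mismatches <- count_rate_mismatch(
  toy$case_count_suppressed, toy$case_crude_rate_suppressed_per_100k
)
death_mismatches <- count_rate_mismatch(
  toy$death_count_suppressed, toy$death_crude_rate_suppressed_per_100k
)

# Zero mismatches means counts and rates are always suppressed together
c(cases = case_mismatches, deaths = death_mismatches)
```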

Using these scenarios, let’s determine the data completeness for case and death data across the rows.

# Total number of rows in dataset
total_rows <- nrow(covid)

# Number of rows with complete data for both cases and deaths
complete_rows <- covid |>
  filter(!is.na(case_count_suppressed), !is.na(case_crude_rate_suppressed_per_100k),
         !is.na(death_count_suppressed), !is.na(death_crude_rate_suppressed_per_100k)) |>
  nrow()

# Number of rows with case data but suppressed death data (death count <= 5)
case_only_rows <- covid |>
  filter(!is.na(case_count_suppressed), !is.na(case_crude_rate_suppressed_per_100k),
         is.na(death_count_suppressed), is.na(death_crude_rate_suppressed_per_100k)) |>
  nrow()

# Number of rows with death data but suppressed case data (case count <= 5)
death_only_rows <- covid |>
  filter(is.na(case_count_suppressed), is.na(case_crude_rate_suppressed_per_100k),
         !is.na(death_count_suppressed), !is.na(death_crude_rate_suppressed_per_100k)) |>
  nrow()

# Number of rows with suppressed data for both cases and deaths (counts <= 5)
suppressed_rows <- covid |>
  filter(is.na(case_count_suppressed), is.na(case_crude_rate_suppressed_per_100k),
         is.na(death_count_suppressed), is.na(death_crude_rate_suppressed_per_100k)) |>
  nrow()

# Create dataframe with row counts for different scenarios of data completeness
# Note: using leading spaces in strings to order scenarios in plot
data_completeness <- c("Both Case and Death Data", " Case Data but Deaths Suppressed", "  Death Data but Cases Suppressed", "   Both Cases and Deaths Suppressed")
num_rows <- c(complete_rows, case_only_rows, death_only_rows, suppressed_rows)
data_comp_df <- data.frame(data_completeness, num_rows)

# Plot bar chart to show data completeness
data_comp_df |>
  ggplot() +
  geom_bar(mapping = aes(x = data_completeness, y = num_rows), stat = "identity") + 
  geom_text(aes(x = data_completeness, y = num_rows, label = str_c(num_rows, " (", round(num_rows / total_rows * 100, digits = 1), "%)")), hjust = -0.2) +
  coord_flip() +
  ylim(0, 36000) +
  labs(title = "Data Completeness Within Rows", x = "", y = "Number of Rows") +
  theme_classic()

As it turns out, only about a quarter of the rows (~24%) have complete data for both cases and deaths. Most of the rows (~71%) have only case data, with deaths suppressed. A small proportion of rows (about 5%) have both cases and deaths suppressed.
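For comparison, the same kind of tally can be produced in a single pass by labeling each row with its scenario and counting the labels. The sketch below uses invented toy values rather than the real data, and it assumes that counts and rates are always suppressed together, so only the count columns are inspected.

```r
library(dplyr)

# Invented toy rows (not real data)
toy <- data.frame(
  case_count_suppressed  = c(120, 45, NA, NA, 300),
  death_count_suppressed = c(8, NA, 6, NA, NA)
)

completeness <- toy |>
  mutate(
    # Label each row with one of the four scenarios
    scenario = case_when(
      !is.na(case_count_suppressed) & !is.na(death_count_suppressed) ~ "Both Case and Death Data",
      !is.na(case_count_suppressed) &  is.na(death_count_suppressed) ~ "Case Data but Deaths Suppressed",
       is.na(case_count_suppressed) & !is.na(death_count_suppressed) ~ "Death Data but Cases Suppressed",
      TRUE ~ "Both Cases and Deaths Suppressed"
    )
  ) |>
  # Count rows per scenario and convert to a percentage of all rows
  count(scenario) |>
  mutate(percent = round(n / sum(n) * 100, digits = 1))

completeness
```

A scenario with zero rows would simply not appear in this output, whereas the separate filters used earlier report it explicitly as 0.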

Implicit Missing Rows

As explained previously, if the weekly cumulative count for cases (or deaths) for a specific demographic group was 5 or fewer, the count was suppressed in order to protect the confidentiality of patients. These suppressed counts show up in the dataset as an explicit missing value (NA).

As we saw earlier, most rows (~71%) have only case data with deaths suppressed (NA). A small proportion of rows (~5%) have both cases and deaths suppressed.

Interestingly, there is another form of missing data in this dataset: implicit missing rows.

Each row in the dataset presents weekly data for a specific demographic group based on a combination of age group, sex at birth, and race/ethnicity group.

Age Group has 11 possible values in this dataset:
- 0-4 Years, 5-11 Years, 12-15 Years, 16-17 Years, 18-29 Years, 30-39 Years, 40-49 Years, 50-64 Years, 65-74 Years, 75+ Years, or Overall (all ages combined)
Sex At Birth has 3 possible values in this dataset:
- Female, Male, or Overall (all sexes combined)
Race/Ethnicity Group has 6 possible values in this dataset:
- AI/AN, NH; Asian/PI, NH; Black, NH; Hispanic; White, NH; or Overall (all race/ethnicity groups combined)

For each week in the dataset, there should be a separate row with weekly COVID-19 data for each possible demographic group combination. For example, one possible demographic group combination would be:

Age Group = 18-29 Years, Sex = Female, Race/Ethnicity Group = Hispanic

Based on the number of possible values for the three demographic variables, the number of different demographic group combinations in this dataset is:

\[ 11 \times 3 \times 6 = 198 \]
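The same count can be obtained by enumerating every combination. In the sketch below, the age labels follow the spaced format seen in the sample rows (for example "30 - 39 Years"). The labels for groups that did not appear in that sample are assumed to follow the same pattern.

```r
# Possible values for each demographic variable
# (age labels not seen in the sample rows are assumed to match its format)
age_group <- c("0 - 4 Years", "5 - 11 Years", "12 - 15 Years", "16 - 17 Years",
               "18 - 29 Years", "30 - 39 Years", "40 - 49 Years", "50 - 64 Years",
               "65 - 74 Years", "75+ Years", "Overall")
sex <- c("Female", "Male", "Overall")
race_ethnicity_combined <- c("AI/AN, NH", "Asian/PI, NH", "Black, NH",
                             "Hispanic", "White, NH", "Overall")

# One row per combination of the three variables
all_groups <- expand.grid(
  age_group = age_group,
  sex = sex,
  race_ethnicity_combined = race_ethnicity_combined,
  stringsAsFactors = FALSE
)

nrow(all_groups)
```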

As we saw previously, even if a specific demographic group has suppressed data (NA) for both cases and deaths in a given week, it is still included as a row within the dataset, albeit a row with missing numeric data.

So we would expect the dataset to have 198 rows of COVID-19 data for each week: one row for each possible demographic group combination.

Let’s verify if this is true or not.

# Count actual number of rows for each week
num_weekly_rows <- covid |>
  group_by(end_of_week) |>
  summarise(
    total_rows = n()
  )

# Plot histogram to show distribution of number of rows per week
num_weekly_rows |>
  ggplot() +
  geom_histogram(mapping = aes(x = total_rows), binwidth = 1, color = 'white') +
  geom_vline(xintercept = 198, color = 'orange') +
  labs(title = "COVID-19 Weekly Data (March 7, 2020 - November 18, 2023)",
       subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
       x = "Number of Rows of Data per Week", y = "Count of Weeks") +
  annotate("text", x = 187, y = 110, label = "Expected # Rows = 198", color = 'orange') +
  theme_classic() +
  theme(plot.subtitle = element_text(colour = "darkgray"))

# Number of weeks in dataset
total_weeks <- nrow(num_weekly_rows)

# Number of weeks with all 198 rows for every demographic combination
weeks_198_rows <- nrow(filter(num_weekly_rows, total_rows == 198))

# Percent of weeks with all 198 rows for every demographic combination
percent_weeks_198_rows <- round(weeks_198_rows / total_weeks * 100, digits = 1)

cat("Weeks with 198 Rows =", percent_weeks_198_rows, "%\n")

## Weeks with 198 Rows = 57.2 %

As the histogram shows, many of the weeks in the dataset do not have the complete set of 198 rows for every demographic group combination.

What does it mean if a particular demographic combination is not included in the dataset for a given week?

Let’s look at an example for a specific week.

For the week ending 3/21/2020, there is a row for [Age Group = 16-17 Years, Sex = Male, Race/Ethnicity Group = Hispanic] which has NA for both case and death data. Presumably, this represents suppressed data where the counts for cases and deaths were both 5 or fewer for this particular demographic group.
However, for this same week, there is no row for [Age Group = 16-17 Years, Sex = Male, Race/Ethnicity Group = AI/AN, NH]. Does the absence of this specific demographic group imply that there were zero cases and zero deaths for that group in this given week?

The mystery here is whether the implicit missing rows have any meaning at all.

The CDC documentation for this dataset does not explain why a specific demographic group can be absent in a given week while another demographic group is included for the same week with NA values for its case and death data. However, we do know that about 43% of the weeks in the dataset (all but the 57.2% that have 198 rows) have one or more missing rows for certain demographic groups.
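One way to list the implicit missing rows is to build every expected combination and keep only those absent from the data. The sketch below uses an invented toy data frame with two demographic variables instead of three. `tidyr::complete()` makes the missing combinations explicit, and `dplyr::anti_join()` separates them from rows that were already present with NA values.

```r
library(dplyr)
library(tidyr)

# Invented toy stand-in for covid: one week, one combination absent
toy <- data.frame(
  end_of_week = as.Date("2020-03-21"),
  age_group = c("16 - 17 Years", "16 - 17 Years", "Overall"),
  sex = c("Male", "Female", "Male"),
  case_count_suppressed = c(NA, 12, 40)
)

# complete() adds a row for every week x age_group x sex combination
# that is not present, filling the remaining columns with NA.
# anti_join() then keeps only the rows that were absent from the original data,
# so rows that were present but suppressed (NA) are not flagged.
implicit_missing <- toy |>
  complete(end_of_week, age_group, sex) |>
  anti_join(toy, by = c("end_of_week", "age_group", "sex"))

implicit_missing
```

This only identifies which combinations are absent. It cannot say whether an absent row means zero cases and deaths, which is the question the documentation leaves open.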

RECOMMENDATION: The CDC documentation for the dataset should explain what a missing row for a demographic group means, as compared to an included row that has missing/suppressed data.

This dataset will be explored further in subsequent data dives.

H510 Week 5 Data Dive

Michael Frontz

2024-02-12