COVID-19 Weekly Cases and Deaths by Age, Race/Ethnicity, and Sex (Mar 2020 - Nov 2023)

Background

The dataset explored in this data dive comes from weekly data submitted to the Centers for Disease Control and Prevention (CDC) for COVID-19 cases and deaths for the week ending March 7, 2020 through the week ending November 18, 2023.

The CDC summarized weekly reports of cases and deaths (representing individual patients) into weekly counts and rates (i.e., number per 100,000 population) categorized by region, age group, race/ethnicity group, and sex at birth. Weekly counts with five or fewer cases or deaths were not included (i.e., were suppressed) in the dataset by the CDC, in order to protect the confidentiality of patients.

The original archived dataset was downloaded from the CDC Data Catalog and contains over 400,000 rows of data for the entire United States (including US territories). The Department of Health and Human Services (HHS) groups the states and territories into 10 regions.

Department of Health and Human Services (HHS) Regions

Focus on Region 5 Data

⚠️ For the purposes of this data dive, the dataset has been reduced to only include data for Region 5, which consists of the states of Illinois, Indiana, Michigan, Minnesota, Ohio, and Wisconsin. This results in a revised dataset of about 38,000 rows.

Load Dataset

First, load tidyverse, which is a collection of R packages designed for data science.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Next, read in the dataset, which is saved as a CSV file.

covid <- read_delim("./COVID_weekly_cases_deaths_region5.csv", delim = ",")

## Rows: 37867 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): end_of_week, jurisdiction, age_group, sex, race_ethnicity_combined
## dbl (4): case_count_suppressed, death_count_suppressed, case_crude_rate_supp...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The dataset contains 37,867 rows and 9 columns: 5 columns of text data (chr) and 4 columns of numeric data (dbl).

Let’s view the full list of column names.

colnames(covid)

## [1] "end_of_week"                         
## [2] "jurisdiction"                        
## [3] "age_group"                           
## [4] "sex"                                 
## [5] "race_ethnicity_combined"             
## [6] "case_count_suppressed"               
## [7] "death_count_suppressed"              
## [8] "case_crude_rate_suppressed_per_100k" 
## [9] "death_crude_rate_suppressed_per_100k"

The data for “End of Week” will be converted from a character format to a date format, which will help with certain analyses and visualizations.

class(covid$end_of_week)

## [1] "character"

covid$end_of_week <- as.Date(covid$end_of_week, format="%m/%d/%y")
class(covid$end_of_week)

## [1] "Date"
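Because a date string that does not match the format is silently converted to NA, a quick sanity check after parsing can be useful. The following is a minimal sketch on hypothetical sample strings, assuming the month/day/two-digit-year layout implied by the format string above; the vector `sample_dates` is made up for illustration and is not taken from the CSV file.

```r
# Hypothetical sample strings in the assumed month/day/two-digit-year layout
sample_dates <- c("03/07/20", "01/15/22", "11/18/23")

parsed_dates <- as.Date(sample_dates, format = "%m/%d/%y")

# A string that fails to parse becomes NA, so count the NAs
sum(is.na(parsed_dates))

# Confirm the earliest and latest parsed dates match the expected period
range(parsed_dates)
```

The same two checks could be run on covid$end_of_week to confirm that the parsed dates span March 7, 2020 through November 18, 2023.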

Numeric Summaries

The COVID-19 dataset for Region 5 (IL, IN, MI, MN, OH, WI) provides weekly counts of cases and deaths, as well as weekly rates of cases and deaths.

The data include overall counts and rates for each week, as well as weekly breakdowns by age group, sex at birth, and race/ethnicity group, allowing for a detailed demographic view of how COVID-19 affected the population from March 2020 through November 18, 2023.

Individuals with Unknown Demographic Data

⚠️ The CDC included individuals with unknown or missing data for age, sex, or race/ethnicity in the overall weekly case and death counts, but those individuals are not represented in one or more demographic breakout groups (depending on which demographic data was unknown for that individual).

For example, if an individual who died from COVID-19 was of unknown age, their case would be included in the overall weekly death count but not be represented in an age group.
However, if the sex and race/ethnicity of that same individual were known, they would be represented in the weekly death counts for their corresponding sex group and race/ethnicity group.

Overall Weekly Case Counts

# Filter data to focus on Overall totals for each week
weekly_totals_overall <- covid |>
  filter(age_group == "Overall" & sex == "Overall" & race_ethnicity_combined == "Overall")

# Summary of Weekly COVID Cases in Region 5 (March 7, 2020 - Nov 18, 2023)
summary(weekly_totals_overall$case_count_suppressed)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1086   26862   57134   85860  105618  573847

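The overall weekly case counts have a wide distribution, ranging from about 1,100 to about 574,000 cases per week. The mean (~86K cases per week) is well above the median (~57K cases per week), indicating the distribution is right-skewed: a relatively small number of weeks with very high counts pull the mean upward.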

Overall Weekly Case Rates

The weekly case rates are calculated from the weekly case counts. The rate indicates the number of cases per 100,000 population. The overall rates are calculated for the total population.

# Summary of Weekly COVID Case Rates in Region 5 (March 7, 2020 - Nov 18, 2023)
summary(weekly_totals_overall$case_crude_rate_suppressed_per_100k)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.07   51.13  108.74  163.41  201.01 1092.17

When rates are calculated for a subgroup (such as Hispanic females age 30-39 years), the rate is calculated based on the population of that specific subgroup. Rates allow for an “apples to apples” comparison among subgroups, despite their differing population sizes.
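As a minimal sketch of the rate calculation described above, the helper below divides a count by a population and scales the result to 100,000 people. The function name `crude_rate_per_100k` and the subgroup numbers are hypothetical and only for illustration; the CDC's actual population denominators are not part of this dataset.

```r
# Hypothetical helper (not part of the dataset or of tidyverse):
# crude rate = events per 100,000 people in the relevant population
crude_rate_per_100k <- function(count, population) {
  count / population * 100000
}

# Made-up subgroup: 150 cases in a population of 60,000
crude_rate_per_100k(count = 150, population = 60000)

# Rearranging the formula gives the population implied by a count and its rate.
# Using the maximum overall weekly case count and rate reported above:
implied_population <- 573847 / 1092.17 * 100000
implied_population
```

Assuming the maximum count and maximum rate come from the same week, the second calculation suggests an overall Region 5 denominator of roughly 52.5 million people.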

Overall Weekly Death Counts

# Summary of Weekly COVID Deaths in Region 5 (March 7, 2020 - Nov 18, 2023)
summary(weekly_totals_overall$death_count_suppressed)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.0   225.0   445.0   938.9  1308.0  4863.0

The overall weekly death counts also have a wide distribution. The mean (~939 deaths per week) is more than double the median (445 deaths per week), indicating the distribution is right-skewed.
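One rough way to compare the degree of skew between the two measures is the ratio of mean to median. The sketch below simply re-enters the values from the summary() output above; it is an informal indicator rather than a formal skewness statistic.

```r
# Values copied from the summary() output for overall weekly cases and deaths
weekly_summary <- data.frame(
  measure = c("cases", "deaths"),
  median  = c(57134, 445),
  mean    = c(85860, 938.9)
)

# A ratio well above 1 suggests a long right tail pulling the mean upward
weekly_summary$mean_to_median <- weekly_summary$mean / weekly_summary$median
weekly_summary
```

By this informal measure, the weekly death counts appear more strongly right-skewed than the weekly case counts.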

Overall Weekly Death Rates

The weekly death rates are calculated from the weekly death counts. The rate indicates the number of deaths per 100,000 population. The overall rates are calculated for the total population.

# Summary of Weekly COVID Death Rates in Region 5 (March 7, 2020 - Nov 18, 2023)
summary(weekly_totals_overall$death_crude_rate_suppressed_per_100k)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.020   0.430   0.850   1.787   2.490   9.260

Total Cases and Deaths by Age Group

# Filter data to focus on totals by Age Group per week (March 7, 2020 - Nov 18, 2023)
covid |>
  filter(race_ethnicity_combined == "Overall" & sex == "Overall") |>
  group_by(age_group) |>
  summarise(case_count = sum(case_count_suppressed, na.rm = TRUE),
            death_count = sum(death_count_suppressed, na.rm = TRUE)
  ) |>
  arrange(desc(death_count), .by_group = TRUE)

## # A tibble: 11 × 3
##    age_group     case_count death_count
##    <chr>              <dbl>       <dbl>
##  1 Overall         16656806      182153
##  2 75+ Years        1170147      104685
##  3 65 - 74 Years    1337530       39601
##  4 50 - 64 Years    3188286       28509
##  5 40 - 49 Years    2299271        5597
##  6 30 - 39 Years    2644463        2171
##  7 18 - 29 Years    3273449         666
##  8 0 - 4 Years       588130           7
##  9 12 - 15 Years     703991           0
## 10 16 - 17 Years     423796           0
## 11 5 - 11 Years     1005827           0

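The summary data above is sorted by death count (highest to lowest) and shows the number of deaths rising steeply with age. For individuals under 18 years of age, only 7 deaths appear in the data for Region 5 (IL, IN, MI, MN, OH, WI). Because the CDC suppressed weekly counts of five or fewer, this figure and the zeros for the age groups from 5 to 17 years should be read as lower bounds rather than true totals.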
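Since weekly counts of five or fewer were suppressed by the CDC, summing with na.rm = TRUE treats each suppressed week as zero. The sketch below uses hypothetical toy data, and assumes suppressed values appear as NA (as the use of na.rm = TRUE above suggests), to show how the number of suppressed weeks per group could be reported alongside the totals.

```r
library(dplyr)

# Hypothetical toy data: NA marks a weekly count that was suppressed
toy <- tibble(
  age_group = c("0 - 4 Years", "0 - 4 Years", "0 - 4 Years",
                "75+ Years", "75+ Years", "75+ Years"),
  death_count_suppressed = c(NA, 7, NA, 120, 95, 143)
)

suppression_check <- toy |>
  group_by(age_group) |>
  summarise(
    reported_deaths  = sum(death_count_suppressed, na.rm = TRUE),  # suppressed weeks add 0
    suppressed_weeks = sum(is.na(death_count_suppressed)),         # weeks hidden by suppression
    total_weeks      = n()
  )
suppression_check
```

A group with many suppressed weeks relative to its total weeks is one where the reported total is most likely an undercount.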

Total Cases and Deaths by Sex

# Filter data to focus on totals by Sex per week (March 7, 2020 - Nov 18, 2023)
covid |>
  filter(age_group == "Overall" & race_ethnicity_combined == "Overall") |>
  group_by(sex) |>
  summarise(case_count = sum(case_count_suppressed, na.rm = TRUE),
            death_count = sum(death_count_suppressed, na.rm = TRUE)
  ) |>
  arrange(desc(death_count), .by_group = TRUE)

## # A tibble: 3 × 3
##   sex     case_count death_count
##   <chr>        <dbl>       <dbl>
## 1 Overall   16656806      182153
## 2 Male       7508422       97517
## 3 Female     8950447       84475

The summary data above shows that females accounted for more reported cases of COVID-19, while males accounted for more deaths from COVID-19.
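To put the comparison on a common scale, the totals can be converted to shares. This is a minimal sketch that re-enters the Female and Male totals from the table above; the shares are among individuals with known sex only, since the Overall row also includes individuals whose sex was unknown.

```r
# Totals copied from the table above (the "Overall" row is excluded)
cases_by_sex  <- c(Female = 8950447, Male = 7508422)
deaths_by_sex <- c(Female = 84475,   Male = 97517)

# prop.table() converts counts to proportions that sum to 1
share_by_sex <- rbind(
  cases  = prop.table(cases_by_sex),
  deaths = prop.table(deaths_by_sex)
)

# Express the shares as percentages
round(share_by_sex * 100, 1)
```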

Total Cases and Deaths by Race/Ethnicity

How the CDC Determines Race/Ethnicity Groups

Although data for a patient’s Ethnicity and Race are typically collected as separate demographic questions, the CDC typically combines these fields for data analysis purposes by categorizing individuals as belonging to a particular Race/Ethnicity group.

If an individual identifies as Hispanic/Latino for ethnicity, the CDC categorizes that individual’s Race/Ethnicity group as Hispanic/Latino, regardless of which race(s) that individual identifies with.

For example, in this dataset, Hispanic will include individuals who identify as Hispanic/Latino and White; individuals who identify as Hispanic/Latino and Black/African American; individuals who identify as Hispanic/Latino, Black/African American, and White; etc.

The CDC then defines the other groups as individuals who are Non-Hispanic and identify with a single race designation, such as:

American Indian/Alaska Native, Non-Hispanic (AI/AN, NH)
Asian/Pacific Islander, Non-Hispanic (Asian/PI, NH)
Black/African American, Non-Hispanic (Black, NH)
White, Non-Hispanic (White, NH)

Non-Hispanic Individuals Who Identify as Multiple Races

⚠️ If an individual identifies as Non-Hispanic and more than one race, the CDC will typically categorize the individual as Multiracial for data analysis purposes. However, this dataset does not include Multiracial as a breakout group within the Race/Ethnicity data. Instead, the CDC has included counts for individuals who identify as Non-Hispanic and multiple races as part of the overall weekly counts.

As a reminder, individuals with unknown or missing data for race and ethnicity are likewise included in the overall weekly counts but are not represented as a breakout group for Race/Ethnicity.

Counts of Cases and Deaths by Race/Ethnicity

Given that background, here are the counts for total cases and deaths by race/ethnicity.

# Filter data to focus on totals by Race/Ethnicity per week (March 7, 2020 - Nov 18, 2023)
covid |>
  filter(age_group == "Overall" & sex == "Overall") |>
  group_by(race_ethnicity_combined) |>
  summarise(case_count = sum(case_count_suppressed, na.rm = TRUE),
            death_count = sum(death_count_suppressed, na.rm = TRUE)
  ) |>
  arrange(desc(death_count), .by_group = TRUE)

## # A tibble: 6 × 3
##   race_ethnicity_combined case_count death_count
##   <chr>                        <dbl>       <dbl>
## 1 Overall                   16656806      182153
## 2 White, NH                  8609844      134173
## 3 Black, NH                  1409467       20868
## 4 Hispanic                   1215287        8805
## 5 Asian/PI, NH                366237        2580
## 6 AI/AN, NH                    64169         479

To better understand how COVID-19 may have disproportionately impacted these race/ethnicity groups, it would be useful to compare their case rates and death rates because rates account for the differences in population sizes among groups.

However, even these counts for cases and deaths clearly reveal disparities in the impact of COVID-19 among race/ethnicity groups. For example, the groups for Black, NH and Hispanic had relatively similar total case counts (~1.41M cases vs. ~1.22M cases, respectively); however, the death count for the Black, NH group (~21K deaths) was more than double that of the Hispanic group (~9K deaths).
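The comparison in the paragraph above can be made explicit by dividing deaths by cases for each group. The sketch below re-enters the totals from the table; the resulting figure is a crude ratio of reported deaths to reported cases, not a population rate, and it is not adjusted for age structure, testing access, or suppressed weekly counts.

```r
# Totals copied from the table above (the "Overall" row is excluded)
race_totals <- data.frame(
  group  = c("White, NH", "Black, NH", "Hispanic", "Asian/PI, NH", "AI/AN, NH"),
  cases  = c(8609844, 1409467, 1215287, 366237, 64169),
  deaths = c(134173, 20868, 8805, 2580, 479)
)

# Crude deaths per 100 reported cases for each group
race_totals$deaths_per_100_cases <-
  round(race_totals$deaths / race_totals$cases * 100, 2)

# Sort from highest to lowest ratio
race_totals[order(-race_totals$deaths_per_100_cases), ]
```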

# How many deaths were for individuals who were Non-Hispanic multiple races or unknown race/ethnicity 
deaths_overall <- sum(weekly_totals_overall$death_count_suppressed)

deaths_known_race_ethnicity <- covid |>
  filter(age_group == "Overall" & sex == "Overall" & race_ethnicity_combined != "Overall")

deaths_unknown_race_ethnicity <- deaths_overall - sum(deaths_known_race_ethnicity$death_count_suppressed, na.rm = TRUE)

deaths_unknown_race_ethnicity

## [1] 15248

# What percentage of deaths were for individuals who were Non-Hispanic multiple races or unknown race/ethnicity
cat(round(deaths_unknown_race_ethnicity / deaths_overall * 100, digits = 1), "%")

## 8.4 %

Additionally, subtracting the sum of the death counts for the race/ethnicity breakout groups from the overall death count leaves 15,248 deaths (8.4% of the overall deaths). These represent individuals who either identified as multiple, Non-Hispanic races or had unknown data for race/ethnicity, along with any deaths in weekly breakout counts that were suppressed by the CDC (five or fewer).

Potential Questions

This initial exploration of the dataset raises a number of questions to potentially explore further, including but not limited to:

What is the relationship between cases and deaths over time? For example, what is the lag time between an increase in cases and a subsequent increase in deaths? (How might this pattern be used to anticipate and reduce the number of deaths?)
What disparities exist in how COVID-19 affected different demographic groups of people over time? For example, which groups were most affected or least affected? (What social factors might be responsible for these differences, and how might we reduce the resulting disparities?)
How might the patterns found in the Region 5 data (IL, IN, MI, MN, OH, WI) compare with data for other regions of the United States? For example, did some regions have higher disparities or lower disparities among demographic groups? (How might these comparisons be used to identify ways to lower the impact or disparities across all regions?)
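As one possible starting point for the first question, the lag between cases and deaths could be estimated by shifting one series against the other and finding the shift with the highest correlation. The sketch below runs on synthetic data only: the series `cases` and `deaths`, the 3-week lag, and the helper `lag_correlation` are all hypothetical and are not derived from the Region 5 dataset.

```r
set.seed(42)

n_weeks  <- 60
true_lag <- 3  # assumed lag built into the synthetic data

# Synthetic weekly cases: a wave pattern plus random noise
cases <- 1000 + 500 * sin(seq_len(n_weeks) / 5) + rnorm(n_weeks, sd = 50)

# Synthetic weekly deaths: 1% of the cases from true_lag weeks earlier
deaths <- c(rep(NA, true_lag), 0.01 * head(cases, n_weeks - true_lag))

# Correlation between cases in week t and deaths in week t + k
lag_correlation <- function(k) {
  cor(head(cases, n_weeks - k), tail(deaths, n_weeks - k), use = "complete.obs")
}

candidate_lags <- 0:8
correlations   <- sapply(candidate_lags, lag_correlation)

# The candidate lag with the highest correlation
best_lag <- candidate_lags[which.max(correlations)]
best_lag
```

With real data the peak would be less sharp, and the lag may well differ across waves, so the result would need to be interpreted with caution.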

Visual Summaries

Weekly Deaths by Sex Over Time

The data can be aggregated (grouped) by sex (females vs. males) and plotted as a multi-line graph to explore how their patterns in weekly deaths compare over time.

library(ggthemes)
covid |>
  filter(age_group == "Overall" & race_ethnicity_combined == "Overall" & sex != "Overall") |>
  group_by(sex) |>
  ggplot() +
  geom_line(mapping = aes(x = end_of_week, y = death_count_suppressed, group = sex, color = sex), na.rm = TRUE) +
  theme_hc() + 
  labs(title = "COVID-19 Deaths (March 7, 2020 - November 18, 2023)",
       subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
       x = "", y = "Death Count") +
  theme(plot.subtitle = element_text(colour = "darkgray")) +
  theme(legend.position = "bottom")

The graph shows the lines for weekly deaths by sex closely mirror each other over time; however, the number of deaths among males is typically higher, especially during peaks of deaths.

Distribution of Weekly Death Counts

As seen in the previous graph, there were numerous peaks of increased weekly deaths. A histogram can be generated to better analyze the distribution of the overall weekly death counts.

median_death_count <- median(weekly_totals_overall$death_count_suppressed)
mean_death_count <- mean(weekly_totals_overall$death_count_suppressed)

weekly_totals_overall |>
  ggplot() +
  geom_histogram(mapping = aes(x = death_count_suppressed), binwidth = 100, color = 'white') +
  geom_vline(xintercept = median_death_count, color = 'orange') +
  geom_vline(xintercept = mean_death_count, color = 'red') +
  labs(title = "COVID-19 Weekly Death Counts (March 7, 2020 - November 18, 2023)",
       subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
       x = "Weekly Death Count", y = "Number of Weeks") +
  annotate("text", x = 120, y = 30, label = "Median", color = 'orange') +
  annotate("text", x = 1300, y = 30, label = "Average", color = 'red') +
  theme_classic() +
  theme(plot.subtitle = element_text(colour = "darkgray"))

The histogram shows the distribution of weekly death counts is right-skewed, with most weeks having death counts towards the “lower” end (though these still represent hundreds of deaths weekly) and a long tail extending to the right, for weeks with much higher death counts (representing thousands of deaths weekly).

This dataset will be explored further in subsequent data dives.

H510 Week 2 Data Dive

Michael Frontz

2024-01-22