COVID-19 Weekly Cases and Deaths by Age, Race/Ethnicity, and Sex (Mar 2020 - Nov 2023)

The CDC dataset explored in this data dive contains weekly counts and rates of COVID-19 cases and deaths reported in Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, and Wisconsin) of the United States from March 7, 2020 through November 18, 2023.

Prepare Dataset

To get started, let’s load tidyverse to assist with our data analysis.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Next, let’s read in the dataset from CSV.

covid <- read_delim("./COVID_weekly_cases_deaths_region5.csv", delim = ",")
## Rows: 37867 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): end_of_week, jurisdiction, age_group, sex, race_ethnicity_combined
## dbl (4): case_count_suppressed, death_count_suppressed, case_crude_rate_supp...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The data for “End of Week” will be converted from a character format to a date format to help with analyzing and visualizing the data properly.

covid$end_of_week <- as.Date(covid$end_of_week, format="%m/%d/%y")

Let’s rename some of the columns to make them more concise (for coding convenience).

covid <- covid |>
  rename(race_ethnicity = race_ethnicity_combined,
         case_count = case_count_suppressed,
         death_count = death_count_suppressed,
         case_rate = case_crude_rate_suppressed_per_100k,
         death_rate = death_crude_rate_suppressed_per_100k
  )

Finally, let’s add a new column to help order the age groups correctly, and then arrange all the rows in nested order by week, age, sex, and race/ethnicity.

# Add new column to order age groups correctly
covid <- covid |>
  mutate(age_order = case_when(
    age_group == "Overall" ~ 0L,
    age_group == "0 - 4 Years" ~ 1L,
    age_group == "5 - 11 Years" ~ 2L,
    age_group == "12 - 15 Years" ~ 3L,
    age_group == "16 - 17 Years" ~ 4L,
    age_group == "18 - 29 Years" ~ 5L,
    age_group == "30 - 39 Years" ~ 6L,
    age_group == "40 - 49 Years" ~ 7L,
    age_group == "50 - 64 Years" ~ 8L,
    age_group == "65 - 74 Years" ~ 9L,
    age_group == "75+ Years" ~ 10L),
    .after = age_group) |>
  arrange(end_of_week, age_order, sex, race_ethnicity)

Overall Cases and Deaths

Let’s create a dataframe for the overall totals for each week.

# Overall Totals for Each Week
weekly_totals_overall <- covid |>
  filter(age_group == "Overall" & sex == "Overall" & race_ethnicity == "Overall")

# Display total counts for cases and deaths & mean rates for cases and deaths
weekly_totals_overall |>
  summarise(cases = sum(case_count),
            deaths = sum(death_count),
            mean_case_rate = mean(case_rate),
            mean_death_rate = mean(death_rate)
  )
## # A tibble: 1 × 4
##      cases deaths mean_case_rate mean_death_rate
##      <dbl>  <dbl>          <dbl>           <dbl>
## 1 16656806 182153           163.            1.79

The summary above shows the overall number of cases and deaths, as well as the overall mean weekly rates for cases and deaths.

The weekly rates are calculated from the weekly counts. The rate indicates the number of cases (or deaths) per 100,000 population.

These overall mean rates can serve as a reference point for identifying demographic subgroups that have a mean rate higher or lower than the overall mean rate.

Data by Demographic Groups

Cases and Deaths by Age Group

Let’s create a dataframe of the weekly data by age group.

# Weekly Totals by Age Group
age_group_totals <- covid |>
  filter(race_ethnicity == "Overall" & sex == "Overall" & age_group != "Overall")

# Display total counts for cases and deaths & mean rates for cases and deaths
age_group_totals |>
  group_by(age_group) |>
  summarise(age_order = mean(age_order),
            cases = sum(case_count, na.rm = TRUE),
            deaths = sum(death_count, na.rm = TRUE),
            mean_case_rate = mean(case_rate, na.rm = TRUE),
            mean_death_rate = mean(death_rate, na.rm = TRUE)
  ) |>
  arrange(desc(age_order), .by_group = TRUE)
## # A tibble: 10 × 6
##    age_group     age_order   cases deaths mean_case_rate mean_death_rate
##    <chr>             <dbl>   <dbl>  <dbl>          <dbl>           <dbl>
##  1 75+ Years            10 1170147 104685          162.           14.6  
##  2 65 - 74 Years         9 1337530  39601          133.            4.04 
##  3 50 - 64 Years         8 3188286  28509          158.            1.53 
##  4 40 - 49 Years         7 2299271   5597          188.            0.671
##  5 30 - 39 Years         6 2644463   2171          202.            0.353
##  6 18 - 29 Years         5 3273449    666          199.            0.154
##  7 16 - 17 Years         4  423796      0          162.          NaN    
##  8 12 - 15 Years         3  703991      0          135.          NaN    
##  9 5 - 11 Years          2 1005827      0          115.          NaN    
## 10 0 - 4 Years           1  588130      7           98.2           0.23

The summary above shows the total number of cases and deaths by age group, as well as their mean weekly rates for cases and deaths.

  • The weekly rates are calculated from the weekly counts. The rate indicates the number of cases (or deaths) per 100,000 population.
  • When rates are calculated for a subgroup (such as: age 18-29 Years), the rate is calculated based on the population of that specific subgroup.
  • Rates allow for an “apples to apples” comparison among subgroups, despite any differences in their population sizes.

Case Rates by Age Group

Let’s create a boxplot to compare the distributions of weekly case rates by age group.

# fct_reorder() reorders factor levels by sorting by another variable
# in this case, age_group will be sorted by age_order 

age_group_totals |>
  filter(!is.na(case_rate)) |>
  ggplot() +
  geom_boxplot(mapping = aes(x = case_rate, y = fct_reorder(age_group, age_order), fill = age_group), show.legend = FALSE) +
  labs(title = "COVID-19 Weekly Case Rates by Age Group (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
       x = "Weekly Case Rate", y = "Age Group") +
  theme_minimal() +
  theme(plot.subtitle = element_text(colour = "darkgray"))

The boxplot of weekly case rates shows that the median case rates were lower for age groups less than 18 years old.

Death Rates by Age Group

Now let’s examine a boxplot of the weekly death rates by age group.

age_group_totals |>
  filter(!is.na(death_rate)) |>
  ggplot() +
  geom_boxplot(mapping = aes(x = death_rate, y = fct_reorder(age_group, age_order), fill = age_group), show.legend = FALSE) +
  labs(title = "COVID-19 Weekly Death Rates by Age Group (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
       x = "Weekly Death Rate", y = "Age Group") +
  theme_minimal() +
  theme(plot.subtitle = element_text(colour = "darkgray"))

The boxplot of weekly death rates shows that the median death rates were relatively low for age groups less than 50 years old, but then start increasing with age:

  • The group for ages 75 and older had the highest median death rate.
  • The group for ages 75 and older had the widest interquartile range for weekly death rate.
  • The group for ages 75 and older had the most extreme outliers for weekly death rate.

Note: There were no deaths listed for any week in the dataset for ages 5-11, ages 12-15, or ages 16-17 (which is why these groups are not included in the boxplot above).

Cases and Deaths by Sex At Birth

Let’s create a dataframe of the weekly data by sex at birth.

# Weekly Totals by Sex At Birth
sex_group_totals <- covid |>
  filter(age_group == "Overall" & race_ethnicity == "Overall" & sex != "Overall")

# Display total counts for cases and deaths & mean rates for cases and deaths
sex_group_totals |>
  group_by(sex) |>
  summarise(cases = sum(case_count, na.rm = TRUE),
            deaths = sum(death_count, na.rm = TRUE),
            mean_case_rate = mean(case_rate, na.rm = TRUE),
            mean_death_rate = mean(death_rate, na.rm = TRUE)
  ) |>
  arrange(desc(deaths), .by_group = TRUE)
## # A tibble: 2 × 5
##   sex      cases deaths mean_case_rate mean_death_rate
##   <chr>    <dbl>  <dbl>          <dbl>           <dbl>
## 1 Male   7508422  97517           149.            1.95
## 2 Female 8950447  84475           173.            1.64

The summary above shows the total number of cases and deaths by sex at birth, as well as their mean weekly rates for cases and deaths.

  • The weekly rates are calculated from the weekly counts. The rate indicates the number of cases (or deaths) per 100,000 population.
  • When rates are calculated for a subgroup (such as: Male), the rate is calculated based on the population of that specific subgroup.
  • Rates allow for an “apples to apples” comparison among subgroups, despite any differences in their population sizes.

Case Rates by Sex At Birth

Let’s create a boxplot to compare the distributions of weekly case rates by sex at birth.

sex_group_totals |>
  filter(!is.na(case_rate)) |>
  ggplot() +
  geom_boxplot(mapping = aes(x = case_rate, y = sex, fill = sex), show.legend = FALSE) +
  labs(title = "COVID-19 Weekly Case Rates by Sex (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
       x = "Weekly Case Rate", y = "Sex At Birth") +
  theme_minimal() +
  theme(plot.subtitle = element_text(colour = "darkgray"))

The boxplot of weekly case rates shows that females tended to be more likely to get COVID-19:

  • The median weekly case rate for females was higher compared to males.
  • The interquartile range for females was wider, primarily due to a higher value for the 75th percentile.

Death Rates by Sex At Birth

Now let’s examine a boxplot of the weekly death rates by sex at birth.

sex_group_totals |>
  filter(!is.na(death_rate)) |>
  ggplot() +
  geom_boxplot(mapping = aes(x = death_rate, y = sex, fill = sex), show.legend = FALSE) +
  labs(title = "COVID-19 Weekly Death Rates by Sex (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
       x = "Weekly Death Rate", y = "Sex At Birth") +
  theme_minimal() +
  theme(plot.subtitle = element_text(colour = "darkgray"))

The boxplot of weekly death rates shows a reversal of the pattern seen in weekly case rates. Males tended to be more likely to die from COVID-19, despite having a lower median case rate:

  • The median weekly death rate for males was slightly higher compared to females.
  • The interquartile range for males was wider, primarily due to a higher value for the 75th percentile.

Cases and Deaths by Race/Ethnicity

Let’s create a dataframe of the weekly data by race/ethnicity group.

# Weekly Totals by Race/Ethnicity Group
race_eth_group_totals <- covid |>
  filter(age_group == "Overall" & sex == "Overall" & race_ethnicity != "Overall")

# Display total counts for cases and deaths & mean rates for cases and deaths
race_eth_group_totals |>
  group_by(race_ethnicity) |>
  summarise(cases = sum(case_count, na.rm = TRUE),
            deaths = sum(death_count, na.rm = TRUE),
            mean_case_rate = mean(case_rate, na.rm = TRUE),
            mean_death_rate = mean(death_rate, na.rm = TRUE)
  ) |>
  arrange(desc(deaths), .by_group = TRUE)
## # A tibble: 5 × 5
##   race_ethnicity   cases deaths mean_case_rate mean_death_rate
##   <chr>            <dbl>  <dbl>          <dbl>           <dbl>
## 1 White, NH      8609844 134173          114.             1.78
## 2 Black, NH      1409467  20868          120.             1.95
## 3 Hispanic       1215287   8805          141.             1.43
## 4 Asian/PI, NH    366237   2580           94.3            1.18
## 5 AI/AN, NH        64169    479          146.             5.53

The summary above shows the total number of cases and deaths by race/ethnicity group, as well as their mean weekly rates for cases and deaths.

  • The weekly rates are calculated from the weekly counts. The rate indicates the number of cases (or deaths) per 100,000 population.
  • When rates are calculated for a subgroup (such as: Hispanic), the rate is calculated based on the population of that specific subgroup.
  • Rates allow for an “apples to apples” comparison among subgroups, despite any differences in their population sizes.

Case Rates by Race/Ethnicity

Let’s create a boxplot to compare the distributions of weekly case rates by race/ethnicity group.

race_eth_group_totals |>
  filter(!is.na(case_rate)) |>
  ggplot() +
  geom_boxplot(mapping = aes(x = case_rate, y = race_ethnicity, fill = race_ethnicity), show.legend = FALSE) +
  labs(title = "COVID-19 Weekly Case Rates by Race/Ethnicity (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
       x = "Weekly Case Rate", y = "Race/Ethnicity") +
  theme_minimal() +
  theme(plot.subtitle = element_text(colour = "darkgray"))

The boxplot of weekly case rates shows differences by race/ethnicity group:

  • The American Indian/Alaska Native group and Hispanic group had higher median case rates, compared to the other race/ethnicity groups.
  • The Asian/Pacific Islander group had the lowest median case rate.

Death Rates by Race/Ethnicity

Now let’s examine a boxplot of the weekly death rates by race/ethnicity group.

race_eth_group_totals |>
  filter(!is.na(death_rate)) |>
  ggplot() +
  geom_boxplot(mapping = aes(x = death_rate, y = race_ethnicity, fill = race_ethnicity), show.legend = FALSE) +
  labs(title = "COVID-19 Weekly Death Rates by Race/Ethnicity (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
       x = "Weekly Death Rate", y = "Race/Ethnicity") +
  theme_minimal() +
  theme(plot.subtitle = element_text(colour = "darkgray"))

The boxplot makes it obvious how much higher the median weekly death rate was for the American Indian/Alaska Native group. The entire distribution of the weekly death rates for this group is shifted higher in value, compared to all the other groups.

This boxplot of weekly death rates also shows a reversal of some of the differences seen among the weekly case rates by race/ethnicity group:

  • Despite having a higher median case rate compared to the Black group and White group, the Hispanic group actually had a slightly lower median death rate.
  • In addition to having the lowest median case rate, the Asian/Pacific Islander group also had the lowest median death rate and the narrowest interquartile range.

Combinations of Demographic Groups

Based on the patterns seen in the data above, we could hypothesize which combinations of demographic groups would be expected to be more likely (or less likely) to have had cases of COVID-19 or to have died from COVID-19.

Hypothesis for Cases

Persons predicted to be less likely to have had cases of COVID-19:

  • Younger (<16 years old)
  • Male
  • Asian/Pacific Islander, Black, or White

Persons predicted to be more likely to have had cases of COVID-19:

  • Older (16+ years old)
  • Female
  • American Indian/Alaska Native or Hispanic

Hypothesis for Deaths

Persons predicted to be less likely to have died from COVID-19:

  • Younger (<65 years old)
  • Female
  • Asian/Pacific Islander or Hispanic

Persons predicted to be more likely to have died from COVID-19:

  • Older (65+ years old)
  • Male
  • American Indian/Alaska Native, Black, or White

Create Flags for Demographic Combinations

Let’s create a dataframe with two separate columns to flag these specific combinations of demographics for cases and deaths:

  • A column named flag_cases to indicate demographic groups predicted to have had a lower incidence of cases (flag = CL) or a higher incidence of cases (flag = CH)

  • A column named flag_deaths to indicate demographic groups predicted to have had a lower incidence of deaths (flag = DL) or a higher incidence of deaths (flag = DH)

# Define age groups for predictions
cases_younger <- c("0 - 4 Years", "5 - 11 Years", "12 - 15 Years")

cases_older <- c("16 - 17 Years", "18 - 29 Years", "30 - 39 Years", "40 - 49 Years", "50 - 64 Years", "65 - 74 Years", "75+ Years")

deaths_younger <- c("0 - 4 Years", "5 - 11 Years", "12 - 15 Years", "16 - 17 Years", "18 - 29 Years", "30 - 39 Years", "40 - 49 Years", "50 - 64 Years")

deaths_older <- c("65 - 74 Years", "75+ Years")

# Create dataframe with new column for flag_cases: CL = Cases Lower, CH = Cases Higher
covid_predictions <- covid |>
  mutate(flag_cases = case_when(
    age_group %in% cases_younger & sex == "Male"
      & race_ethnicity %in% c("Asian/PI, NH", "Black, NH", "White, NH") ~ "CL",
    age_group %in% cases_older & sex == "Female"
      & race_ethnicity %in% c("AI/AN, NH", "Hispanic") ~ "CH"),
    .after = race_ethnicity
  )

# Add new column for flag_deaths: DL = Deaths Lower, DH = Deaths Higher
covid_predictions <- covid_predictions |>
  mutate(flag_deaths = case_when(
    age_group %in% deaths_younger & sex == "Female"
      & race_ethnicity %in% c("Asian/PI, NH", "Hispanic") ~ "DL",
    age_group %in% deaths_older & sex == "Male"
      & race_ethnicity %in% c("AI/AN, NH", "Black, NH", "White, NH") ~ "DH"),
    .after = flag_cases
  )

# Replace numeric NA data with 0
covid_predictions <- covid_predictions %>% mutate_if(is.numeric, replace_na, replace = 0)

Flagged Demographic Combinations for Cases

Let’s calculate the mean weekly rates for the demographic groups predicted to have had a higher incidence of cases (flag = CH) or a lower incidence of cases (flag = CL).

  • CH (Cases Higher) = Older (16+ years old) AND Female AND (American Indian/Alaska Native or Hispanic)
  • CL (Cases Lower) = Younger (<16 years old) AND Male AND (Asian/Pacific Islander, Black, or White)
# Display total counts for cases and deaths & mean rates for cases and deaths
covid_predictions |>
  filter(!is.na(flag_cases)) |>
  group_by(flag_cases) |>
  summarise(mean_case_rate = mean(case_rate)
  )
## # A tibble: 2 × 2
##   flag_cases mean_case_rate
##   <chr>               <dbl>
## 1 CH                  165. 
## 2 CL                   82.9

The group flagged as CH had a mean weekly case rate 2 times higher than the group flagged as CL.

Let’s create a boxplot to compare the distributions of weekly case rates by flag_case group.

covid_predictions |>
  filter(flag_cases %in% c("CL", "CH")) |>
  ggplot() +
  geom_boxplot(mapping = aes(x = case_rate, y = flag_cases, fill = flag_cases), show.legend = FALSE) +
  labs(title = "COVID-19 Weekly Case Rates (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
       x = "Weekly Case Rate", y = "Demographic Flag") +
  theme_minimal() +
  theme(plot.subtitle = element_text(colour = "darkgray"))

Let’s create a plot comparing the weekly case rates for the two demographic groups predicted to be either more likely (CH) or less likely (CL) to have had COVID-19.

library(ggthemes)
covid_predictions |>
  filter(flag_cases %in% c("CL", "CH")) |>
  group_by(end_of_week, flag_cases) |>
  ggplot() +
  geom_line(mapping = aes(x = end_of_week, y = case_rate, group = end_of_week, color = flag_cases)) +
  theme_hc() + 
  labs(title = "COVID-19 Weekly Case Rates (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
       x = "", y = "Case Rate", color = "") +
  theme(plot.subtitle = element_text(colour = "darkgray")) +
  theme(legend.position = "bottom")

Flagged Demographic Combinations for Deaths

Now let’s calculate the mean weekly rates for the demographic groups predicted to have had a higher incidence of deaths (flag = DH) or a lower incidence of deaths (flag = DL).

  • DH (Deaths Higher) = Older (65+ years old) AND Male AND (Black or White)
  • DL (Deaths Lower) = Younger (<65 years old) AND Female AND (Asian/Pacific Islander, American Indian/Alaska Native, or Hispanic)
covid_predictions |>
  filter(!is.na(flag_deaths)) |>
  group_by(flag_deaths) |>
  summarise(mean_death_rate = mean(death_rate)
  )
## # A tibble: 2 × 2
##   flag_deaths mean_death_rate
##   <chr>                 <dbl>
## 1 DH                   8.07  
## 2 DL                   0.0802

The group flagged as DH had a mean weekly death rate ~100 times higher than of the group flagged as DL.

Let’s create a boxplot to compare the distributions of weekly death rates by flag_death group.

covid_predictions |>
  filter(flag_deaths %in% c("DL", "DH")) |>
  ggplot() +
  geom_boxplot(mapping = aes(x = death_rate, y = flag_deaths, fill = flag_deaths), show.legend = FALSE) +
  labs(title = "COVID-19 Weekly Death Rates (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
       x = "Weekly Death Rate", y = "Demographic Flag") +
  theme_minimal() +
  theme(plot.subtitle = element_text(colour = "darkgray"))

Now let’s create a plot comparing the weekly death rates for the two demographic groups that were predicted to be more likely or less likely to have died from COVID-19.

covid_predictions |>
  filter(flag_deaths %in% c("DL", "DH")) |>
  group_by(end_of_week, flag_deaths) |>
  ggplot() +
  geom_line(mapping = aes(x = end_of_week, y = death_rate, group = end_of_week, color = flag_deaths)) +
  theme_hc() + 
  labs(title = "COVID-19 Weekly Death Rates (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
       x = "", y = "Death Rate", color = "") +
  theme(plot.subtitle = element_text(colour = "darkgray")) +
  theme(legend.position = "bottom")

The visualizations are consistent with the predictions from the hypotheses, but further investigation will be needed to determine if the differences among the demographic groups are statistically significant.

Furthermore, it is not clear why these patterns occur - what physical or social factors might be causing higher or lower incidences of cases and deaths among these groups?


This dataset will be explored further in subsequent data dives.