COVID-19 Weekly Cases and Deaths in US Region 5 (March 7, 2020 - November 18, 2023)

Introduction

This project focuses on a statistical analysis of a CDC dataset that contains weekly counts and rates of COVID-19 cases and deaths reported in the states of Illinois, Indiana, Michigan, Minnesota, Ohio, and Wisconsin from March 7, 2020 through November 18, 2023.

The original dataset was downloaded from the CDC Data Catalog and contains over 400,000 rows of data for the entire United States (including US territories), which are categorized into 10 regions by the Department of Health and Human Services (HHS).

The CDC summarized individual reports of COVID-19 cases and deaths into weekly counts and rates (i.e., number per 100,000 population) categorized by HHS region, age group, sex at birth, and race/ethnicity group.

However, for this project, the dataset was filtered to only include data for Region 5 (IL, IN, MI, MN, OH, WI), which resulted in a revised data set of about 38,000 rows.

The goal of this project is to understand patterns and disparities in COVID-19 cases and deaths to inform future prevention efforts and address health inequities.

The potential stakeholders for this project include:

  • Healthcare Providers

  • Local and State Public Health Organizations

  • Local and State Government


Load Libraries and Dataset

To get started, we’ll need to load several R library packages to assist with the analysis.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(tsibble)
## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union
library(xts)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following object is masked from 'package:tsibble':
## 
##     index
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################
## 
## Attaching package: 'xts'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
theme_set(theme_minimal())
options(scipen = 6)

Read in Dataset

Next, we’ll read in the dataset, which is saved in a CSV file.

covid <- read_delim("./COVID_weekly_cases_deaths_region5.csv", delim = ",")
## Rows: 37867 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): end_of_week, jurisdiction, age_group, sex, race_ethnicity_combined
## dbl (4): case_count_suppressed, death_count_suppressed, case_crude_rate_supp...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The dataset contains 37,867 rows of data with 9 columns per row.

Here’s the full list of column names.

colnames(covid)
## [1] "end_of_week"                         
## [2] "jurisdiction"                        
## [3] "age_group"                           
## [4] "sex"                                 
## [5] "race_ethnicity_combined"             
## [6] "case_count_suppressed"               
## [7] "death_count_suppressed"              
## [8] "case_crude_rate_suppressed_per_100k" 
## [9] "death_crude_rate_suppressed_per_100k"

We’ll need to convert end_of_week from a character format to a date format, which will be necessary for time-based analyses.

# convert date from chr format to date format
covid$end_of_week <- as.Date(covid$end_of_week, format="%m/%d/%y")

We’ll add a new column to indicate the order of the age groups, and then arrange all the rows in nested order by week, age, sex, and race/ethnicity.

# add new column to order age groups correctly (using integers L)
covid <- covid |>
  mutate(age_order = case_when(
    age_group == "Overall" ~ 0L,
    age_group == "0 - 4 Years" ~ 1L,
    age_group == "5 - 11 Years" ~ 2L,
    age_group == "12 - 15 Years" ~ 3L,
    age_group == "16 - 17 Years" ~ 4L,
    age_group == "18 - 29 Years" ~ 5L,
    age_group == "30 - 39 Years" ~ 6L,
    age_group == "40 - 49 Years" ~ 7L,
    age_group == "50 - 64 Years" ~ 8L,
    age_group == "65 - 74 Years" ~ 9L,
    age_group == "75+ Years" ~ 10L),
    .after = age_group) |>
  arrange(end_of_week, age_order, sex, race_ethnicity_combined)

Let’s create another dataframe with just the overall weekly data (i.e., one summary row per week), which will be used for some of the analyses.

weekly_data <- covid |>
  filter(age_group == "Overall" & sex == "Overall" & race_ethnicity_combined == "Overall")

Demographic Groups

Each row in the dataset contains aggregated data for reported COVID-19 cases and deaths from a specific week for a specific demographic group. Each demographic group is based on a combination of age group, sex at birth, and race/ethnicity group:

  • Age Group has 11 possible values in this dataset:
    • 0-4 Years
    • 5-11 Years
    • 12-15 Years
    • 16-17 Years
    • 18-29 Years
    • 30-39 Years
    • 40-49 Years
    • 50-64 Years
    • 65-74 Years
    • 75+ Years
    • Overall (all ages combined)
  • Sex At Birth has 3 possible values in this dataset:
    • Female
    • Male
    • Overall (both sexes combined)
  • Race/Ethnicity Group has 6 possible values in this dataset:
    • AI/AN, NH, = American Indian / Alaska Native, Non-Hispanic
    • Asian/PI, NH = Asian / Pacific Islander, Non-Hispanic
    • Black, NH = Black, Non-Hispanic
    • Hispanic = Hispanic (could be any race or combination of races)
    • White, NH = White, Non-Hispanic
    • Overall (all race/ethnicity groups combined)

Based on the number of possible values for the three demographic variables, the total number of different demographic group combinations is:

\[ 11 \times 3 \times 6 = 198 \text{ demographic combinations} \]

Therefore, the dataset should have 198 rows of data for each week: i.e., one row for each possible demographic group for each week.

The dataset has data for 194 continuous weeks from the week ending March 7, 2020 through the week ending November 18, 2023. Therefore, the total dataset should contain ≈ 38K rows of data.

\[ 194 \text{ weeks} \times 198 \text{ demographic rows per week} = 38,412 \text{ rows} \]

However, it turns out there are implicit missing rows, so not every week has full set of 198 rows. A total of 545 rows are missing out of 38,412 expected rows. These implicit missing rows were explored as part of a previous data dive on this dataset.

Notes on Race/Ethnicity Groups

Although data for an individual’s race and ethnicity are typically collected as separate demographic variables in public health records, the CDC typically combines these fields for data analysis purposes by categorizing individuals as belong to a particular race/ethnicity group.

  • If an individual identifies as Hispanic/Latino for ethnicity, the CDC categorizes that individual’s race/ethnicity group as Hispanic, regardless of which race(s) that individual identifies as.

  • The CDC then defines the other race/ethnicity groups as individuals who identify as non-Hispanic and as a single race.

In this particular dataset, the CDC has not included a separate race/ethnicity group for individuals who identify as Non-Hispanic and multiple races. Instead, the CDC has included data for these individuals as part of the overall weekly counts for race/ethnicity.

Individuals with Missing Demographic Data

Furthermore, data for individuals with unknown or missing age, sex, or race/ethnicity are included as part of the overall weekly data for the missing demographic variable.


Cases and Deaths Over Time

Let’s begin the exploration of this dataset by creating a line plot of the weekly COVID-19 case counts over time for Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, and Wisconsin).

# plot weekly cases over time
weekly_data |>
  ggplot() +
  geom_line(mapping = aes(x = end_of_week, y = case_count_suppressed), color = 'darkturquoise', na.rm = TRUE) +
  labs(title = "COVID-19 Weekly Cases (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "", y = "Case Count") +
  theme(legend.title=element_blank()) +
  theme(plot.subtitle = element_text(color = "darkgray")) +
  theme(legend.position = "bottom")

The plot of cases shows a pattern of peaks and valleys across the weeks, with the two largest peaks in cases occurring from late 2020 into early 2021 and then from late 2021 into early 2022.

Next let’s examine a line plot of the weekly COVID-19 death counts over time for this region.

# plot weekly deaths over time
weekly_data |>
  ggplot() +
  geom_line(mapping = aes(x = end_of_week, y = death_count_suppressed), color = 'brown1', na.rm = TRUE) +
  labs(title = "COVID-19 Weekly Deaths (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "", y = "Death Count") +
  theme(legend.title=element_blank()) +
  theme(plot.subtitle = element_text(color = "darkgray")) +
  theme(legend.position = "bottom")

The plot of deaths also shows a pattern of peaks and valleys across the weeks, with the three largest peaks in deaths occurring in spring 2020, late 2020 into early 2021, and from late 2021 into early 2022.

Comparing Patterns of Cases and Deaths Over Time

At first glance, it might seem that the patterns in cases and deaths are different. For example, the plot of cases show two large peaks while the plot of deaths has 3 large peaks.

It’s important to note the difference in magnitude in the scales of the vertical axes of these two plots. Whereas the plot for death counts is scaled in thousands, the plot for case counts is scaled in hundreds of thousands.

Because of this large difference in magnitude in the counts for cases and deaths, it’s not very practical to visualize them together on one plot using the same scale for their raw counts.

However, a log transformation can be used to produce values for the counts that could be visualized together in one plot using the same scale.

To do this, we’ll need to transform each weekly count using \(log(count)\) where \(log\) represents the natural logarithm.

# use log transform to display cases and deaths over time on same plot
weekly_data |>
  ggplot() +
  geom_line(mapping = aes(x = end_of_week, y = log(case_count_suppressed),
                          color = 'log(cases)'), na.rm = TRUE) +
  geom_line(mapping = aes(x = end_of_week, y = log(death_count_suppressed),
                          color = 'log(deaths)'), na.rm = TRUE) +
  scale_color_manual(name = '',
                    breaks = c('log(cases)', 'log(deaths)'),
                    values = c('log(cases)' = 'darkturquoise', 'log(deaths)' = 'brown1')) +
  labs(title = "COVID-19 Weekly log(Cases) and log(Deaths) (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "", y = "log(count)") +
  theme(plot.subtitle = element_text(color = "darkgray")) +
  theme(legend.position = "bottom")

This plot of the log transformed counts of cases and deaths shows that their patterns do closely mirror each other over time, as one would might expect: as the number of cases rises or falls over time, so too does the number of deaths.

This reinforces the simple idea that the best way to prevent (or reduce) deaths from COVID-19 is to prevent (or reduce) cases of COVID-19. Reducing community transmission of COVID-19 is key to preventing deaths among the most vulnerable within the community.

Seasons within Rolling Averages of Cases and Deaths

We can utilize time series modeling to generate plots of rolling averages to help detect potential seasons (i.e., repeating periodic patterns) within the data over time.

We’ll need dataframes with just two columns: our explanatory time-based variable and a response variable.

We’ll create one dataframe with case counts as the response variable and another dataframe with death counts as the response variable.

# create dataframes with date (end of week) and counts (cases or deaths)
weekly_cases <- weekly_data |>
  select(end_of_week, case_count = case_count_suppressed) |>
  arrange(end_of_week)

weekly_deaths <- weekly_data |>
  select(end_of_week, death_count = death_count_suppressed) |>
  arrange(end_of_week)

Next, we’ll create a tsibble object from each of these time-based dataframes.

# tsibble objects for weekly cases and weekly deaths
weekly_cases_ts <- as_tsibble(weekly_cases, index = end_of_week) |>
   fill_gaps()

weekly_deaths_ts <- as_tsibble(weekly_deaths, index = end_of_week) |>
  fill_gaps()

Finally, we’ll use the tsibble objects to create “xts” time series dataframes, which we can use in our plots.

# xts dataframe for weekly cases
weekly_cases_xts <- xts(x = weekly_cases_ts$case_count, 
                        order.by = weekly_cases_ts$end_of_week)

weekly_cases_xts <- setNames(weekly_cases_xts, "cases")

# xts dataframe for weekly deaths
weekly_deaths_xts <- xts(x = weekly_deaths_ts$death_count, 
                         order.by = weekly_deaths_ts$end_of_week)

weekly_deaths_xts <- setNames(weekly_deaths_xts, "deaths")

Let’s visualize the count data using a 28-day rolling average (i.e., 4 weeks ≈ one month), which should smooth out some of the variation seen in the plots of the raw data.

We’ll also include a LOESS (Locally Estimated Scatterplot Smoothing) curve to help identify any seasons (i.e., repeating periodic patterns) within the time series data.

First, let’s examine the plot for the 28-day rolling average of weekly case counts.

# plot 4 week (28 days, approx 1 month) rolling average of weekly case counts
weekly_cases_xts %>%
  rollapply(width = 4, \(x) mean(x), fill = FALSE) %>%
  ggplot(mapping = aes(x = Index, y = cases)) +
  geom_line() +
  geom_smooth(method = 'loess', span = 0.25, 
              color = 'darkturquoise', se = FALSE, na.rm = TRUE) +
  geom_vline(xintercept = as.Date('2020-06-01'), linetype="dashed", color = 'blue') +
  geom_vline(xintercept = as.Date('2021-06-01'), linetype="dashed", color = 'blue') +
  geom_vline(xintercept = as.Date('2022-06-01'), linetype="dashed", color = 'blue') +
  geom_vline(xintercept = as.Date('2023-06-01'), linetype="dashed", color = 'blue') +
  ylim(0, 600000) +
  labs(title = "COVID-19 Weekly Cases (Mar 2020 - Nov 2023)",
       subtitle = "28-Day Rolling Average for Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "", y = "Cases") +
  theme(plot.subtitle = element_text(colour = "darkgray"))
## `geom_smooth()` using formula = 'y ~ x'

Using a span value of 0.25 produces a LOESS curve that appears to have a season of approximately 12 months (≈52 weeks) from peak to peak (or from trough to trough). Dotted vertical lines (spaced 12 months apart) have been added to the plot to help indicate the duration of a 12-month seasonality.

For example, the period from June 2020 to June 2021 could represent the length of one season (trough to trough), or the period from December 2020 to December 2021 could represent the length of one season (peak to peak).

However, the seasons in the plot above vary in amplitude: the second season has a higher peak (Jan 2022), and then the third season has a reduced peak (Dec 2022).

  • The second peak coincides with the appearance of the Omicron variant of COVID-19 which was more infectious (and for which variant-specific vaccines had not yet been developed and made available).

  • The reduced third peak (Dec 2022) could be explained by the fact that many more people would have likely acquired some level of immunity either through vaccinations or previous infections.

It does seems logical that a season would exist for this data and that it would follow a yearly cycle: COVID-19 is an infectious disease spread from person-to-person, and people’s activities tend to follow yearly patterns. We would expect that COVID-19 cases would increase with yearly activities that bring people together in close contact, especially indoor activities (e.g., start of school year, end-of-year holiday gatherings, etc.).

Autocorrelation is another method that can be used to detect seasons in time series data by calculating the amount that a variable correlates with itself at different lags (i.e., past time periods). For this dataset, each lag represents one week.

If a predictable season exists within the data, an ACF (autocorrelation function) plot would help identify it.

# generate ACF plot for weekly cases to help identify seasons
acf(weekly_cases_ts, lag.max = 100, ci = 0.95, na.action = na.exclude, main = "ACF for Weekly Case Counts")

The ACF plot shows the lowest point of the first trough occurs at about lag 29 (29 weeks, ≈6 months) and then the highest point of the next peak occurs at about lag 58 (58 weeks, ≈12 months), indicating a season of about 1 year.

Next let’s examine the plot of the the 28-day rolling average of weekly death counts.

# plot 4 week (28 days, approx 1 month) rolling average of weekly death counts
weekly_deaths_xts %>%
  rollapply(width = 4, \(x) mean(x), fill = FALSE) %>%
  ggplot(mapping = aes(x = Index, y = deaths)) +
  geom_line() +
  geom_smooth(method = 'loess', span = 0.25, 
              color = 'brown1', se = FALSE, na.rm = TRUE) +
  geom_vline(xintercept = as.Date('2020-06-01'), linetype="dashed", color = 'blue') +
  geom_vline(xintercept = as.Date('2021-06-01'), linetype="dashed", color = 'blue') +
  geom_vline(xintercept = as.Date('2022-06-01'), linetype="dashed", color = 'blue') +
  geom_vline(xintercept = as.Date('2023-06-01'), linetype="dashed", color = 'blue') +
  ylim(0, 5000) +
  labs(title = "COVID-19 Weekly Deaths (Mar 2020 - Nov 2023)",
       subtitle = "28-Day Rolling Average for Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "", y = "Deaths") +
  theme(plot.subtitle = element_text(colour = "darkgray"))
## `geom_smooth()` using formula = 'y ~ x'

Again, using a span value of 0.25 produces a LOESS curve that appears to have a season of a season of approximately 12 months (≈52 weeks) from peak to peak (or from trough to trough). Dashed vertical lines have been placed at the same time points as the previous plot to to help indicate the duration of a 12-month seasonality.

This similarity in seasonality between cases and deaths makes sense as we already know from the log transform plot that the pattern of death counts closely mirrors the pattern of case counts.

Here’s the ACF (autocorrelation function) plot for weekly death counts.

# generate ACF plot for weekly deaths to help identify seasons
acf(weekly_deaths_ts, lag.max = 100, ci = 0.95, na.action = na.exclude, main = "ACF for Weekly Death Counts")

The ACF plot shows the lowest point of the first trough occurs at about lag 25 (25 weeks, ≈6 months) and then the highest point of the next peak occurs at about lag 54 (54 weeks, ≈12 months), again indicating a season of about 1 year.

Although the overall number of cases and deaths from COVID-19 have decreased from the levels seen in 2020 through 2022, we should anticipate that while COVID-19 continues to circulate in the population (which could be indefinitely if the virus becomes endemic, as many experts expect), it will likely follow a yearly seasonal pattern of peaks and valleys in cases and deaths. By timing prevention efforts according, we may be able to reduce the impact of the virus during the projected peaks of each yearly season.

Ratio of Deaths to Cases Over Time

Finally, another way to examine the pattern between cases and deaths over time is consider the ratio between cases and deaths. We could calculate a ratio as:

\[ \text{Ratio = Deaths / Cases} \]

However, it might help to translate that ratio into something more intuitive for people to understand. For example, calculating a ratio (or rate) that tells us how many deaths occur per every 100 cases.

This would be akin to a percentage: if 5 deaths occur for every 100 cases, that means about 5% of cases result in death.

This ratio, or rate, would be calculated as follows:

\[ \text{Death Rate (per 100 cases)} = \frac{Deaths}{Cases} \times 100 \]

Let’s add a new column that will calculate the deaths per 100 cases for each week’s data.

# add calculated column for death rate per 100 cases
weekly_data <- weekly_data |>
  mutate(deaths_per_100_cases = death_count_suppressed / case_count_suppressed * 100)

Let’s use this new column to plot a rolling average of the deaths per 100 cases over time. First, we’ll prepare dataframes with our explanatory time-based variable and the case-to-death ratio as the response variable.

# create dataframe with date (end of week) and death rate per 100 cases
weekly_ratio <- weekly_data |>
  select(end_of_week, deaths_per_100_cases) |>
  arrange(end_of_week)

# tsibble object
weekly_ratio_ts <- as_tsibble(weekly_ratio, index = end_of_week) |>
   fill_gaps()

# xts dataframe
weekly_ratio_xts <- xts(x = weekly_ratio_ts$deaths_per_100_cases, 
                        order.by = weekly_ratio_ts$end_of_week)

weekly_ratio_xts <- setNames(weekly_ratio_xts, "ratio")

Now we can use this dataframe to plot a 28-day rolling average of the case-to-death ratio, which should smooth out some of the variation seen that would likely occur in plotting the raw data.

# plot 4 week (28 days, approx 1 month) rolling average of deaths per 100 cases
weekly_ratio_xts %>%
  rollapply(width = 4, \(x) mean(x), fill = FALSE) %>%
  ggplot(mapping = aes(x = Index, y = ratio)) +
  geom_line(color = 'brown1') +
  scale_y_continuous(breaks = seq(0, 10, by = 2)) +
  labs(title = "COVID-19 Weekly Deaths per 100 Cases (Mar 2020 - Nov 2023)",
       subtitle = "28-Day Rolling Average for Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "", y = "Deaths per 100 Cases") +
  theme(plot.subtitle = element_text(colour = "darkgray"))

Over time, the deaths per 100 cases rate has decreased. At the beginning of the pandemic, the rate peaked at over 10 deaths per 100 cases (i.e., about 10% of cases resulted in death). While that rate has decreased over time, there may be a lower limit on how much it can and will decrease any further.

While this decrease over time is good news, it reinforces the notion that there will always be some number of deaths within a set of cases, so the best way to reduce the number of deaths is to reduce the total number of cases within the community.

Cumulative Deaths Over Time

In viewing and analyzing COVID-19 data (especially several years after the start of the pandemic), it can be easy to become somewhat detached from what the data and visualizations represent, which is the fact that hundreds of thousands of people in this region (and millions across the US) became ill and died from a new respiratory virus.

Here is a plot of the weekly cumulative COVID-19 death counts over time in the six states comprising Region 5 of the United States.

# calculate total deaths
total_deaths <- sum(weekly_data$death_count_suppressed)

# plot cumulative deaths over time
weekly_data |>
  ggplot() +
  geom_line(mapping = aes(x = end_of_week, y = cumsum(death_count_suppressed)), color = 'brown1', na.rm = TRUE) +
  ylim(0, 200000) +
  annotate("text", x = as.Date('2023-11-01'), y = 190000, label = total_deaths, color = 'brown1') +
  labs(title = "COVID-19 Cumulative Deaths (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "", y = "Death Count") +
  theme(legend.title=element_blank()) +
  theme(plot.subtitle = element_text(color = "darkgray")) +
  theme(legend.position = "bottom")

Steeper portions of the plot indicate peaks in deaths, whereas flatter portions of the plot indicate fewer deaths occurring. The cumulative count is staggering.

Over 180,000 lives were lost in just 6 states in less than 4 years (with about 90% of those deaths occurring within the first 2 years). It is difficult to truly understand the grief and loss this represents to so many families and their friends.

In addition, research studies have indicated that official COVID-19 death counts (such as the original dataset used in this project) have likely undercounted the actual number of deaths due to COVID-19 (see this January 2023 article and this February 2024 article).


Cases and Deaths by Sex at Birth

Let’s examine patterns and disparities that may exist in COVID-19 cases and deaths by sex at birth.

Percent of Cases vs. Percent of Deaths by Sex at Birth

Let’s use the weekly counts of cases and deaths to create two new columns that calculate the percent of cases and the percent of deaths for each week by sex at birth.

# for each week, calculate percent of cases and percent of deaths for each sex
covid_sex_percent <- covid |>
  filter(race_ethnicity_combined == "Overall" & age_group == "Overall" & sex != "Overall" ) |>
  group_by(end_of_week) |>
  mutate(percent_weekly_cases = round(case_count_suppressed / sum(case_count_suppressed, na.rm = TRUE), 3)) |>
  mutate(percent_weekly_deaths = round(death_count_suppressed / sum(death_count_suppressed, na.rm = TRUE), 3))

Let’s create a scatter plot of the weekly values for percent of cases vs. percent of deaths by sex at birth.

# scatterplot of percent of weekly cases vs.  percent of weekly deaths by sex
covid_sex_percent |>
  ggplot() +
  geom_point(mapping = aes(x = percent_weekly_cases, y = percent_weekly_deaths, color = sex), na.rm = TRUE) +
  labs(x = "Proportion of Weekly Cases",
       y = "Proportion of Weekly Deaths",
       title = "COVID-19 Weekly % Cases vs. Weekly % Deaths by Sex at Birth",
       subtitle = "Region 5 (IL, IN, MI, MN, OH, WI) from Mar 2020 - Nov 2023") +
  theme_minimal() +
  theme(plot.subtitle = element_text(colour = "darkgray"))

Each week in the dataset is represented by two points in the scatterplot: one for females and one for males. For each week’s set of two points, the values for females and males are related, in that they total to 100%.

So if females represented 52% of the cases in a given week, then we know that males must have represented 48% of the cases for that week. Thus, there is a symmetry to be expected in the plot as each “female” point is related to a corresponding “male” point from the same week (i.e., these points representing the same week will mirror each other relative to the 50% center line).

If there were no disparities in cases and deaths among sex at birth, we would expect the entire group of points for females and males to overlap closely with each representing on average about 50% of the weekly cases and about 50% of the weekly deaths. There would likely be variation from week to week, but in general, we’d expect the points to be clustered within the center of the plot, with points for females and males interspersed with each other.

Instead, there is nearly complete separation of the data points for females versus males:

  • For almost every week in the dataset (190 weeks out of 194 total weeks), females accounted for more of the weekly cases (i.e., 50% or higher). This is demonstrated by nearly all of the points for female being to the right of the vertical center (50% of cases) of the plot.

  • However, males tended to account for more of the weekly deaths, despite accounting for fewer of the weekly cases. This is demonstrated by many of the points for male being above the horizontal center (50% of deaths) of the plot.

Here is a table with the average percent for cases and deaths by sex at birth, confirming these differences seen in the scatterplot.

# calculate average weekly percent of cases and deaths by sex at birth
covid_sex_percent |>
  group_by(sex) |>
  summarise(
    avg_perc_cases = round(mean(percent_weekly_cases, na.rm = TRUE), 2),
    avg_perc_deaths = round(mean(percent_weekly_deaths, na.rm = TRUE), 2)
  )
## # A tibble: 2 × 3
##   sex    avg_perc_cases avg_perc_deaths
##   <chr>           <dbl>           <dbl>
## 1 Female           0.55            0.47
## 2 Male             0.45            0.53

Comparing Rates Over Time by Sex at Birth

Let’s explore these differences in sex at birth by investigating their cases and deaths over time.

When comparing different groups within a population, it is better to compare their rates rather than their counts. Rates represent counts adjusted for the group size within the population, which allows more fair comparisons among groups that might vary in size.

In public health, a rate is often calculated as a count per 100K population.

\[ \text{Rate (per 100K)} = \text{Count} \times \frac{100,000}{\text{Group Size}} \]

For example, if a group had a count of 500 cases in one week and the group’s size is 200,000 individuals, the group’s case rate for that week was 250 cases per 100K population.

Rates can be calculated for the overall population and for specific groups within the overall population.

The dataset already includes calculated rates for the overall population and for each demographic group. These rates were presumably calculated using group size estimates based on US Census data projections.

Let’s examine the weekly COVID-19 case rates over time by sex at birth.

# plot case rates over time by sex at birth
covid |>
  filter(age_group == "Overall" & race_ethnicity_combined == "Overall" & sex != "Overall") |>
  group_by(sex) |>
  ggplot() +
  geom_line(mapping = aes(x = end_of_week, y = case_crude_rate_suppressed_per_100k,
                          group = sex, color = sex), na.rm = TRUE) +
  labs(title = "COVID-19 Weekly Case Rates by Sex at Birth (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "", y = "Case Rate") +
  theme(legend.title=element_blank()) +
  theme(plot.subtitle = element_text(color = "darkgray")) +
  theme(legend.position = "bottom")

The plot shows that females tend to have a higher case rate than males, especially during peaks. This is consistent with the pattern seen earlier in the scatter plot.

This could be a result of social factors that make it more likely for females to be in close contact with others due to their professions and/or other activities; thus, they would be more likely to become infected with COVID-19. For example, education and healthcare are two examples of professional fields that involve close contact with people, and both fields have predominantly female workers.

Alternatively, this could also be partially a result of under-testing of males. If males were less likely to be tested for COVID-19, their cases would be under-reported.

Next let’s examine the weekly COVID-19 death rates over time by sex at birth.

# plot death rates over time by sex at birth
covid |>
  filter(age_group == "Overall" & race_ethnicity_combined == "Overall" & sex != "Overall") |>
  group_by(sex) |>
  ggplot() +
  geom_line(mapping = aes(x = end_of_week, y = death_crude_rate_suppressed_per_100k,
                          group = sex, color = sex), na.rm = TRUE) +
  labs(title = "COVID-19 Weekly Death Rates by Sex at Birth (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "", y = "Death Rate") +
  theme(legend.title=element_blank()) +
  theme(plot.subtitle = element_text(color = "darkgray")) +
  theme(legend.position = "bottom")

In this plot, we see the reverse pattern: males tend to have a higher death rate than females, despite the fact that males tend have a lower case rate. Again, this is consistent with the pattern seen earlier in the scatter plot.

This could be the result of underlying health conditions. Males tend to have more chronic health conditions than females (article), which would place them at greater risk of death from COVID-19.

Hypothesis Test for Differences by Sex at Birth

Let’s perform hypothesis tests for the observed differences in case rate and death rate by sex at birth.

For case rate, the null hypothesis to be tested is:

\[ H_0: \text{The average weekly case rate is equal for both sexes.} \]

The alternative hypothesis is:

\[ H_a: \text{The average weekly case rate is not equal for both sexes.} \]

Through hypothesis testing, we’ll determine whether the null hypothesis should be rejected. If we were to reject the null hypothesis as false, we would logically assume the alternative hypothesis should be accepted instead.

# weekly case data that excludes the overall summary rows & rows with missing case rates
weekly_case_data <- covid |>
  filter(age_group != "Overall" & sex != "Overall" & race_ethnicity_combined != "Overall") |>
  filter(!is.na(case_crude_rate_suppressed_per_100k))

Because we’re comparing the means of two independent samples, we’ll use a two-tailed t-test to calculate the p-value for our observed difference in means.

# Welch's t-test (compare means of two independent samples)
t.test(case_crude_rate_suppressed_per_100k ~ sex, data = weekly_case_data, alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  case_crude_rate_suppressed_per_100k by sex
## t = 7.3441, df = 16849, p-value = 2.165e-13
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  12.82750 22.16743
## sample estimates:
## mean in group Female   mean in group Male 
##             137.1420             119.6445

The calculated p-value from the t-test is extremely low, which would lead us to reject the null hypothesis as false and assume that the alternative hypothesis is true: the average weekly case rates by sex are not equal.

As seen in the previous line plot, the weekly case rate tends to be higher for females than males. The t-test also determines that females have a mean case rate that is between 12.8 to 22.2 higher (95% Confidence Interval) compared to the mean case rate for males.

Next, we’ll perform a hypothesis test for differences in death rate by sex at birth.

For death rate, the null hypothesis to be tested is:

\[ H_0: \text{The average weekly death rate is equal for both sexes.} \]

The alternative hypothesis is:

\[ H_a: \text{The average weekly death rate is not equal for both sexes.} \]

Again, we’ll determine whether the null hypothesis should be rejected and whether the alternative hypothesis should be assumed to be accepted.

# weekly death data that excludes the overall summary rows & rows with missing death rates
weekly_death_data <- covid |>
  filter(age_group != "Overall" & sex != "Overall" & race_ethnicity_combined != "Overall") |>
  filter(!is.na(death_crude_rate_suppressed_per_100k))

Here are the results from the two-tailed t-test.

# Welch's t-test (compare means of two independent samples)
t.test(death_crude_rate_suppressed_per_100k ~ sex, data = weekly_death_data, alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  death_crude_rate_suppressed_per_100k by sex
## t = -5.0813, df = 2500.2, p-value = 0.0000004025
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -4.799771 -2.126775
## sample estimates:
## mean in group Female   mean in group Male 
##             9.319622            12.782895

Again, the calculated p-value from the t-test is extremely low, which would lead us to reject the null hypothesis as false and assume that the alternative hypothesis is true: the average weekly death rates by sex are not equal.

As seen in the previous line plot, the weekly death rate tends to be higher for males than females. The t-test also determines that males have a mean death rate that is between 2.1 to 4.8 higher (95% Confidence Interval) compared to the mean death rate for females.

These results may indicate that prevention efforts could be even more effective if they were tailored slightly differently for each sex. For example, prevention for females might emphasize masking as a way to reduce risk of infection (to reduce cases), while prevention for males might also emphasize testing and early treatment (to reduce deaths).

Of course, all prevention methods (such as masking, testing, treatment, etc.) are important for all individuals regardless of sex, but subtle differences in how prevention is “marketed” to females versus males might yield increased effectiveness.


Cases and Deaths by Age Group

Next, let’s begin to examine patterns and disparities that may exist in COVID-19 cases and deaths by age group.

Let’s first compare the total death counts by age group.

# plot total deaths by age group
# fct_reorder() reorders factor levels by sorting by another variable
# in this case, age_group will be sorted by age_order
covid |>
  filter(age_group != "Overall" & sex == "Overall" & race_ethnicity_combined == "Overall" & !is.na(death_count_suppressed)) |>
  group_by(age_group) |>
  ggplot() +
  geom_bar(mapping = aes(x = fct_reorder(age_group, age_order),
                         y = death_count_suppressed),
           stat = 'identity', fill = 'brown1') + 
  coord_flip() +
  labs(title = "COVID-19 Weekly Deaths by Age Group (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "", y = "Deaths") +
  theme_minimal() +
  theme(plot.subtitle = element_text(colour = "darkgray"))

The total number of deaths greatly increases with age.

There were a total of 7 deaths for ages 0-4 years, but this value is too small to be visible in the plot. Note that there were no deaths for the age groups between 5-17 years (which is why those age groups are missing from the plot above).

The oldest age group of 75+ years appears to account for more than half the total deaths. Overall, the three oldest age groups (age 50+ years) account for the overwhelming majority of deaths.

Unfortunately, this makes sense as older people are more likely to have underlying health conditions that may make them more susceptible to death from COVID-19 infection. Of course, there are other factors that likely contribute to this pattern (e.g., retirees may be more likely to visit multiple public places, seniors may be more likely to live in a communal setting such as assisted living, etc.).

Comparing Case Rates Over Time by Age Group

Let’s examine the case rates and death rates for the three oldest age groups compared to the younger age groups.

First, we’ll subdivide the weekly data based on under age 50 years vs. age 50 years and older.

# data for age groups younger than 50 years
age_under_50_data <- covid |>
  filter(age_order > 0L & age_order < 8L & sex == "Overall" & race_ethnicity_combined == "Overall") |>
  group_by(age_group)

# data for age groups 50 years or older
age_over_50_data <- covid |>
  filter(age_order >= 8L & sex == "Overall" & race_ethnicity_combined == "Overall") |>
  group_by(age_group)

We’ll plot each of the age groups within both subsets, but specifically highlight the individual age groups that are age 50 years and older.

Here is a plot of weekly COVID-19 case rates over time by age groups.

# plot of case rates over time by age group
ggplot() +
geom_line(data = age_under_50_data,
          mapping = aes(x = end_of_week, y = case_crude_rate_suppressed_per_100k, group = age_group), color = 'lightgray', na.rm = TRUE) +
geom_line(data = age_over_50_data,
          mapping = aes(x = end_of_week, y = case_crude_rate_suppressed_per_100k, group = age_group, color = age_group), na.rm = TRUE) +
scale_color_manual(name = '',
                    breaks = c('50 - 64 Years', '65 - 74 Years', '75+ Years'),
                   values = c('50 - 64 Years' = 'deepskyblue3', '65 - 74 Years' = 'blueviolet', '75+ Years' = 'brown1')) +
labs(title = "COVID-19 Weekly Case Rates (Mar 2020 - Nov 2023)",
     subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
     x = "", y = "Case Rate") +
theme(legend.title=element_blank()) +
theme(plot.subtitle = element_text(color = "darkgray")) +
theme(legend.position = "bottom")

The plot includes a line for the case rate of each age group, though it specifically identifies only the three oldest age groups (i.e., the younger age groups are represented by unlabeled gray lines).

The pattern seen here is more unclear, though there are several notable changes over time:

  • In spring 2020 (the start of the pandemic), the case rates for the older age groups are among the highest.

  • From summer 2020 through spring 2021 (which includes the first large peak in cases), the case rates for the older age groups are in the middle (with some of the younger age groups having higher case rates and some of the younger age groups having lower case rates).

  • From summer 2021 through spring 2022 (which includes the second and largest peak in cases), the case rates for the older age groups are among the lowest.

  • From summer 2022 through fall 2023 (when dataset ends), the case rates for the older age groups are once again among the highest. This is the longest and most recent time frame within this dataset.

This changing pattern likely reflects the changing nature of this infectious disease, as well as our response to it (e.g., changes in mask mandates, changes in lockdown protocols, appearance of new viral variants, introduction of new vaccine variants, increased availability of at-home testing, etc.).

If anything, this patterns indicates the importance of collective action, which is the cornerstone of public health measures: actions by many individuals help protect not only themselves but also the community at large. At any given time, some groups may be more at risk than others, but it is through collective action that our collective risk can be reduced for the benefit of all.

Comparing Death Rates Over Time by Age Group

Here is a plot of weekly COVID-19 death rates over time by age groups.

# plot of death rates over time by age group
ggplot() +
geom_line(data = age_under_50_data,
          mapping = aes(x = end_of_week, y = death_crude_rate_suppressed_per_100k, group = age_group), color = 'lightgray', na.rm = TRUE) +
geom_line(data = age_over_50_data,
          mapping = aes(x = end_of_week, y = death_crude_rate_suppressed_per_100k, group = age_group, color = age_group), na.rm = TRUE) +
scale_color_manual(name = '',
                    breaks = c('50 - 64 Years', '65 - 74 Years', '75+ Years'),
                   values = c('50 - 64 Years' = 'deepskyblue3', '65 - 74 Years' = 'blueviolet', '75+ Years' = 'brown1')) +
labs(title = "COVID-19 Weekly Death Rates (Mar 2020 - Nov 2023)",
     subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
     x = "", y = "Death Rate") +
theme(legend.title=element_blank()) +
theme(plot.subtitle = element_text(color = "darkgray")) +
theme(legend.position = "bottom")

The plot reveals how much higher the death rates were for the older age groups, especially the oldest group of age 75+ years. The younger groups are represented by gray lines, but their rates are so much lower that their lines are barely distinguishable from the zero axis.

As another summary, here are the total death counts and percent of total deaths for the three oldest age groups compared to all other age groups (i.e., under 50 years).

# create dataframes for various age ranges/groups
under50_data <- covid |>
  filter(sex == "Overall" & age_order > 0L & age_order < 8L & race_ethnicity_combined == "Overall")

age50_64_data <- covid |>
  filter(sex == "Overall" & age_group == '50 - 64 Years' & race_ethnicity_combined == "Overall")

age65_74_data <- covid |>
  filter(sex == "Overall" & age_group == '65 - 74 Years' & race_ethnicity_combined == "Overall")

age75_plus_data <- covid |>
  filter(sex == "Overall" & age_group == '75+ Years' & race_ethnicity_combined == "Overall")

# calculate total deaths overall and for each age range/group
total_deaths <- sum(weekly_data$death_count_suppressed, na.rm = TRUE)

under50_deaths <- sum(under50_data$death_count_suppressed, na.rm = TRUE)

age50_64_deaths <- sum(age50_64_data$death_count_suppressed, na.rm = TRUE)

age65_74_deaths <- sum(age65_74_data$death_count_suppressed, na.rm = TRUE)

age75_plus_deaths <- sum(age75_plus_data$death_count_suppressed, na.rm = TRUE)

# create table to show percent of deaths by age range/group
age_range <- c("75+ Years", "65 - 74 Years", "50 - 64 Years", " Under 50 Years")
death_total <- c(age75_plus_deaths, age65_74_deaths, age50_64_deaths, under50_deaths)
perc_deaths <- c(round(age75_plus_deaths / total_deaths * 100, digits = 1),
                 round(age65_74_deaths / total_deaths * 100, digits = 1),
                 round(age50_64_deaths / total_deaths * 100, digits = 1),
                 round(under50_deaths / total_deaths * 100, digits = 1))
deaths_age_df <- data.frame(age_range, death_total, perc_deaths)

deaths_age_df
##         age_range death_total perc_deaths
## 1       75+ Years      104685        57.5
## 2   65 - 74 Years       39601        21.7
## 3   50 - 64 Years       28509        15.7
## 4  Under 50 Years        8441         4.6

Individuals age 75+ years accounted for ≈58% of all deaths, and individuals between the ages of 50 - 74 years accounted for ≈37% of all deaths. Individuals younger than 50 years accounted for less than 5% of all deaths.

Prevention efforts should be focused on reducing overall community transmission (fewer cases = fewer deaths). However, given the greatly increased risk of death for older individuals, it may be important to focus certain prevention efforts on older adults and those who specifically interact with them (such as family, friends, co-workers, staff at healthcare facilities, staff at assisted living, etc.).


Cases and Deaths by Race/Ethnicity

Finally, let’s examine patterns and disparities that may exist in COVID-19 cases and deaths by race/ethnicity group.

First, we’ll generate side-by-side plots of the weekly case rates over time for each race/ethnicity group.

# faceted plots of case rates over time by race/ethnicity group
covid |>
  filter(sex == "Overall" & age_group == "Overall" & race_ethnicity_combined != "Overall" ) |>
  ggplot() +
  geom_line(mapping = aes(x = end_of_week, y = case_crude_rate_suppressed_per_100k), color = 'red', na.rm = TRUE) +
  labs(title = "COVID-19 Weekly Case Rates by Race/Ethnicity (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "", y = "Case Rate") +
  theme_hc() + 
  theme(plot.subtitle = element_text(color = "darkgray")) +
  theme(panel.spacing = unit(0.5, "cm", data = NULL)) +
  facet_wrap(~race_ethnicity_combined, ncol = 2)

Overall, the general pattern in the case rates among the race/ethnicity groups seem relatively similar, though there appears to be differences in amplitude (i.e., higher or lower case rates) when comparing specific points in time.

Next, we’ll generate side-by-side plots of the weekly death rates over time for each race/ethnicity group.

# faceted plots of death rates over time by race/ethnicity group
covid |>
  filter(sex == "Overall" & age_group == "Overall" & race_ethnicity_combined != "Overall" ) |>
  ggplot() +
  geom_line(mapping = aes(x = end_of_week, y = death_crude_rate_suppressed_per_100k), color = 'red', na.rm = TRUE) +
  labs(title = "COVID-19 Weekly Death Rates by Race/Ethnicity (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "", y = "Death Rate") +
  theme_hc() + 
  theme(plot.subtitle = element_text(color = "darkgray")) +
  theme(panel.spacing = unit(0.5, "cm", data = NULL)) +
  facet_wrap(~race_ethnicity_combined, ncol = 2)

These plots reveal more variation among the groups.

  • The overall patterns for Black, NH and White, NH seem more similar to each other than to other groups.

  • The overall patterns for Asian/PI, NH and Hispanic seem more similar to each other than to other groups.

  • The overall pattern for AI/AN, NH is the most different. For most weeks in the dataset, the death rate for this group is missing (NA), meaning it was suppressed due the death count being 5 or fewer in the given week. When the death rates were recorded for this group, the rates tended to be relatively high.

Let’s calculate and compare the mean case rates and mean death rates by race/ethnicity group.

# calculate average case rate and average death rate by race/ethnicity group
covid |>
  filter(age_group == "Overall" & sex == "Overall" & race_ethnicity_combined != "Overall") |>
  group_by(race_ethnicity_combined) |>
  summarise(mean_case_rate = mean(case_crude_rate_suppressed_per_100k, na.rm = TRUE),
            mean_death_rate = mean(death_crude_rate_suppressed_per_100k, na.rm = TRUE)
  )
## # A tibble: 5 × 3
##   race_ethnicity_combined mean_case_rate mean_death_rate
##   <chr>                            <dbl>           <dbl>
## 1 AI/AN, NH                        146.             5.53
## 2 Asian/PI, NH                      94.3            1.18
## 3 Black, NH                        120.             1.95
## 4 Hispanic                         141.             1.43
## 5 White, NH                        114.             1.78

ANOVA Test for Case Rates by Race/Ethnicity

Let’s conduct an Analysis of Variance (ANOVA) test to determine if there are differences in the mean rates among the race/ethnicity groups.

For this test, we’ll create a dataframe with weekly COVID-19 data by race/ethnicity group.

# weekly data for race/ethnicity groups
weekly_race_eth <- covid |>
  filter(age_group == "Overall" & sex == "Overall" & race_ethnicity_combined != "Overall")

ANOVA testing assumes the data are independent and identically distributed. The distribution within each group should be (approximately) normal, and the variance within each group should be consistent.

Let’s examine histograms of the weekly case rates for each race/ethnicity group to see whether the distributions are approximately normal.

# histograms of weekly case rates by race/ethnicity group
weekly_race_eth |>
  ggplot() +
  geom_histogram(mapping = aes(x = case_crude_rate_suppressed_per_100k),
                 na.rm = TRUE, binwidth = 50) +
  labs(title = "COVID-19 Weekly Case Rates (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "Weekly Case Rate",
       y = "Frequency") +
  facet_wrap(~ race_ethnicity_combined, nrow = 1) +
  theme(plot.subtitle = element_text(colour = "darkgray"))

As the histograms show, the distributions of case rate are appear to have a right skew, as they are bound by zero and cannot have negative values. For zero-bound data, a log transform can often be used to produce values for the data that have a more normal distribution.

Let’s use add a new column to the dataframe that performs a log transform on the case rate.

# add new column for log(case rate)
weekly_race_eth <- weekly_race_eth |>
  mutate(log_case_rate = log(case_crude_rate_suppressed_per_100k))

Next, let’s examine histograms of the log(case rate) for each race/ethnicity group to see whether these distributions are closer to a normal distribution.

# histograms of log(case rate) by race/ethnicity group
weekly_race_eth |>
  ggplot() +
  geom_histogram(mapping = aes(x = log_case_rate),
                 na.rm = TRUE, binwidth = 0.25) +
  labs(title = "COVID-19 log(Case Rate) (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "log(Case Rate)",
       y = "Frequency") +
  facet_wrap(~ race_ethnicity_combined, nrow = 1) +
  theme(plot.subtitle = element_text(colour = "darkgray"))

Overall, these distributions are closer to normal than the original case rates.

Let’s calculate the variance within each race/ethnicity group for log(case rate).

# calculate variance and standard deviation in log(case rate) by race/ethnicity group
weekly_race_eth |>
  group_by(race_ethnicity_combined) |>
  summarise(
    log_case_var = var(log_case_rate, na.rm = TRUE),
    log_case_sd = sd(log_case_rate, na.rm = TRUE)
  )
## # A tibble: 5 × 3
##   race_ethnicity_combined log_case_var log_case_sd
##   <chr>                          <dbl>       <dbl>
## 1 AI/AN, NH                      1.19        1.09 
## 2 Asian/PI, NH                   0.975       0.987
## 3 Black, NH                      0.730       0.855
## 4 Hispanic                       1.03        1.01 
## 5 White, NH                      1.14        1.07

The variance and standard deviations are roughly similar across the groups.

While our dataset might not be an ideal match to the assumptions for an ANOVA test, let’s proceed.

The null hypothesis for the first ANOVA test will be:

\[ H_0 : \text{The average COVID-19 weekly log(case rate) is equal across all race/ethnicity groups.} \]

The alternative hypothesis is:

\[ H_a: \text{The average COVID-19 weekly log(case rate) is not equal across all race/ethnicity groups.} \]

We’ll use the ANOVA test to determine whether the null hypothesis should be rejected and whether the alternative hypothesis should be assumed to be accepted.

Let’s perform the ANOVA test.

# ANOVA test summary of log(case rate) by race/ethnicity group
case_m <- aov(log_case_rate ~ race_ethnicity_combined, data = weekly_race_eth)

summary(case_m)
##                          Df Sum Sq Mean Sq F value   Pr(>F)    
## race_ethnicity_combined   4   23.8   5.953   5.883 0.000112 ***
## Residuals               964  975.5   1.012                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness

The ANOVA test has a very low p-value and a high F value, which both indicate that the null hypothesis should likely be rejected, leading us to assume that the average log(case rate) is not equal across race/ethnicity groups. Thus, we assume there is a difference across race/ethnicity groups.

Next let’s conduct pairwise t-tests to identify which group(s) are different from the others.

# pairwise t-tests of race/ethnicity groups
pairwise.t.test(weekly_race_eth$log_case_rate,
                weekly_race_eth$race_ethnicity_combined, p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  weekly_race_eth$log_case_rate and weekly_race_eth$race_ethnicity_combined 
## 
##              AI/AN, NH Asian/PI, NH Black, NH Hispanic
## Asian/PI, NH 0.00419   -            -         -       
## Black, NH    1.00000   0.00517      -         -       
## Hispanic     1.00000   0.00046      1.00000   -       
## White, NH    0.45298   1.00000      0.51958   0.10726 
## 
## P value adjustment method: bonferroni

Here we see that the p-value is very low when the Asian/PI, NH group is compared to the groups for AI/AN, NH, Black, NH, and Hispanic. This indicates that we can reject the null hypothesis that this group has no difference in average log(case rate) compared to these other groups.

However, the p-value is high when Asian/PI, NH is compared to White, NH meaning we cannot reject the hypothesis that these two groups have no difference in average log(case rate).

Futhermore, the p-value is high when comparing any of the other groups to each other. This indicates that we cannot reject the null hypothesis that these groups have no difference in average log(case rate).

The lower case rate among the Asian/PI, NH group may be a result of this group being relatively smaller within the general population. Perhaps there are social or cultural factors that might explain why their case rates are lower compared to most other groups (except for White, NH).

ANOVA Test for Death Rates by Race/Ethnicity

Next, let’s examine histograms of the weekly death rates for each race/ethnicity group to see whether the distributions are approximately normal.

# histograms of weekly death rates by race/ethnicity group
weekly_race_eth |>
  ggplot() +
  geom_histogram(mapping = aes(x = death_crude_rate_suppressed_per_100k),
                 na.rm = TRUE, binwidth = 0.5) +
  labs(title = "COVID-19 Weekly Death Rates (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "Weekly Death Rate",
       y = "Frequency") +
  facet_wrap(~ race_ethnicity_combined, nrow = 1) +
  theme(plot.subtitle = element_text(colour = "darkgray"))

Again, the distributions of death rate appear to have a right skew, as they are bound by zero and cannot have negative values. The distribution for the American Indian/Alaska Native group is the most different.

Let’s add a log transform for death rate to see if this produces distributions that are closer to a normal distribution.

# add new column for log(death rate)
weekly_race_eth <- weekly_race_eth |>
  mutate(log_death_rate = log(death_crude_rate_suppressed_per_100k))

Now let’s examine histograms of the log(death rate) for each race/ethnicity group to see whether the distributions are approximately normal.

# histograms of log(death rate) by race/ethnicity group
weekly_race_eth |>
  ggplot() +
  geom_histogram(mapping = aes(x = log_death_rate),
                 na.rm = TRUE, binwidth = 0.25) +
  labs(title = "COVID-19 log(Death Rate) (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "log(Death Rate)",
       y = "Frequency") +
  facet_wrap(~ race_ethnicity_combined, nrow = 1) +
  theme(plot.subtitle = element_text(colour = "darkgray"))

Overall, these distributions are closer to normal than the original case rates (though the AI/AN, NH group is still quite different).

Let’s calculate the variance within each race/ethnicity group for log(death rate).

# calculate variance and standard deviation in log(death rate) by race/ethnicity group
weekly_race_eth |>
  group_by(race_ethnicity_combined) |>
  summarise(
    log_death_var = var(log_death_rate, na.rm = TRUE),
    log_death_sd = sd(log_death_rate, na.rm = TRUE)
  )
## # A tibble: 5 × 3
##   race_ethnicity_combined log_death_var log_death_sd
##   <chr>                           <dbl>        <dbl>
## 1 AI/AN, NH                       0.198        0.445
## 2 Asian/PI, NH                    0.577        0.760
## 3 Black, NH                       1.46         1.21 
## 4 Hispanic                        1.29         1.14 
## 5 White, NH                       1.52         1.23

The variance and standard deviations for the Black, Hispanic, and White groups are roughly similar. The AI/AN and Asian/PI group are different from each other and from the other three groups.

Again, while our dataset might not be an ideal match to the assumptions for an ANOVA test, let’s proceed.

The null hypothesis for the second ANOVA test will be:

\[ H_0 : \text{The average COVID-19 weekly log(death rate) is equal across all race/ethnicity groups.} \]

The alternative hypothesis is:

\[ H_a: \text{The average COVID-19 weekly log(death rate) is not equal across all race/ethnicity groups.} \]

Let’s perform the ANOVA test.

# ANOVA test summary of log(death rate) by race/ethnicity group
death_m <- aov(log_death_rate ~ race_ethnicity_combined, data = weekly_race_eth)

summary(death_m)
##                          Df Sum Sq Mean Sq F value Pr(>F)    
## race_ethnicity_combined   4  108.5  27.135   22.18 <2e-16 ***
## Residuals               652  797.5   1.223                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 313 observations deleted due to missingness

The ANOVA test has a very low p-value and a very high F value, which both indicate that the null hypothesis should likely be rejected, leading us to assume that the average log(case rate) is not equal across race/ethnicity groups. Thus, we assume there is a difference across race/ethnicity groups.

Next let’s conduct pairwise t-tests to identify which group(s) are different from the others.

# pairwise t-tests of race/ethnicity groups
pairwise.t.test(weekly_race_eth$log_death_rate,
                weekly_race_eth$race_ethnicity_combined, p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  weekly_race_eth$log_death_rate and weekly_race_eth$race_ethnicity_combined 
## 
##              AI/AN, NH Asian/PI, NH Black, NH Hispanic
## Asian/PI, NH 5.0e-15   -            -         -       
## Black, NH    1.6e-14   1.00         -         -       
## Hispanic     < 2e-16   1.00         0.86      -       
## White, NH    3.7e-16   1.00         1.00      1.00    
## 
## P value adjustment method: bonferroni

Here we see that the p-value is extremely low when the AI/AN, NH group is compared to each of the other groups. This indicates that we can reject the null hypothesis that this group has no difference in average log(death rate) compared to each of the other groups.

However, the p-value is high when comparing any of the other groups to each other. This indicates that we cannot reject the null hypothesis that these groups have no difference in average log(death rate).

Fortunately, there do not appear to be significant differences in mean death rates among the race/ethnicity groups, except for the American Indian / Alaska Native group - although this may be the result of the CDC suppression of data for low counts.

For most of the weeks in the dataset, the AI/AN, NH group had missing values (NA) for death count and death rate, indicating the true value for the death count in those weeks was 5 or fewer and was thus suppressed by the CDC.

# count number of weeks of missing (suppressed) data for AI/AN death rate
weekly_race_eth |>
  filter(race_ethnicity_combined == 'AI/AN, NH' & is.na(death_crude_rate_suppressed_per_100k)) |>
  summarise(missing_weeks = n())
## # A tibble: 1 × 1
##   missing_weeks
##           <int>
## 1           156

Out of 194 weeks in the dataset, the death rate for AI/AN, NH was missing for 156 of those weeks.

Since NA values are excluded from the calculation used for mean, this would artificially increase the mean death rate for this group (especially since the true rates for those weeks were either zero or very low). If this is indeed true, then there might not be any difference in mean death rates across race/ethnicity groups.

If anything, the lack of strong evidence for a difference in death rates among race/ethnicity groups reinforces the fact that everyone within a community is at risk and should take appropriate preventative measures to protect their own health as well as the health of the community at large. COVID-19 is an equal opportunity virus.

Lastly, we do know that there are health inequities (such as health insurance coverage, access to healthcare, etc.) among race/ethnicity groups, so it is imperative that public health officials continue to monitor outcomes by different sociodemographic variables to identify inequities that may exist (or may arise), so that public policy makers can address those inequities as feasible.


Conclusion

Based on these analyses, here is a summary of the insights and recommendations from this project:

Insight: Pattern in death counts mirrors case counts

Recommendation: Preventing and reducing cases within the community is the most effective way to reduce deaths among the most vulnerable within the community

Insight: Cases and deaths follow yearly seasonal pattern

Recommendation: Prevention efforts should be timed to reduce cases (and thus deaths) during projected peaks of yearly cycle

Insight: Females have higher case rates

Recommendation: Prevention efforts targeted to females could better protect them from cases

Insight: Males have higher death rates

Recommendation: Prevention efforts targeted to males could better protect them from deaths

Insight: Older age groups have higher death rates

Recommendation: Prevention efforts targeted to older groups (and those that interact closely with them) could help reduce deaths

Insight: Death rates might not vary among race/ethnicity groups

Recommendation: Prevention efforts should be targeted to and accessible by all race/ethnicity groups to help ensure equitable health outcomes


Copyright © 2024 Michael Frontz

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. You are free to use, share, or adapt this material for non-commercial purposes as long as you provide proper attribution and distribute any copies or adaptations under this same license.