COVID-19 Weekly Cases and Deaths by Age, Race/Ethnicity, and Sex (Mar 2020 - Nov 2023)

This data dive further explores a CDC dataset that provides weekly data of COVID-19 cases and deaths reported in the states of Illinois, Indiana, Michigan, Minnesota, Ohio, and Wisconsin from March 7, 2020 through November 18, 2023.

The week’s data dive focuses on:


Prepare Dataset

To get started, let’s load a couple of R packages to assist with our analysis.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)

Next, let’s read in the dataset from CSV.

covid <- read_delim("./COVID_weekly_cases_deaths_region5.csv", delim = ",")
## Rows: 37867 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): end_of_week, jurisdiction, age_group, sex, race_ethnicity_combined
## dbl (4): case_count_suppressed, death_count_suppressed, case_crude_rate_supp...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Let’s convert end_of_week from a character format to a date format to help with our analysis.

covid$end_of_week <- as.Date(covid$end_of_week, format="%m/%d/%y")

Let’s add a new column to indicate the order of age groups, and then arrange all the rows in nested order by week, age, sex, and race/ethnicity.

# Add new column to order age groups correctly
covid <- covid |>
  mutate(age_order = case_when(
    age_group == "Overall" ~ 0L,
    age_group == "0 - 4 Years" ~ 1L,
    age_group == "5 - 11 Years" ~ 2L,
    age_group == "12 - 15 Years" ~ 3L,
    age_group == "16 - 17 Years" ~ 4L,
    age_group == "18 - 29 Years" ~ 5L,
    age_group == "30 - 39 Years" ~ 6L,
    age_group == "40 - 49 Years" ~ 7L,
    age_group == "50 - 64 Years" ~ 8L,
    age_group == "65 - 74 Years" ~ 9L,
    age_group == "75+ Years" ~ 10L),
    .after = age_group) |>
  arrange(end_of_week, age_order, sex, race_ethnicity_combined)

Percent of Weekly Cases and Weekly Deaths by Sex At Birth

Let’s the weekly case counts and death counts to create two new columns that calculate the percent of cases and the percent of deaths each week by sex at birth (female vs male).

# For each week, calculate percent of cases and percent of deaths by Sex at Birth
covid_sex_percent <- covid |>
  filter(race_ethnicity_combined == "Overall" & age_group == "Overall" & sex != "Overall" ) |>
  group_by(end_of_week) |>
  mutate(percent_weekly_cases = round(case_count_suppressed / sum(case_count_suppressed, na.rm = TRUE), 3)) |>
  mutate(percent_weekly_deaths = round(death_count_suppressed / sum(death_count_suppressed, na.rm = TRUE), 3))

Let’s plot the weekly values for percent of cases vs. percent of deaths by sex at birth.

covid_sex_percent |>
  ggplot() +
  geom_point(mapping = aes(x = percent_weekly_cases, y = percent_weekly_deaths, color = sex), na.rm = TRUE) +
  labs(x = "Percent of Weekly Cases",
       y = "Percent of Weekly Deaths",
       title = "Percent of Weekly Cases vs. Percent of Weekly Deaths by Sex",
       subtitle = "COVID-19 Data for IL, IN, MI, MN, OH, WI (March 7, 2020 - November 18, 2023)") +
  theme_minimal() +
  theme(plot.subtitle = element_text(colour = "darkgray"))

Within the plot, there is nearly complete separation of the data points for females versus males. For almost every week in the dataset, females accounted for more of the weekly cases (i.e., 50% or higher). However, males tended to account for more of the weekly deaths, despite accounting for fewer of the weekly cases.

Correlation Coefficient

Let’s calculate the Pearson’s correlation coefficient for percent of weekly cases and percent of weekly deaths by sex at birth.

round(cor(covid_sex_percent$percent_weekly_cases, covid_sex_percent$percent_weekly_deaths, use="complete.obs"), 2)
## [1] -0.36

The Pearson’s correlation coefficient indicates a negative linear relationship between percent of weekly cases and percent of weekly deaths: as the percent of weekly cases increases, the percent of weekly deaths decreases. However, this correlation is calculated from the combined data for females and males, which form two distinct clusters in the plot.

For comparison, let’s calculate the Pearson’s correlation coefficient within each sex group.

covid_sex_percent |>
  group_by(sex) |>
  summarise(cor = round(cor(percent_weekly_cases, percent_weekly_deaths, use="complete.obs"), 2))
## # A tibble: 2 × 2
##   sex      cor
##   <chr>  <dbl>
## 1 Female  0.35
## 2 Male    0.35

The Pearson’s correlation coefficient within each sex group is a positive value and is the same for females compared to males. Both values indicate a positive linear relationship between the percent of weekly cases and the percent of weekly deaths within each sex: as the percent of weekly cases increases, the percent of weekly deaths also increases.

This is a good example of how a correlation between two variables can be confounded by a third variable.


Difference in Case Rates and Death Rates by Race/Ethnicity Group

Let’s use the weekly case rates and death rates to create two new columns that calculate the difference between the case rate and death rate for each race/ethnicity group compared to the mean rates for all race/ethnicity groups. This will indicate how much higher or lower a particular group’s rate was compared to the overall mean rate.

covid_race_eth_diff <- covid |>
  filter(age_group == "Overall" & sex == "Overall" & race_ethnicity_combined != "Overall") |>
  group_by(end_of_week) |>
  mutate(case_rate_diff = round(case_crude_rate_suppressed_per_100k - mean(case_crude_rate_suppressed_per_100k, na.rm = TRUE), 2)) |>
  mutate(death_rate_diff = round(death_crude_rate_suppressed_per_100k - mean(death_crude_rate_suppressed_per_100k, na.rm = TRUE), 2))

Let’s plot the weekly values for difference in case rate vs. difference in death rate by race/ethnicity group.

covid_race_eth_diff |>
  ggplot() +
  geom_point(mapping = aes(x = case_rate_diff, y = death_rate_diff, color = race_ethnicity_combined), na.rm = TRUE, alpha = 0.7) +
  labs(x = "Difference in Weekly Case Rate",
       y = "Difference in Weekly Death Rate",
       title = "Differences in Weekly Case Rate vs. Weekly Death Rate by Race/Ethnicity",
       subtitle = "COVID-19 Data for IL, IN, MI, MN, OH, WI (March 7, 2020 - November 18, 2023)") +
  theme_minimal() +
  theme(plot.subtitle = element_text(colour = "darkgray"))

While it is somewhat challenging visually to compare all five race/ethnicity groups simultaneously in a single plot, it does seems clear that their patterns vary from one another.

For example, the points for the Asian/PI, NH group are clustered in a diagonal band in the lower left quadrant (weekly case rate lower than mean and weekly death rate lower than mean). By comparison, the points for the Hispanic group fall primarily in the lower right quadrant (weekly case rate higher than mean but weekly death rate lower than mean), though they appear to be widely dispersed horizontally.

Correlation Coefficient

Let’s calculate the Pearson’s correlation coefficient for difference in weekly case rate and difference in weekly death rate by race/ethnicity.

round(cor(covid_race_eth_diff$case_rate_diff, covid_race_eth_diff$death_rate_diff, use="complete.obs"), 2)
## [1] 0.33

The Pearson’s correlation coefficient indicates a positive linear relationship between the difference in weekly case rate and the difference in weekly death rate: as the difference in weekly case rate increases, the difference in weekly death rate also increases. However, this correlation is calculated from the combined data for all race/ethnicity groups, which appear to have different patterns in the plot.

For comparison, let’s calculate the Pearson’s correlation coefficient within each race/ethnicity group.

covid_race_eth_diff |>
  group_by(race_ethnicity_combined) |>
  summarise(cor = round(cor(case_rate_diff, death_rate_diff, use="complete.obs"), 2))
## # A tibble: 5 × 2
##   race_ethnicity_combined   cor
##   <chr>                   <dbl>
## 1 AI/AN, NH                0.13
## 2 Asian/PI, NH             0.78
## 3 Black, NH                0.2 
## 4 Hispanic                 0.07
## 5 White, NH                0.31

The Pearson’s correlation coefficient varies widely among the race/ethnicity groups. The value is very high for the Asian/PI, NH group indicating a strong positive correlation within this group. However, the value for the Hispanic group is close to zero, indicating almost no correlation within this group. This matches with the visual patterns seen in the plot.


Percent of Weekly Deaths by Age Group

Finally, let’s explore two variables that form a likely pair as an explanatory variable (independent variable) and a response variable (dependent variable): age group and weekly deaths.

In general, people’s immune systems become less effective with increasing age. In addition, older people may have other underlying conditions that would make them more susceptible to death from COVID-19. Thus, we might expect age group to affect the number of weekly deaths from COVID-19.

Let’s use the weekly death counts to create a new column that calculates the percent of deaths each week by age group (an ordered discrete variable).

# For each week, calculate percent of deaths by Age Group
covid_age_percent <- covid |>
  filter(race_ethnicity_combined == "Overall" & sex == "Overall" & age_group != "Overall") |>
  group_by(end_of_week) |>
  mutate(percent_weekly_deaths = round(death_count_suppressed / sum(death_count_suppressed, na.rm = TRUE), 3))

Let’s plot the weekly values for percent of deaths by age group.

covid_age_percent |>
  ggplot() +
  geom_point(mapping = aes(x = fct_reorder(age_group, age_order), y = percent_weekly_deaths), 
             na.rm = TRUE, size = 0.5) +
  scale_x_discrete(guide = guide_axis(angle = 90)) +
  labs(x = "Age Group",
       y = "Percent of Weekly Deaths",
       title = "Age Group vs. Percent of Weekly Deaths",
       subtitle = "COVID-19 Data for IL, IN, MI, MN, OH, WI (March 7, 2020 - November 18, 2023)") +
  theme_minimal() +
  theme(plot.subtitle = element_text(colour = "darkgray"))

The plot seems to show a strong effect of age group on the percent of weekly deaths. In particular, the oldest age group accounts for a much higher percentage of weekly deaths.

Correlation Coefficient

Because age group is a discrete variable, we’ll calculate the Spearman’s rank correlation for age group and percent of weekly deaths. For this calculation, we’ll substitute the age_order variable, which represents the ordered ranks for the age groups.

# Note: age_order represents the rank of each ordered age_group
round(cor(as.numeric(covid_age_percent$age_order), covid_age_percent$percent_weekly_deaths, use="complete.obs", method = "spearman"), 2)
## [1] 0.94

The Spearman’s rank correlation indicates a very strong positive relationship between age group and percent of weekly deaths: as age group increases, the percent of weekly deaths also increases. Based on the plot, it appears that this relationship is more likely to be exponential than linear.

Bootstrapping, Standard Error, and Confidence Interval

Let’s use bootstrapping to find the standard error for a sample statistic and use that to calculate the 95% confidence interval for the population parameter.

Bootstrapping involves simulating the selection of numerous random samples from the original dataset. Each random sample will be as large as the original dataset (i.e., same number of rows), but the sampling will be done with replacement, meaning that some rows will be included more than once in a given sample while other rows might not be included in a given sample.

For each sample, a statistic will be calculated (such as the mean of a variable), and the set of statistics from all the simulated samples will form a sampling distribution that is a Gaussian (normal) distribution.

Here is a generic bootstrap function that accepts a dataset and defaults to calculating a mean from 10,000 random samples with replacement (though the function can be called with arguments to change the statistic and the number of samples).

# bootstrap function defaults to calculating means for 10000 random samples from dataset x
bootstrap <- function (x, func=mean, n_iter=10^4) {
  # empty vector to be filled with values from each iteration
  func_values <- c(NULL)
  
  # simulate sampling `n_iter` times
  for (i in 1:n_iter) {
    # get random sample that is same size as original dataset but sampled with replacement
    x_sample <- sample(x, size = length(x), replace = TRUE)
    
    # add this sample's value to the set
    func_values <- c(func_values, func(x_sample, na.rm = TRUE))
  }
  
  return(func_values)
}

For this bootstrapping simulation, we’ll focus on the percent of weekly deaths for the oldest age group: people age 75+ years.

First, we’ll create a filtered dataframe containing just the weekly data for this age group. Then we’ll use the bootstrap function to generate a random sampling distribution of the mean percent of weekly deaths for the 75+ Years age group.

age75plus_deaths <- covid_age_percent |>
  filter(age_group == "75+ Years") |>
  pluck("percent_weekly_deaths")

percent_death_means <- bootstrap(age75plus_deaths)

Let’s plot a histogram of our bootstrap samples to see if the simulated sampling distribution approximates a normal distribution.

ggplot() +
  geom_histogram(mapping = aes(x = percent_death_means), binwidth = 0.001, color='white') +
  labs(title = "10K Bootstrapped Sample Means of Weekly Percent Deaths for Age 75+ Years",
       subtitle = "A *simulation* of the true sampling distribution",
       x = "Bootstrapped Sample Mean",
       y = "Number of Samples") +
  theme_minimal()

The bootstrapped sampling distribution does appear to be approximately normal in distribution. Let’s use the standard deviation of this bootstrapped sampling distribution to estimate the standard error of the true sampling distribution.

percent_deaths_se_boot <- sd(percent_death_means)
percent_deaths_se_boot
## [1] 0.00972375

Let’s calculate the mean percent of weekly deaths from our original sample dataset.

avg_percent_deaths <- mean(age75plus_deaths, na.rm = TRUE)
avg_percent_deaths
## [1] 0.6208446

Now we can use this mean percent of weekly deaths from the original dataset and the estimated standard error from our bootstrap simulation to calculate the 95% confidence interval for the true parameter of percent of weekly deaths for the 75+ Years age group.

P <- .95  # % confidence
n <- length(percent_death_means)

# z-score
z_score <- qnorm(p=(1 - P)/2, lower.tail=FALSE)

# z half-width
CI_z <- z_score * percent_deaths_se_boot

c(avg_percent_deaths - CI_z, avg_percent_deaths + CI_z)
## [1] 0.6017864 0.6399028

By using bootstrapping with our original dataset, we’re able to determine the true population mean for the percent of weekly deaths for the age 75+ Years group will be within this interval 95% of the time. This also confirms the strong affect of age on the percent of weekly deaths, with this oldest age group accounting for the majority of weekly deaths during the timeframe of this data collection.


This dataset will be explored further in subsequent data dives.