COVID-19 Weekly Cases and Deaths by Age, Race/Ethnicity, and Sex (Mar 2020 - Nov 2023)

This data dive will focus on hypothesis testing using a CDC dataset that contains weekly counts and rates of COVID-19 cases and deaths reported in Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, and Wisconsin) of the United States from March 7, 2020 through November 18, 2023.


Load Libraries and Dataset

To get started, we’ll load a few R packages that will assist with our analysis.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(effsize)  # Needed to calculate effect size using cohen.d
library(pwrss)    # Needed to calculate sample size using pwrss.t.2means
## 
## Attaching package: 'pwrss'
## 
## The following object is masked from 'package:stats':
## 
##     power.t.test

Next, let’s read in the dataset from CSV.

covid <- read_delim("./COVID_weekly_cases_deaths_region5.csv", delim = ",")
## Rows: 37867 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): end_of_week, jurisdiction, age_group, sex, race_ethnicity_combined
## dbl (4): case_count_suppressed, death_count_suppressed, case_crude_rate_supp...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Hypothesis Testing - Case Rate and Sex At Birth

The first hypothesis test will involve comparing the weekly COVID-19 case rates by sex at birth.

First, we’ll prepare a dataframe that filters out the rows with “Overall” subtotals that sum the data for other rows representing the same week. This will leave a set of rows for each week that provides data for distinct demographic subgroups.

We’ll also filter out rows with missing data for the case rate, since these rows will have to be excluded when calculating a mean.

# WEEKLY CASE RATE DATA
# Exclude rows that summarize overall subtotals for each week
# Exclude rows with missing case rate (NA)
weekly_case_data <- covid |>
  filter(age_group != "Overall" & sex != "Overall" & race_ethnicity_combined != "Overall") |>
  filter(!is.na(case_crude_rate_suppressed_per_100k))

Let’s create a boxplot to see how the distribution of weekly case rates compares for females versus males.

# Boxplot of Weekly Case Rates by Sex At Birth
weekly_case_data |>
  ggplot() +
  geom_boxplot(mapping = aes(x = case_crude_rate_suppressed_per_100k, y = sex, fill = sex), show.legend = FALSE) +
  labs(title = "COVID-19 Weekly Case Rates by Sex (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
       x = "Weekly Case Rate", y = "Sex At Birth") +
  scale_y_discrete(limits = unique(rev(weekly_case_data$sex))) +
  theme_minimal() +
  theme(plot.subtitle = element_text(colour = "darkgray"))

The boxplot shows that while most of the weekly case rates have lower values (less than 250), there are numerous outliers extending to the right, representing weekly case rates with much higher values (up to 2000 or more).

Null Hypothesis and Alternative Hypothesis

The null hypothesis to be tested is:

\(H_0: \text{Average weekly case rate is equal for both sexes.}\)

The alternative hypothesis is:

\(H_a: \text{Average weekly case rate is not equal for both sexes.}\)

Through hypothesis testing, we’ll attempt to determine if the null hypothesis should be rejected. If we were to reject the null hypothesis as false, we would logically assume the alternative hypothesis should be accepted instead.

Here are the observed mean weekly case rates by sex calculated across the entire dataset:

# Observed Mean Weekly Case Rates by Sex At Birth
avg_case_rates_by_sex <- weekly_case_data |>
  group_by(sex) |>
  summarise(num_data = n(),
            mean_case_rate = mean(case_crude_rate_suppressed_per_100k),
            sd_case_rate = sd(case_crude_rate_suppressed_per_100k)
            ) |>
  arrange(sex)

avg_case_rates_by_sex |>
  select(c(sex, mean_case_rate))
## # A tibble: 2 × 2
##   sex    mean_case_rate
##   <chr>           <dbl>
## 1 Female           137.
## 2 Male             120.

The observed difference in the mean case rates by sex is:

# Observed Difference in Mean Weekly Case Rates
observed_diff_case_rate <- avg_case_rates_by_sex$mean_case_rate[2] - avg_case_rates_by_sex$mean_case_rate[1]

paste("Observed Difference in Mean Case Rate (Male - Female) =", round(observed_diff_case_rate, 1))
## [1] "Observed Difference in Mean Case Rate (Male - Female) = -17.5"

However, merely calculating a difference in the mean case rates is insufficient to test the null hypothesis. We’ll need to consider several parameters important to hypothesis testing.

Parameters for Hypothesis Testing

For this hypothesis testing, we’ll use an alpha level (\(\alpha\)) of 0.01 to determine whether to reject the null hypothesis as false. Traditionally, many research studies have used an alpha level of 0.05 as the basis for determining whether to reject a null hypothesis, though some argue that this practice has lead to problems. Choosing a lower alpha level of 0.01 will reduce the false negative rate, which is the likelihood of rejecting the null hypothesis when it is actually true (Type I Error).

We’ll use a power level (\(1 - \beta\)) of 0.9 which represents the probability of correctly rejecting a false null hypothesis. Traditionally, many research studies use a power level of 0.8. The value of \(\beta\) represents the false positive rate, meaning the likelihood of failing to reject the null hypothesis when it is actually false (Type II Error). Choosing a higher power level will help reduce the false positive rate.

The minimum effect size will be calculated from the dataset using Cohen’s d since we don’t have any particular reason to argue for selecting another value.

# Calculate effect size using Cohen's d function
cohen.d(d = filter(weekly_case_data, sex == 'Female') |> pluck("case_crude_rate_suppressed_per_100k"),
        f = filter(weekly_case_data, sex == 'Male') |> pluck("case_crude_rate_suppressed_per_100k"))
## 
## Cohen's d
## 
## d estimate: 0.1113157 (negligible)
## 95 percent confidence interval:
##      lower      upper 
## 0.08154922 0.14108220

Minimum Sample Size

The table below shows the actual number of data rows for each sex, along with their mean case rates and the standard deviation for their case rates.

avg_case_rates_by_sex
## # A tibble: 2 × 4
##   sex    num_data mean_case_rate sd_case_rate
##   <chr>     <int>          <dbl>        <dbl>
## 1 Female     8739           137.         171.
## 2 Male       8633           120.         142.

We can use the values for our alpha level, power level, and effect size (Cohen’s d) to calculate the minimum sample size needed to test the difference in the two means.

Since we’re using Cohen’s d for mu1, we can set sd1 to 1. Kappa represents the ratio of the actual sample sizes (\(n_{1}/n_{2}\)) for the two samples, which were obtained from the table above.

# Get sample size using pwrss.t.2means from pwrss library
pwrss.t.2means(mu1 = 0.1113157,
                       sd1 = 1,
                       kappa = 8739/8633,
                       power = 0.9, alpha = 0.01, 
                       alternative = "not equal")
##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.9 
##   n1 = 2418 
##   n2 = 2389 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 4805 
##  Non-centrality parameter = 3.859 
##  Type I error rate = 0.01 
##  Type II error rate = 0.1

The calculation for minimum sample size indicates that we will need a minimum of 2418 data points for females (\(n_{1}\)) and 2379 data points for males (\(n_{2}\)), so our existing dataset is sufficient for both samples.

Neyman-Pearson Hypothesis Test

Now can proceed with our hypothesis test. Because we’re comparing the means of two independent samples, we’ll use a two-tailed t-test to calculate the p-value for our observed difference in means.

# Welch's t-test (compare means of two independent samples)
t.test(case_crude_rate_suppressed_per_100k ~ sex, data = weekly_case_data, alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  case_crude_rate_suppressed_per_100k by sex
## t = 7.3441, df = 16849, p-value = 2.165e-13
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  12.82750 22.16743
## sample estimates:
## mean in group Female   mean in group Male 
##             137.1420             119.6445

The calculated p-value from the t-test is much less than our established alpha level of 0.01, which would lead us to reject the null hypothesis as false and assume that the alternative hypothesis is true: that the average weekly case rates by sex are not equal.

Fisher’s Significance Test

Let’s use bootstrapping as another method to calculate a p-value for our observed difference in mean case rates by generating a simulated sampling distribution of our dataset.

First, let’s create a generic bootstrapping function that will randomly sample the dataset (with replacement) for a given number of iterations. This will simulate other samples that might have been collected, allowing us to calculate a statistic for each of these simulated samples.

# Bootstrapping function to generate multiple random samples from original dataset
bootstrap <- function (x, func=mean, n_iter=10^4) {
  # empty vector to store values from each iteration
  func_values <- c(NULL)
  
  # simulate sampling `n_iter` times
  for (i in 1:n_iter) {
    # random sample with replacement (same length as original dataset)
    x_sample <- sample_n(x, size = length(x), replace = TRUE)
    
    # add this iteration's statistic value to set
    func_values <- c(func_values, func(x_sample))
  }
  
  return(func_values)
}

Next, we’ll create a function to calculate the difference in mean case rates by sex for each simulated sample. This function will be passed into the bootstrap function as an argument.

# Function to calculate difference in mean case rates by sex for random sample
# Will be passed into bootstrap function as argument
diff_avg_case_rate <- function (x_data) {
  avg_case_rates <- x_data |>
    group_by(sex) |>
    summarize(mean_case_rate = mean(case_crude_rate_suppressed_per_100k)) |>
    arrange(sex)
  
  # difference in mean case rate (Male - Female)
  diff <- (avg_case_rates$mean_case_rate[2] - 
           avg_case_rates$mean_case_rate[1])
  
  return(diff)
}

Now we can run our bootstrapping simulation.

diffs_avg_case_rates <- bootstrap(weekly_case_data, diff_avg_case_rate, n_iter = 100)

Next we can plot our simulated sampling distribution with a red line added to indicate our observed difference in mean case rates by sex from the original dataset.

f_sampling <- function(x) dnorm(x, mean = 0, 
                                sd = sd(diffs_avg_case_rates))

ggplot() +
  stat_function(mapping = aes(fill = 'More extreme samples'),
                fun = f_sampling, 
                xlim = c(abs(observed_diff_case_rate), 400),
                geom = "area") +
  stat_function(mapping = aes(fill = 'More extreme samples'),
                fun = f_sampling, 
                xlim = c(-400, -abs(observed_diff_case_rate)),
                geom = "area") +
  geom_function(xlim = c(-400, 400), 
                fun = f_sampling) +
  geom_vline(mapping = aes(xintercept = observed_diff_case_rate,
                           color = paste("Observed: ",
                                         round(observed_diff_case_rate, 1)))) +
  labs(title = "Bootstrapped Sampling Distribution of Mean Weekly Case Rate Differences",
       x = "Difference in Mean Weekly Case Rate",
       y = "Probability Density",
       color = "",
       fill = "") +
  scale_x_continuous(breaks = seq(-400, 400, 100)) +
  scale_fill_manual(values = 'lightblue') +
  theme_minimal()

The shaded areas in the plot above represent samples that would have resulted in a more extreme difference in means. This represents the p-value, which is the proportion of samples in which the difference in means would be more extreme than our observed difference.

# "demean" the bootstrapped samples to simulate mu = 0
diffs_avg_case_rates_d <- diffs_avg_case_rates - mean(diffs_avg_case_rates)

# proportion of times the difference is more extreme
paste("p-value =", 
      sum(abs(observed_diff_case_rate) < abs(diffs_avg_case_rates_d)) /
        length(diffs_avg_case_rates_d))
## [1] "p-value = 0.83"

This p-value is much greater than our alpha level (0.01) and indicates that it would be very likely to have selected another sample with a more extreme difference in means. This calculation of p-value would lead us to not reject the null hypothesis and instead assume the null hypothesis is true: that the average weekly case rate is equal for both sexes.


Hypothesis Testing - Death Rate and Sex At Birth

The second hypothesis test will involve comparing the weekly COVID-19 death rates by sex at birth.

First, we’ll prepare a dataframe that filters out the rows with “Overall” subtotals that sum the data for other rows representing the same week. This will leave a set of rows for each week that provides data for distinct demographic subgroups.

We’ll also filter out rows with missing data for the death rate, since these rows will have to be excluded when calculating a mean.

# WEEKLY DEATH RATE DATA
# Exclude rows that summarize overall subtotals for each week
# Exclude rows with missing case rate (NA)
weekly_death_data <- covid |>
  filter(age_group != "Overall" & sex != "Overall" & race_ethnicity_combined != "Overall") |>
  filter(!is.na(death_crude_rate_suppressed_per_100k))

Let’s create a boxplot to see how the distribution of weekly death rates compares for females versus males.

# Boxplot of Weekly Death Rates by Sex At Birth
weekly_death_data |>
  ggplot() +
  geom_boxplot(mapping = aes(x = death_crude_rate_suppressed_per_100k, y = sex, fill = sex), show.legend = FALSE) +
  labs(title = "COVID-19 Weekly Death Rates by Sex (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
       x = "Weekly Death Rate", y = "Sex At Birth") +
  scale_y_discrete(limits = unique(rev(weekly_death_data$sex))) +
  theme_minimal() +
  theme(plot.subtitle = element_text(colour = "darkgray"))

The boxplot shows that while most of the weekly death rates have lower values (less than 25), there are numerous outliers extending to the right, representing weekly death rates with much higher values (up to 100 or more).

Null Hypothesis and Alternative Hypothesis

The null hypothesis to be tested is:

\(H_0: \text{Average weekly death rate is equal for both sexes.}\)

The alternative hypothesis is:

\(H_a: \text{Average weekly death rate is not equal for both sexes.}\)

As we did previously, we’ll attempt to determine if the null hypothesis should be rejected. If we were to reject the null hypothesis as false, we would logically assume the alternative hypothesis should be accepted instead.

Here are the observed mean weekly death rates by sex calculated across the entire dataset:

# Observed Mean Weekly Death Rates by Sex At Birth
avg_death_rates_by_sex <- weekly_death_data |>
  group_by(sex) |>
  summarise(num_data = n(),
            mean_death_rate = mean(death_crude_rate_suppressed_per_100k),
            sd_death_rate = sd(death_crude_rate_suppressed_per_100k)
            ) |>
  arrange(sex)

avg_death_rates_by_sex |>
  select(c(sex, mean_death_rate))
## # A tibble: 2 × 2
##   sex    mean_death_rate
##   <chr>            <dbl>
## 1 Female            9.32
## 2 Male             12.8

The observed difference in the mean death rates by sex is:

# Observed Difference in Mean Weekly Death Rates
observed_diff_death_rate <- avg_death_rates_by_sex$mean_death_rate[2] - avg_death_rates_by_sex$mean_death_rate[1]

paste("Observed Difference in Mean Death Rate (Male - Female) =", round(observed_diff_death_rate, 2))
## [1] "Observed Difference in Mean Death Rate (Male - Female) = 3.46"

Parameters for Hypothesis Testing

For this hypothesis testing, we’ll also use an alpha level (\(\alpha\)) of 0.01 to determine whether to reject the null hypothesis as false. Choosing a lower alpha level of 0.01 will reduce the false negative rate, which is the likelihood of rejecting the null hypothesis when it is actually true (Type I Error).

As before, we’ll use a power level (\(1 - \beta\)) of 0.9 which represents the probability of correctly rejecting a false null hypothesis. The value of \(\beta\) represents the false positive rate, meaning the likelihood of failing to reject the null hypothesis when it is actually false (Type II Error). Choosing a higher power level will help reduce the false positive rate.

The minimum effect size will be calculated from the dataset using Cohen’s d since we don’t have any particular reason to argue for selecting another value.

# Caclculate effect size using Cohen's d function from effsize library
cohen.d(d = filter(weekly_death_data, sex == 'Female') |> pluck("death_crude_rate_suppressed_per_100k"),
        f = filter(weekly_death_data, sex == 'Male') |> pluck("death_crude_rate_suppressed_per_100k"))
## 
## Cohen's d
## 
## d estimate: -0.1949563 (negligible)
## 95 percent confidence interval:
##      lower      upper 
## -0.2724663 -0.1174463

Minimum Sample Size

The table below shows the actual number of data rows for each sex, along with their mean death rates and the standard deviation for their death rates. (There are far fewer rows for death rate data because more rows were removed due to missing data for death rate.)

avg_death_rates_by_sex
## # A tibble: 2 × 4
##   sex    num_data mean_death_rate sd_death_rate
##   <chr>     <int>           <dbl>         <dbl>
## 1 Female     1190            9.32          14.2
## 2 Male       1399           12.8           20.3

We can use the values for our alpha level, power level, and effect size (Cohen’s d) to calculate the minimum sample size needed to test the difference in the two means.

Since we’re using Cohen’s d for mu1, we can set sd1 to 1. Kappa represents the ratio of the actual sample sizes (\(n_{1}/n_{2}\)) for the two samples, which were obtained from the table above.

# Get sample size using pwrss.t.2means from pwrss library
pwrss.t.2means(mu1 = -0.1949563,
                       sd1 = 1,
                       kappa = 1190/1399,
                       power = 0.9, alpha = 0.01, 
                       alternative = "not equal")
##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.9 
##   n1 = 726 
##   n2 = 854 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 1578 
##  Non-centrality parameter = -3.862 
##  Type I error rate = 0.01 
##  Type II error rate = 0.1

The calculation for minimum sample size indicates that we will need a minimum of 726 data points for females (\(n_{1}\)) and 854 data points for males (\(n_{2}\)), so our existing dataset is sufficient for both samples.

Neyman-Pearson Hypothesis Test

Now can proceed with our hypothesis test. Again, because we’re comparing the means of two independent samples, we’ll use a two-tailed t-test to calculate the p-value for our observed difference in means.

# Welch's t-test (compare means of two independent samples)
t.test(death_crude_rate_suppressed_per_100k ~ sex, data = weekly_death_data, alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  death_crude_rate_suppressed_per_100k by sex
## t = -5.0813, df = 2500.2, p-value = 4.025e-07
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -4.799771 -2.126775
## sample estimates:
## mean in group Female   mean in group Male 
##             9.319622            12.782895

The calculated p-value from the t-test is much less than our established alpha level of 0.01, which would lead us to reject the null hypothesis as false and assume that the alternative hypothesis is true: that the average weekly death rates by sex are not equal.

Fisher’s Significance Test

Once again, let’s use bootstrapping as another method to calculate a p-value for our observed difference in mean death rates by generating a simulated sampling distribution of our dataset.

We can re-use our generic bootstrap function. However, we need to create a function to calculate the difference in mean death rates by sex for each simulated sample. This function will be passed into the bootstrap function as an argument.

# Function to calculate difference in mean death rates by sex for random sample
# Will be passed into bootstrap function as argument
diff_avg_death_rate <- function (x_data) {
  avg_death_rates <- x_data |>
    group_by(sex) |>
    summarize(mean_death_rate = mean(death_crude_rate_suppressed_per_100k)) |>
    arrange(sex)
  
  # difference in mean death rate (Male - Female)
  diff <- (avg_death_rates$mean_death_rate[2] - 
           avg_death_rates$mean_death_rate[1])
  
  return(diff)
}

Next, we can run our second bootstrapping simulation.

diffs_avg_death_rates <- bootstrap(weekly_death_data, diff_avg_death_rate, n_iter = 100)

Now we can plot our simulated sampling distribution with a red line added to indicate our observed difference in mean death rates by sex from the original dataset.

f_sampling <- function(x) dnorm(x, mean = 0, 
                                sd = sd(diffs_avg_death_rates))

ggplot() +
  stat_function(mapping = aes(fill = 'More extreme samples'),
                fun = f_sampling, 
                xlim = c(abs(observed_diff_death_rate), 50),
                geom = "area") +
  stat_function(mapping = aes(fill = 'More extreme samples'),
                fun = f_sampling, 
                xlim = c(-50, -abs(observed_diff_death_rate)),
                geom = "area") +
  geom_function(xlim = c(-50, 50), 
                fun = f_sampling) +
  geom_vline(mapping = aes(xintercept = observed_diff_death_rate,
                           color = paste("Observed: ",
                                         round(observed_diff_death_rate, 2)))) +
  labs(title = "Bootstrapped Sampling Distribution of Mean Weekly Death Rate Differences",
       x = "Difference in Mean Weekly Death Rate",
       y = "Probability Density",
       color = "",
       fill = "") +
  scale_x_continuous(breaks = seq(-50, 50, 10)) +
  scale_fill_manual(values = 'lightblue') +
  theme_minimal()

The shaded areas in the plot above represent samples that would have resulted in a more extreme difference in means. This represents the p-value, which is the proportion of samples in which the difference in means would be more extreme than our observed difference.

# "demean" the bootstrapped samples to simulate mu = 0
diffs_avg_death_rates_d <- diffs_avg_death_rates - mean(diffs_avg_death_rates)

# proportion of times the difference is more extreme
paste("p-value =", 
      sum(abs(observed_diff_death_rate) < abs(diffs_avg_death_rates_d)) /
        length(diffs_avg_death_rates_d))
## [1] "p-value = 0.6"

This p-value is much greater than our alpha level (0.01) and indicates that it would be very likely to have selected another sample with a more extreme difference in means. This calculation of p-value would lead us to not reject the null hypothesis and instead assume the null hypothesis is true: that the average weekly death rate is equal for both sexes.


Conclusion

The two different approaches to hypothesis testing produced very different p-values (one very low, another very high) which led to opposite recommendations of whether to reject the null hypothesis.

My first reaction is to wonder if I’ve made a mistake. However, my second reaction is to believe that the simulated sampling through bootstrapping is perhaps a more valid approach. Both the weekly case rates and weekly death rates had wide variation in values, so the relatively small differences observed in their means for this particular dataset are not unexpected.

A traditional research approach would have likely utilized the t-test and determined that the null hypotheses should be rejected in favor of the alternative hypotheses.

However, a data science approach would be more likely to utilize bootstrapping to simulate the selection of numerous random samples and would recognize that the observed sample is not different enough to justify rejecting the null hypothesis.


This dataset will be explored further in subsequent data dives.