COVID-19 Weekly Cases and Deaths by Age, Race/Ethnicity, and Sex (Mar 2020 - Nov 2023)

This data dive will focus on random sampling of a CDC dataset that contains weekly counts and rates of COVID-19 cases and deaths reported in Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, and Wisconsin) of the United States from March 7, 2020 through November 18, 2023.

In addition to overall weekly counts and rates, the weekly data is broken out by age group, race/ethnicity group, and sex at birth.
The CDC suppressed (excluded) any weekly counts with five or fewer cases or deaths, in order to protect the confidentiality of patients.
Additional background can be found in the Week 2 Data Dive for this dataset or the CDC documentation for the full dataset (all US regions).

Prepare Dataset

To get started, let’s load tidyverse and ggthemes to assist with our data analysis.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)

Next, let’s read in the dataset from CSV.

covid <- read_delim("./COVID_weekly_cases_deaths_region5.csv", delim = ",")

## Rows: 37867 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): end_of_week, jurisdiction, age_group, sex, race_ethnicity_combined
## dbl (4): case_count_suppressed, death_count_suppressed, case_crude_rate_supp...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The data for “End of Week” will be converted from a character format to a date format to help with analyzing and visualizing the data properly.

covid$end_of_week <- as.Date(covid$end_of_week, format="%m/%d/%y")

Let’s rename some of the columns to make them more concise (for coding convenience).

covid <- covid |>
  rename(race_ethnicity = race_ethnicity_combined,
         case_count = case_count_suppressed,
         death_count = death_count_suppressed,
         case_rate = case_crude_rate_suppressed_per_100k,
         death_rate = death_crude_rate_suppressed_per_100k
  )

Finally, let’s add a new column to help order the age groups correctly, and then arrange all the rows in nested order by week, age, sex, and race/ethnicity.

# Add new column to order age groups correctly
covid <- covid |>
  mutate(age_order = case_when(
    age_group == "Overall" ~ 0L,
    age_group == "0 - 4 Years" ~ 1L,
    age_group == "5 - 11 Years" ~ 2L,
    age_group == "12 - 15 Years" ~ 3L,
    age_group == "16 - 17 Years" ~ 4L,
    age_group == "18 - 29 Years" ~ 5L,
    age_group == "30 - 39 Years" ~ 6L,
    age_group == "40 - 49 Years" ~ 7L,
    age_group == "50 - 64 Years" ~ 8L,
    age_group == "65 - 74 Years" ~ 9L,
    age_group == "75+ Years" ~ 10L),
    .after = age_group) |>
  arrange(end_of_week, age_order, sex, race_ethnicity)

Let’s save this version of the dataset to avoid having to duplicate these preparation steps in future data dives.

write_delim(covid, "./COVID_region5_prepped.csv", delim = ",")

Random Sampling

The focus of this data dive is to generate and examine random samples of the dataset.

However, first we’ll want to filter out the rows that provide summary counts for the demographic subgroups. These rows with “Overall” counts summarize other sets of rows within the dataset. While these rows with “Overall” counts are useful for other purposes, they should be filtered out for our random sampling.

# Filter data to focus on demographic subgroups for each week by excluding summary rows ("Overall") 
weekly_data <- covid |>
  filter(age_group != "Overall" & sex != "Overall" & race_ethnicity != "Overall")

# Get count of number of rows in filtered dataset
nrow(weekly_data)

## [1] 18969

The filtered dataset contains 18,969 rows of data.

Each row represents the data for a distinct demographic subgroup from a particular week. Each demographic subgroup is a unique combination of age group, sex, and race/ethnicity group.

For example, a given row might include COVID-19 data for the week ending 03/07/2020 for Hispanic females ages 18-29 years.

Let’s view a random sample of 10 rows from this dataset to see some example data.

sample_n(weekly_data, 10)

## # A tibble: 10 × 10
##    end_of_week jurisdiction age_group  age_order sex   race_ethnicity case_count
##    <date>      <chr>        <chr>          <int> <chr> <chr>               <dbl>
##  1 2022-04-23  Region 5     40 - 49 Y…         7 Fema… White, NH            3739
##  2 2020-06-13  Region 5     30 - 39 Y…         6 Fema… Hispanic              314
##  3 2021-10-16  Region 5     50 - 64 Y…         8 Fema… Hispanic              270
##  4 2023-01-21  Region 5     75+ Years         10 Male  Hispanic               51
##  5 2020-12-19  Region 5     16 - 17 Y…         4 Fema… White, NH            1025
##  6 2023-09-09  Region 5     18 - 29 Y…         5 Fema… Black, NH             527
##  7 2022-01-22  Region 5     16 - 17 Y…         4 Fema… Black, NH             525
##  8 2023-09-09  Region 5     12 - 15 Y…         3 Fema… White, NH             297
##  9 2021-09-11  Region 5     16 - 17 Y…         4 Male  Hispanic              163
## 10 2023-02-11  Region 5     5 - 11 Ye…         2 Male  Hispanic              149
## # ℹ 3 more variables: death_count <dbl>, case_rate <dbl>, death_rate <dbl>

Generate Random Samples

Now let’s generate 5 random samples of the dataset. Each sample will include 9500 rows, which is about half the number of rows in the complete dataset. The rows will be randomly sampled with replacement, meaning that a given row could be randomly selected more than once for the same sample.

# Create 5 random samples of 9500 rows (with replacement) from the dataset
random_df_1 <- sample_n(weekly_data, 9500, replace = TRUE)
random_df_2 <- sample_n(weekly_data, 9500, replace = TRUE)
random_df_3 <- sample_n(weekly_data, 9500, replace = TRUE)
random_df_4 <- sample_n(weekly_data, 9500, replace = TRUE)
random_df_5 <- sample_n(weekly_data, 9500, replace = TRUE)

Compare Summary Statistics for Random Samples

Let’s compare the summary statistics for the complete dataset and the five samples.

# Summary statistics for complete dataset
weekly_data |>
  summarise(mean_cases = round(mean(case_count, na.rm = TRUE), digits = 2),
            mean_deaths = round(mean(death_count, na.rm = TRUE), digits = 2),
            mean_case_rate = round(mean(case_rate, na.rm = TRUE), digits = 2),
            mean_death_rate = round(mean(death_rate, na.rm = TRUE), digits = 2)
  )

## # A tibble: 1 × 4
##   mean_cases mean_deaths mean_case_rate mean_death_rate
##        <dbl>       <dbl>          <dbl>           <dbl>
## 1       668.        61.8           128.            11.2

# Dataframe combining all the random samples
random_samples <- bind_rows(
  mutate(random_df_1, sample_set = "Random Sample 1"),
  mutate(random_df_2, sample_set = "Random Sample 2"),
  mutate(random_df_3, sample_set = "Random Sample 3"),
  mutate(random_df_4, sample_set = "Random Sample 4"),
  mutate(random_df_5, sample_set = "Random Sample 5")
  )

# Summary statistics grouped by sample set
random_samples |>
  group_by(sample_set) |>
  summarise(mean_cases = round(mean(case_count, na.rm = TRUE), digits = 2),
            mean_deaths = round(mean(death_count, na.rm = TRUE), digits = 2),
            mean_case_rate = round(mean(case_rate, na.rm = TRUE), digits = 2),
            mean_death_rate = round(mean(death_rate, na.rm = TRUE), digits = 2)
  )

## # A tibble: 5 × 5
##   sample_set      mean_cases mean_deaths mean_case_rate mean_death_rate
##   <chr>                <dbl>       <dbl>          <dbl>           <dbl>
## 1 Random Sample 1       671.        60.7           129.            11.1
## 2 Random Sample 2       689.        61.8           131.            11.4
## 3 Random Sample 3       673.        60.0           128.            10.7
## 4 Random Sample 4       680.        67.7           129.            11.2
## 5 Random Sample 5       644.        64.5           126.            11.3

In general, the summary statistics for the random samples are similar to each other and similar to the complete dataset, though there is some minor variation among the samples.

To scrutinize these samples further, let’s examine how they compare when their data are grouped by the various demographic variables in the dataset: age group, sex at birth, and race/ethnicity group.

Comparing Samples by Age Group

Complete Dataset - Summary by Age Groups

Let’s calculate the summary statistics by age group for the complete dataset.

# Complete Dataset - Age Groups - Summary Statistics
weekly_data |>
  group_by(age_group) |>
  summarise(age_order = mean(age_order),
            mean_cases = mean(case_count, na.rm = TRUE),
            mean_deaths = mean(death_count, na.rm = TRUE),
            mean_case_rate = mean(case_rate, na.rm = TRUE),
            mean_death_rate = mean(death_rate, na.rm = TRUE)
  ) |>
  arrange(desc(age_order), .by_group = TRUE)

## # A tibble: 10 × 6
##    age_group     age_order mean_cases mean_deaths mean_case_rate mean_death_rate
##    <chr>             <dbl>      <dbl>       <dbl>          <dbl>           <dbl>
##  1 75+ Years            10       534.       111.           127.           22.9  
##  2 65 - 74 Years         9       562.        49.5          115.            9.26 
##  3 50 - 64 Years         8      1184.        38.1          123.            3.44 
##  4 40 - 49 Years         7       849.        15.4          143.            2.05 
##  5 30 - 39 Years         6       975.        10.6          156.            1.04 
##  6 18 - 29 Years         5      1199.         8            149.            0.400
##  7 16 - 17 Years         4       193.       NaN            147.          NaN    
##  8 12 - 15 Years         3       299.       NaN            122.          NaN    
##  9 5 - 11 Years          2       416.       NaN            106.          NaN    
## 10 0 - 4 Years           1       246.       NaN             92.8         NaN

Let’s compare the random samples in terms of their summary statistics by age group.

Random Sample 1 - Summary by Age Groups

# Random Sample 1 - Age Groups - Summary Statistics
random_df_1 |>
  group_by(age_group) |>
  summarise(age_order = mean(age_order),
            mean_cases = mean(case_count, na.rm = TRUE),
            mean_deaths = mean(death_count, na.rm = TRUE),
            mean_case_rate = mean(case_rate, na.rm = TRUE),
            mean_death_rate = mean(death_rate, na.rm = TRUE)
  ) |>
  arrange(desc(age_order), .by_group = TRUE)

## # A tibble: 10 × 6
##    age_group     age_order mean_cases mean_deaths mean_case_rate mean_death_rate
##    <chr>             <dbl>      <dbl>       <dbl>          <dbl>           <dbl>
##  1 75+ Years            10       504.       106.            132.          23.9  
##  2 65 - 74 Years         9       636.        53.9           109.           8.55 
##  3 50 - 64 Years         8      1159.        39.0           123.           3.60 
##  4 40 - 49 Years         7       892.        16.2           133.           1.91 
##  5 30 - 39 Years         6       987.        11.7           155.           1.05 
##  6 18 - 29 Years         5      1085.         7.6           154.           0.258
##  7 16 - 17 Years         4       189.       NaN             143.         NaN    
##  8 12 - 15 Years         3       343.       NaN             125.         NaN    
##  9 5 - 11 Years          2       417.       NaN             108.         NaN    
## 10 0 - 4 Years           1       260.       NaN             102.         NaN

Random Sample 2 - Summary by Age Groups

# Random Sample 2 - Age Groups - Summary Statistics
random_df_2 |>
  group_by(age_group) |>
  summarise(age_order = mean(age_order),
            mean_cases = mean(case_count, na.rm = TRUE),
            mean_deaths = mean(death_count, na.rm = TRUE),
            mean_case_rate = mean(case_rate, na.rm = TRUE),
            mean_death_rate = mean(death_rate, na.rm = TRUE)
  ) |>
  arrange(desc(age_order), .by_group = TRUE)

## # A tibble: 10 × 6
##    age_group     age_order mean_cases mean_deaths mean_case_rate mean_death_rate
##    <chr>             <dbl>      <dbl>       <dbl>          <dbl>           <dbl>
##  1 75+ Years            10       549.       108.            128.          23.2  
##  2 65 - 74 Years         9       631.        54.3           119.           9.30 
##  3 50 - 64 Years         8      1219.        36.2           125.           3.29 
##  4 40 - 49 Years         7       913.        15.6           151.           2.19 
##  5 30 - 39 Years         6       947.        10.9           157.           1.03 
##  6 18 - 29 Years         5      1230.         8             149.           0.552
##  7 16 - 17 Years         4       200.       NaN             152.         NaN    
##  8 12 - 15 Years         3       280.       NaN             116.         NaN    
##  9 5 - 11 Years          2       401.       NaN             104.         NaN    
## 10 0 - 4 Years           1       253.       NaN             101.         NaN

Random Sample 3 - Summary by Age Groups

# Random Sample 3 - Age Groups - Summary Statistics
random_df_3 |>
  group_by(age_group) |>
  summarise(age_order = mean(age_order),
            mean_cases = mean(case_count, na.rm = TRUE),
            mean_deaths = mean(death_count, na.rm = TRUE),
            mean_case_rate = mean(case_rate, na.rm = TRUE),
            mean_death_rate = mean(death_rate, na.rm = TRUE)
  ) |>
  arrange(desc(age_order), .by_group = TRUE)

## # A tibble: 10 × 6
##    age_group     age_order mean_cases mean_deaths mean_case_rate mean_death_rate
##    <chr>             <dbl>      <dbl>       <dbl>          <dbl>           <dbl>
##  1 75+ Years            10       540.       110.           126.           22.2  
##  2 65 - 74 Years         9       598.        45.3          116.            8.33 
##  3 50 - 64 Years         8      1096.        40.1          128.            3.83 
##  4 40 - 49 Years         7       941.        15.4          148.            1.92 
##  5 30 - 39 Years         6       878.        10.5          151.            1.32 
##  6 18 - 29 Years         5      1262.         7.3          150.            0.416
##  7 16 - 17 Years         4       196.       NaN            143.          NaN    
##  8 12 - 15 Years         3       280.       NaN            114.          NaN    
##  9 5 - 11 Years          2       418.       NaN            106.          NaN    
## 10 0 - 4 Years           1       245.       NaN             95.3         NaN

Random Sample 4 - Summary by Age Groups

# Random Sample 4 - Age Groups - Summary Statistics
random_df_4 |>
  group_by(age_group) |>
  summarise(age_order = mean(age_order),
            mean_cases = mean(case_count, na.rm = TRUE),
            mean_deaths = mean(death_count, na.rm = TRUE),
            mean_case_rate = mean(case_rate, na.rm = TRUE),
            mean_death_rate = mean(death_rate, na.rm = TRUE)
  ) |>
  arrange(desc(age_order), .by_group = TRUE)

## # A tibble: 10 × 6
##    age_group     age_order mean_cases mean_deaths mean_case_rate mean_death_rate
##    <chr>             <dbl>      <dbl>       <dbl>          <dbl>           <dbl>
##  1 75+ Years            10       530.      126.            125.           22.9  
##  2 65 - 74 Years         9       510.       48.4           113.            9.29 
##  3 50 - 64 Years         8      1114.       42.2           124.            3.57 
##  4 40 - 49 Years         7       917.       16.7           144.            2.09 
##  5 30 - 39 Years         6      1040.        9.93          163.            0.997
##  6 18 - 29 Years         5      1224.        8.08          153.            0.281
##  7 16 - 17 Years         4       212.      NaN             145.          NaN    
##  8 12 - 15 Years         3       321.      NaN             126.          NaN    
##  9 5 - 11 Years          2       459.      NaN             108.          NaN    
## 10 0 - 4 Years           1       289       NaN              88.1         NaN

Random Sample 5 - Summary by Age Groups

# Random Sample 5 - Age Groups - Summary Statistics
random_df_5 |>
  group_by(age_group) |>
  summarise(age_order = mean(age_order),
            mean_cases = mean(case_count, na.rm = TRUE),
            mean_deaths = mean(death_count, na.rm = TRUE),
            mean_case_rate = mean(case_rate, na.rm = TRUE),
            mean_death_rate = mean(death_rate, na.rm = TRUE)
  ) |>
  arrange(desc(age_order), .by_group = TRUE)

## # A tibble: 10 × 6
##    age_group     age_order mean_cases mean_deaths mean_case_rate mean_death_rate
##    <chr>             <dbl>      <dbl>       <dbl>          <dbl>           <dbl>
##  1 75+ Years            10       538.      113.            128.            22.9 
##  2 65 - 74 Years         9       589.       55.2           113.             8.75
##  3 50 - 64 Years         8      1065.       37.9           115.             3.54
##  4 40 - 49 Years         7       867.       16.4           144.             2.11
##  5 30 - 39 Years         6       933.       11.0           152.             1.12
##  6 18 - 29 Years         5      1140.        7.62          148.             0.37
##  7 16 - 17 Years         4       214.      NaN             150.           NaN   
##  8 12 - 15 Years         3       336.      NaN             119.           NaN   
##  9 5 - 11 Years          2       343.      NaN              98.1          NaN   
## 10 0 - 4 Years           1       244.      NaN              93.3          NaN

In general, the mean statistics by age group for each random sample are similar to each other and similar to the complete dataset, though there is some variation among the samples. The same general patterns seen in the complete dataset are seen in the samples, though specific values might vary.

Comparing Samples by Sex At Birth

Complete Dataset - Summary by Sex At Birth

Let’s calculate the summary statistics by sex at birth for the complete dataset.

# Complete Dataset - Sex At Birth - Summary Statistics
weekly_data |>
  group_by(sex) |>
  summarise(mean_cases = mean(case_count, na.rm = TRUE),
            mean_deaths = mean(death_count, na.rm = TRUE),
            mean_case_rate = mean(case_rate, na.rm = TRUE),
            mean_death_rate = mean(death_rate, na.rm = TRUE)
  )

## # A tibble: 2 × 5
##   sex    mean_cases mean_deaths mean_case_rate mean_death_rate
##   <chr>       <dbl>       <dbl>          <dbl>           <dbl>
## 1 Female       734.        62.1           137.            9.32
## 2 Male         602.        61.5           120.           12.8

Let’s compare the random samples in terms of their summary statistics by sex at birth.

Random Sample 1 - Summary by Sex At Birth

# Random Sample 1 - Sex At Birth - Summary Statistics
random_df_1 |>
  group_by(sex) |>
  summarise(mean_cases = mean(case_count, na.rm = TRUE),
            mean_deaths = mean(death_count, na.rm = TRUE),
            mean_case_rate = mean(case_rate, na.rm = TRUE),
            mean_death_rate = mean(death_rate, na.rm = TRUE)
  )

## # A tibble: 2 × 5
##   sex    mean_cases mean_deaths mean_case_rate mean_death_rate
##   <chr>       <dbl>       <dbl>          <dbl>           <dbl>
## 1 Female       736.        56.5           139.            9.62
## 2 Male         607.        64.1           119.           12.3

Random Sample 2 - Summary by Sex At Birth

# Random Sample 2 - Sex At Birth - Summary Statistics
random_df_2 |>
  group_by(sex) |>
  summarise(mean_cases = mean(case_count, na.rm = TRUE),
            mean_deaths = mean(death_count, na.rm = TRUE),
            mean_case_rate = mean(case_rate, na.rm = TRUE),
            mean_death_rate = mean(death_rate, na.rm = TRUE)
  )

## # A tibble: 2 × 5
##   sex    mean_cases mean_deaths mean_case_rate mean_death_rate
##   <chr>       <dbl>       <dbl>          <dbl>           <dbl>
## 1 Female       774.        66.5           140.            8.64
## 2 Male         608.        58.2           121.           13.5

Random Sample 3 - Summary by Sex At Birth

# Random Sample 3 - Sex At Birth - Summary Statistics
random_df_3 |>
  group_by(sex) |>
  summarise(mean_cases = mean(case_count, na.rm = TRUE),
            mean_deaths = mean(death_count, na.rm = TRUE),
            mean_case_rate = mean(case_rate, na.rm = TRUE),
            mean_death_rate = mean(death_rate, na.rm = TRUE)
  )

## # A tibble: 2 × 5
##   sex    mean_cases mean_deaths mean_case_rate mean_death_rate
##   <chr>       <dbl>       <dbl>          <dbl>           <dbl>
## 1 Female       736.        60.6           139.            8.64
## 2 Male         610.        59.6           118.           12.3

Random Sample 4 - Summary by Sex At Birth

# Random Sample 4 - Sex At Birth - Summary Statistics
random_df_4 |>
  group_by(sex) |>
  summarise(mean_cases = mean(case_count, na.rm = TRUE),
            mean_deaths = mean(death_count, na.rm = TRUE),
            mean_case_rate = mean(case_rate, na.rm = TRUE),
            mean_death_rate = mean(death_rate, na.rm = TRUE)
  )

## # A tibble: 2 × 5
##   sex    mean_cases mean_deaths mean_case_rate mean_death_rate
##   <chr>       <dbl>       <dbl>          <dbl>           <dbl>
## 1 Female       731.        60.8           139.            9.14
## 2 Male         626.        73.8           120.           13.1

Random Sample 5 - Summary by Sex At Birth

# Random Sample 5 - Sex At Birth - Summary Statistics
random_df_5 |>
  group_by(sex) |>
  summarise(mean_cases = mean(case_count, na.rm = TRUE),
            mean_deaths = mean(death_count, na.rm = TRUE),
            mean_case_rate = mean(case_rate, na.rm = TRUE),
            mean_death_rate = mean(death_rate, na.rm = TRUE)
  )

## # A tibble: 2 × 5
##   sex    mean_cases mean_deaths mean_case_rate mean_death_rate
##   <chr>       <dbl>       <dbl>          <dbl>           <dbl>
## 1 Female       698.        61.9           134.            9.60
## 2 Male         588.        66.7           118.           12.6

Again, in general, the mean statistics by sex at birth for each random sample are similar to each other and similar to the complete dataset, though there is some variation in the samples.

Comparing Samples by Race/Ethnicity

Complete Dataset - Summary by Race/Ethnicity

Let’s calculate the summary statistics by race/ethnicity group for the complete dataset.

# Complete Dataset - Race/Ethnicity Group - Summary Statistics
weekly_data |>
  group_by(race_ethnicity) |>
  summarise(mean_cases = mean(case_count, na.rm = TRUE),
            mean_deaths = mean(death_count, na.rm = TRUE),
            mean_case_rate = mean(case_rate, na.rm = TRUE),
            mean_death_rate = mean(death_rate, na.rm = TRUE)
  )

## # A tibble: 5 × 5
##   race_ethnicity mean_cases mean_deaths mean_case_rate mean_death_rate
##   <chr>               <dbl>       <dbl>          <dbl>           <dbl>
## 1 AI/AN, NH            26.1         7.2          217.            88.0 
## 2 Asian/PI, NH        102.         10.4           95.7           24.7 
## 3 Black, NH           368.         24.8          115.            12.6 
## 4 Hispanic            318.         16.2          137.            19.5 
## 5 White, NH          2220.        106.           110.             5.72

Let’s compare the random samples in terms of their summary statistics by race/ethnicity group.

Random Sample 1 - Summary by Race/Ethnicity

# Random Sample 1 - Race/Ethnicity - Summary Statistics
random_df_1 |>
  group_by(race_ethnicity) |>
  summarise(mean_cases = mean(case_count, na.rm = TRUE),
            mean_deaths = mean(death_count, na.rm = TRUE),
            mean_case_rate = mean(case_rate, na.rm = TRUE),
            mean_death_rate = mean(death_rate, na.rm = TRUE)
  )

## # A tibble: 5 × 5
##   race_ethnicity mean_cases mean_deaths mean_case_rate mean_death_rate
##   <chr>               <dbl>       <dbl>          <dbl>           <dbl>
## 1 AI/AN, NH            26.6        6             220.            60.2 
## 2 Asian/PI, NH        109.         9.58           96.3           22.5 
## 3 Black, NH           371.        26.1           117.            13.5 
## 4 Hispanic            305.        18.1           132.            22.2 
## 5 White, NH          2234.        97.9           113.             5.24

Random Sample 2 - Summary by Race/Ethnicity

# Random Sample 2 - Race/Ethnicity - Summary Statistics
random_df_2 |>
  group_by(race_ethnicity) |>
  summarise(mean_cases = mean(case_count, na.rm = TRUE),
            mean_deaths = mean(death_count, na.rm = TRUE),
            mean_case_rate = mean(case_rate, na.rm = TRUE),
            mean_death_rate = mean(death_rate, na.rm = TRUE)
  )

## # A tibble: 5 × 5
##   race_ethnicity mean_cases mean_deaths mean_case_rate mean_death_rate
##   <chr>               <dbl>       <dbl>          <dbl>           <dbl>
## 1 AI/AN, NH            26.7      NaN             224.           NaN   
## 2 Asian/PI, NH        104.         9.70           95.3           23.7 
## 3 Black, NH           366.        24.1           116.            12.1 
## 4 Hispanic            331.        17.2           142.            22.9 
## 5 White, NH          2278.       105.            112.             5.66

Random Sample 3 - Summary by Race/Ethnicity

# Random Sample 3 - Race/Ethnicity - Summary Statistics
random_df_3 |>
  group_by(race_ethnicity) |>
  summarise(mean_cases = mean(case_count, na.rm = TRUE),
            mean_deaths = mean(death_count, na.rm = TRUE),
            mean_case_rate = mean(case_rate, na.rm = TRUE),
            mean_death_rate = mean(death_rate, na.rm = TRUE)
  )

## # A tibble: 5 × 5
##   race_ethnicity mean_cases mean_deaths mean_case_rate mean_death_rate
##   <chr>               <dbl>       <dbl>          <dbl>           <dbl>
## 1 AI/AN, NH            26.6       NaN            214.           NaN   
## 2 Asian/PI, NH        103.         10.1           98.2           23.6 
## 3 Black, NH           384.         22.7          115.            11.2 
## 4 Hispanic            315.         16.1          136.            19.3 
## 5 White, NH          2174.        105.           108.             5.71

Random Sample 4 - Summary by Race/Ethnicity

# Random Sample 4 - Race/Ethnicity - Summary Statistics
random_df_4 |>
  group_by(race_ethnicity) |>
  summarise(mean_cases = mean(case_count, na.rm = TRUE),
            mean_deaths = mean(death_count, na.rm = TRUE),
            mean_case_rate = mean(case_rate, na.rm = TRUE),
            mean_death_rate = mean(death_rate, na.rm = TRUE)
  )

## # A tibble: 5 × 5
##   race_ethnicity mean_cases mean_deaths mean_case_rate mean_death_rate
##   <chr>               <dbl>       <dbl>          <dbl>           <dbl>
## 1 AI/AN, NH            25.8         7.5          212.            70.8 
## 2 Asian/PI, NH        109.         10.2           98.2           24.6 
## 3 Black, NH           371.         21.3          115.             9.62
## 4 Hispanic            308.         16.2          135.            20.7 
## 5 White, NH          2259.        123.           117.             6.85

Random Sample 5 - Summary by Race/Ethnicity

# Random Sample 5 - Race/Ethnicity - Summary Statistics
random_df_5 |>
  group_by(race_ethnicity) |>
  summarise(mean_cases = mean(case_count, na.rm = TRUE),
            mean_deaths = mean(death_count, na.rm = TRUE),
            mean_case_rate = mean(case_rate, na.rm = TRUE),
            mean_death_rate = mean(death_rate, na.rm = TRUE)
  )

## # A tibble: 5 × 5
##   race_ethnicity mean_cases mean_deaths mean_case_rate mean_death_rate
##   <chr>               <dbl>       <dbl>          <dbl>           <dbl>
## 1 AI/AN, NH            25.5        7             211.            47.5 
## 2 Asian/PI, NH         99.9        9.83           92.2           24.1 
## 3 Black, NH           364.        24.8           115.            12.7 
## 4 Hispanic            301.        15.9           130.            19.0 
## 5 White, NH          2140.       112.            110.             6.14

Once again, the mean statistics by race/ethnicity group for the random samples are similar to the complete dataset and similar to each other, though there is minor variation among the samples.

Monte Carlo Simulation and Central Limit Theorem

Having just examined a small number of very large random samples, let’s try the opposite approach: examining a very large number of (relatively) small random samples.

We’ll create a Monte Carlo simulation to select numerous random samples, and then use that sampling distribution to predict a statistic for our population.

We’ll treat the complete dataset as our “population” and generate numerous small random samples from that population. For our statistic, let’s focus on the mean weekly case rate, which represents the average number of COVID-19 cases per 100,000 population.

mean(weekly_data$case_rate, na.rm = TRUE)

## [1] 128.4466

The mean weekly case rate for the population dataset was approximately 128 cases per 100,000 population.

Now let’s create the Monte Carlo simulation. First, let’s define a function that will randomly generate m number of samples of size n from the the population dataset.

# Function to simulate m random samples of size n from population dataset
sim_sampling <- function(n, m) {

  sim <- function (x) mean(sample_n(weekly_data, n, replace = TRUE)$case_rate, na.rm = TRUE)
  
  x <- map_dbl(1:m, sim)
  
  df <- data.frame(
    iter = 1:m,
    sample_mean = x
  )
  return(df)
}

Now let’s use that function to generate 10,000 random samples of size 50, and then plot a distribution of the mean case rates from those samples.

# Calculate population mean to display on plot for comparison
pop_mean <- mean(weekly_data$case_rate, na.rm = TRUE)

# Set parameters for Monte Carlo simulation
n <- 50     # sample size
m <- 10000  # number of samples

# Run Monte Carlo simulation and plot results
mcsim_df <- sim_sampling(n, m)

mcsim_df |>
  ggplot() +
  geom_histogram(mapping = aes(x = sample_mean), binwidth = 2, fill = "#3182bd", colour = "white") +
  geom_vline(xintercept = pop_mean, color = "orange") + 
  annotate("text", x = 155, y = 475, label = "Population Mean", color = 'orange') +
  scale_y_continuous(breaks = c(0, 100, 200, 300, 400, 500, 600)) +
  scale_x_continuous(breaks = c(0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200)) +
  labs(title = paste(m, "Samples of Size", n, "from Dataset"),
       x = "Sample Mean",
       y = "Number of Samples") +
  theme_hc()

The sampling distribution from our Monte Carlo simulation appears to approximate a normal distribution with its peak similar close to the value of the population mean.

The Central Limit Theorem states that the sampling distribution of a sufficiently large number of independent samples will be approximately normal, regardless of the the population’s actual distribution shape. Furthermore, the mean of the sample means will equal the population mean.

In fact, the calculations below show that the mean of the sample means is very close to the true population mean.

cat("Mean of Sample Means =", mean(mcsim_df$sample_mean), "\n")

## Mean of Sample Means = 128.3196

cat("Actual Population Mean =", pop_mean)

## Actual Population Mean = 128.4466

This Monte Carlo simulation helps demonstrate that a sufficiently large number of relatively small independent random samples of a population can be used to accurately predict the theoretical expected value of a statistic for the population.

This dataset will be explored further in subsequent data dives.

H510 Week 4 Data Dive

Michael Frontz

2024-02-05