This data dive will focus on random sampling of a CDC dataset that contains weekly counts and rates of COVID-19 cases and deaths reported in Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, and Wisconsin) of the United States from March 7, 2020 through November 18, 2023.
To get started, let’s load tidyverse and ggthemes to assist with our data analysis.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## âś” dplyr 1.1.4 âś” readr 2.1.4
## âś” forcats 1.0.0 âś” stringr 1.5.1
## âś” ggplot2 3.4.4 âś” tibble 3.2.1
## âś” lubridate 1.9.3 âś” tidyr 1.3.0
## âś” purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::filter() masks stats::filter()
## âś– dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
Next, let’s read in the dataset from CSV.
covid <- read_delim("./COVID_weekly_cases_deaths_region5.csv", delim = ",")
## Rows: 37867 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): end_of_week, jurisdiction, age_group, sex, race_ethnicity_combined
## dbl (4): case_count_suppressed, death_count_suppressed, case_crude_rate_supp...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The data for “End of Week” will be converted from a character format to a date format to help with analyzing and visualizing the data properly.
covid$end_of_week <- as.Date(covid$end_of_week, format="%m/%d/%y")
Let’s rename some of the columns to make them more concise (for coding convenience).
covid <- covid |>
rename(race_ethnicity = race_ethnicity_combined,
case_count = case_count_suppressed,
death_count = death_count_suppressed,
case_rate = case_crude_rate_suppressed_per_100k,
death_rate = death_crude_rate_suppressed_per_100k
)
Finally, let’s add a new column to help order the age groups correctly, and then arrange all the rows in nested order by week, age, sex, and race/ethnicity.
# Add new column to order age groups correctly
covid <- covid |>
mutate(age_order = case_when(
age_group == "Overall" ~ 0L,
age_group == "0 - 4 Years" ~ 1L,
age_group == "5 - 11 Years" ~ 2L,
age_group == "12 - 15 Years" ~ 3L,
age_group == "16 - 17 Years" ~ 4L,
age_group == "18 - 29 Years" ~ 5L,
age_group == "30 - 39 Years" ~ 6L,
age_group == "40 - 49 Years" ~ 7L,
age_group == "50 - 64 Years" ~ 8L,
age_group == "65 - 74 Years" ~ 9L,
age_group == "75+ Years" ~ 10L),
.after = age_group) |>
arrange(end_of_week, age_order, sex, race_ethnicity)
Let’s save this version of the dataset to avoid having to duplicate these preparation steps in future data dives.
write_delim(covid, "./COVID_region5_prepped.csv", delim = ",")
The focus of this data dive is to generate and examine random samples of the dataset.
However, first we’ll want to filter out the rows that provide summary counts for the demographic subgroups. These rows with “Overall” counts summarize other sets of rows within the dataset. While these rows with “Overall” counts are useful for other purposes, they should be filtered out for our random sampling.
# Filter data to focus on demographic subgroups for each week by excluding summary rows ("Overall")
weekly_data <- covid |>
filter(age_group != "Overall" & sex != "Overall" & race_ethnicity != "Overall")
# Get count of number of rows in filtered dataset
nrow(weekly_data)
## [1] 18969
The filtered dataset contains 18,969 rows of data.
Each row represents the data for a distinct demographic subgroup from a particular week. Each demographic subgroup is a unique combination of age group, sex, and race/ethnicity group.
For example, a given row might include COVID-19 data for the week ending 03/07/2020 for Hispanic females ages 18-29 years.
Let’s view a random sample of 10 rows from this dataset to see some example data.
sample_n(weekly_data, 10)
## # A tibble: 10 Ă— 10
## end_of_week jurisdiction age_group age_order sex race_ethnicity case_count
## <date> <chr> <chr> <int> <chr> <chr> <dbl>
## 1 2022-04-23 Region 5 40 - 49 Y… 7 Fema… White, NH 3739
## 2 2020-06-13 Region 5 30 - 39 Y… 6 Fema… Hispanic 314
## 3 2021-10-16 Region 5 50 - 64 Y… 8 Fema… Hispanic 270
## 4 2023-01-21 Region 5 75+ Years 10 Male Hispanic 51
## 5 2020-12-19 Region 5 16 - 17 Y… 4 Fema… White, NH 1025
## 6 2023-09-09 Region 5 18 - 29 Y… 5 Fema… Black, NH 527
## 7 2022-01-22 Region 5 16 - 17 Y… 4 Fema… Black, NH 525
## 8 2023-09-09 Region 5 12 - 15 Y… 3 Fema… White, NH 297
## 9 2021-09-11 Region 5 16 - 17 Y… 4 Male Hispanic 163
## 10 2023-02-11 Region 5 5 - 11 Ye… 2 Male Hispanic 149
## # ℹ 3 more variables: death_count <dbl>, case_rate <dbl>, death_rate <dbl>
Now let’s generate 5 random samples of the dataset. Each sample will include 9500 rows, which is about half the number of rows in the complete dataset. The rows will be randomly sampled with replacement, meaning that a given row could be randomly selected more than once for the same sample.
# Create 5 random samples of 9500 rows (with replacement) from the dataset
random_df_1 <- sample_n(weekly_data, 9500, replace = TRUE)
random_df_2 <- sample_n(weekly_data, 9500, replace = TRUE)
random_df_3 <- sample_n(weekly_data, 9500, replace = TRUE)
random_df_4 <- sample_n(weekly_data, 9500, replace = TRUE)
random_df_5 <- sample_n(weekly_data, 9500, replace = TRUE)
Let’s compare the summary statistics for the complete dataset and the five samples.
# Summary statistics for complete dataset
weekly_data |>
summarise(mean_cases = round(mean(case_count, na.rm = TRUE), digits = 2),
mean_deaths = round(mean(death_count, na.rm = TRUE), digits = 2),
mean_case_rate = round(mean(case_rate, na.rm = TRUE), digits = 2),
mean_death_rate = round(mean(death_rate, na.rm = TRUE), digits = 2)
)
## # A tibble: 1 Ă— 4
## mean_cases mean_deaths mean_case_rate mean_death_rate
## <dbl> <dbl> <dbl> <dbl>
## 1 668. 61.8 128. 11.2
# Dataframe combining all the random samples
random_samples <- bind_rows(
mutate(random_df_1, sample_set = "Random Sample 1"),
mutate(random_df_2, sample_set = "Random Sample 2"),
mutate(random_df_3, sample_set = "Random Sample 3"),
mutate(random_df_4, sample_set = "Random Sample 4"),
mutate(random_df_5, sample_set = "Random Sample 5")
)
# Summary statistics grouped by sample set
random_samples |>
group_by(sample_set) |>
summarise(mean_cases = round(mean(case_count, na.rm = TRUE), digits = 2),
mean_deaths = round(mean(death_count, na.rm = TRUE), digits = 2),
mean_case_rate = round(mean(case_rate, na.rm = TRUE), digits = 2),
mean_death_rate = round(mean(death_rate, na.rm = TRUE), digits = 2)
)
## # A tibble: 5 Ă— 5
## sample_set mean_cases mean_deaths mean_case_rate mean_death_rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Random Sample 1 671. 60.7 129. 11.1
## 2 Random Sample 2 689. 61.8 131. 11.4
## 3 Random Sample 3 673. 60.0 128. 10.7
## 4 Random Sample 4 680. 67.7 129. 11.2
## 5 Random Sample 5 644. 64.5 126. 11.3
In general, the summary statistics for the random samples are similar to each other and similar to the complete dataset, though there is some minor variation among the samples.
To scrutinize these samples further, let’s examine how they compare when their data are grouped by the various demographic variables in the dataset: age group, sex at birth, and race/ethnicity group.
Let’s calculate the summary statistics by age group for the complete dataset.
# Complete Dataset - Age Groups - Summary Statistics
weekly_data |>
group_by(age_group) |>
summarise(age_order = mean(age_order),
mean_cases = mean(case_count, na.rm = TRUE),
mean_deaths = mean(death_count, na.rm = TRUE),
mean_case_rate = mean(case_rate, na.rm = TRUE),
mean_death_rate = mean(death_rate, na.rm = TRUE)
) |>
arrange(desc(age_order), .by_group = TRUE)
## # A tibble: 10 Ă— 6
## age_group age_order mean_cases mean_deaths mean_case_rate mean_death_rate
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 75+ Years 10 534. 111. 127. 22.9
## 2 65 - 74 Years 9 562. 49.5 115. 9.26
## 3 50 - 64 Years 8 1184. 38.1 123. 3.44
## 4 40 - 49 Years 7 849. 15.4 143. 2.05
## 5 30 - 39 Years 6 975. 10.6 156. 1.04
## 6 18 - 29 Years 5 1199. 8 149. 0.400
## 7 16 - 17 Years 4 193. NaN 147. NaN
## 8 12 - 15 Years 3 299. NaN 122. NaN
## 9 5 - 11 Years 2 416. NaN 106. NaN
## 10 0 - 4 Years 1 246. NaN 92.8 NaN
Let’s compare the random samples in terms of their summary statistics by age group.
# Random Sample 1 - Age Groups - Summary Statistics
random_df_1 |>
group_by(age_group) |>
summarise(age_order = mean(age_order),
mean_cases = mean(case_count, na.rm = TRUE),
mean_deaths = mean(death_count, na.rm = TRUE),
mean_case_rate = mean(case_rate, na.rm = TRUE),
mean_death_rate = mean(death_rate, na.rm = TRUE)
) |>
arrange(desc(age_order), .by_group = TRUE)
## # A tibble: 10 Ă— 6
## age_group age_order mean_cases mean_deaths mean_case_rate mean_death_rate
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 75+ Years 10 504. 106. 132. 23.9
## 2 65 - 74 Years 9 636. 53.9 109. 8.55
## 3 50 - 64 Years 8 1159. 39.0 123. 3.60
## 4 40 - 49 Years 7 892. 16.2 133. 1.91
## 5 30 - 39 Years 6 987. 11.7 155. 1.05
## 6 18 - 29 Years 5 1085. 7.6 154. 0.258
## 7 16 - 17 Years 4 189. NaN 143. NaN
## 8 12 - 15 Years 3 343. NaN 125. NaN
## 9 5 - 11 Years 2 417. NaN 108. NaN
## 10 0 - 4 Years 1 260. NaN 102. NaN
# Random Sample 2 - Age Groups - Summary Statistics
random_df_2 |>
group_by(age_group) |>
summarise(age_order = mean(age_order),
mean_cases = mean(case_count, na.rm = TRUE),
mean_deaths = mean(death_count, na.rm = TRUE),
mean_case_rate = mean(case_rate, na.rm = TRUE),
mean_death_rate = mean(death_rate, na.rm = TRUE)
) |>
arrange(desc(age_order), .by_group = TRUE)
## # A tibble: 10 Ă— 6
## age_group age_order mean_cases mean_deaths mean_case_rate mean_death_rate
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 75+ Years 10 549. 108. 128. 23.2
## 2 65 - 74 Years 9 631. 54.3 119. 9.30
## 3 50 - 64 Years 8 1219. 36.2 125. 3.29
## 4 40 - 49 Years 7 913. 15.6 151. 2.19
## 5 30 - 39 Years 6 947. 10.9 157. 1.03
## 6 18 - 29 Years 5 1230. 8 149. 0.552
## 7 16 - 17 Years 4 200. NaN 152. NaN
## 8 12 - 15 Years 3 280. NaN 116. NaN
## 9 5 - 11 Years 2 401. NaN 104. NaN
## 10 0 - 4 Years 1 253. NaN 101. NaN
# Random Sample 3 - Age Groups - Summary Statistics
random_df_3 |>
group_by(age_group) |>
summarise(age_order = mean(age_order),
mean_cases = mean(case_count, na.rm = TRUE),
mean_deaths = mean(death_count, na.rm = TRUE),
mean_case_rate = mean(case_rate, na.rm = TRUE),
mean_death_rate = mean(death_rate, na.rm = TRUE)
) |>
arrange(desc(age_order), .by_group = TRUE)
## # A tibble: 10 Ă— 6
## age_group age_order mean_cases mean_deaths mean_case_rate mean_death_rate
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 75+ Years 10 540. 110. 126. 22.2
## 2 65 - 74 Years 9 598. 45.3 116. 8.33
## 3 50 - 64 Years 8 1096. 40.1 128. 3.83
## 4 40 - 49 Years 7 941. 15.4 148. 1.92
## 5 30 - 39 Years 6 878. 10.5 151. 1.32
## 6 18 - 29 Years 5 1262. 7.3 150. 0.416
## 7 16 - 17 Years 4 196. NaN 143. NaN
## 8 12 - 15 Years 3 280. NaN 114. NaN
## 9 5 - 11 Years 2 418. NaN 106. NaN
## 10 0 - 4 Years 1 245. NaN 95.3 NaN
# Random Sample 4 - Age Groups - Summary Statistics
random_df_4 |>
group_by(age_group) |>
summarise(age_order = mean(age_order),
mean_cases = mean(case_count, na.rm = TRUE),
mean_deaths = mean(death_count, na.rm = TRUE),
mean_case_rate = mean(case_rate, na.rm = TRUE),
mean_death_rate = mean(death_rate, na.rm = TRUE)
) |>
arrange(desc(age_order), .by_group = TRUE)
## # A tibble: 10 Ă— 6
## age_group age_order mean_cases mean_deaths mean_case_rate mean_death_rate
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 75+ Years 10 530. 126. 125. 22.9
## 2 65 - 74 Years 9 510. 48.4 113. 9.29
## 3 50 - 64 Years 8 1114. 42.2 124. 3.57
## 4 40 - 49 Years 7 917. 16.7 144. 2.09
## 5 30 - 39 Years 6 1040. 9.93 163. 0.997
## 6 18 - 29 Years 5 1224. 8.08 153. 0.281
## 7 16 - 17 Years 4 212. NaN 145. NaN
## 8 12 - 15 Years 3 321. NaN 126. NaN
## 9 5 - 11 Years 2 459. NaN 108. NaN
## 10 0 - 4 Years 1 289 NaN 88.1 NaN
# Random Sample 5 - Age Groups - Summary Statistics
random_df_5 |>
group_by(age_group) |>
summarise(age_order = mean(age_order),
mean_cases = mean(case_count, na.rm = TRUE),
mean_deaths = mean(death_count, na.rm = TRUE),
mean_case_rate = mean(case_rate, na.rm = TRUE),
mean_death_rate = mean(death_rate, na.rm = TRUE)
) |>
arrange(desc(age_order), .by_group = TRUE)
## # A tibble: 10 Ă— 6
## age_group age_order mean_cases mean_deaths mean_case_rate mean_death_rate
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 75+ Years 10 538. 113. 128. 22.9
## 2 65 - 74 Years 9 589. 55.2 113. 8.75
## 3 50 - 64 Years 8 1065. 37.9 115. 3.54
## 4 40 - 49 Years 7 867. 16.4 144. 2.11
## 5 30 - 39 Years 6 933. 11.0 152. 1.12
## 6 18 - 29 Years 5 1140. 7.62 148. 0.37
## 7 16 - 17 Years 4 214. NaN 150. NaN
## 8 12 - 15 Years 3 336. NaN 119. NaN
## 9 5 - 11 Years 2 343. NaN 98.1 NaN
## 10 0 - 4 Years 1 244. NaN 93.3 NaN
In general, the mean statistics by age group for each random sample are similar to each other and similar to the complete dataset, though there is some variation among the samples. The same general patterns seen in the complete dataset are seen in the samples, though specific values might vary.
Let’s calculate the summary statistics by sex at birth for the complete dataset.
# Complete Dataset - Sex At Birth - Summary Statistics
weekly_data |>
group_by(sex) |>
summarise(mean_cases = mean(case_count, na.rm = TRUE),
mean_deaths = mean(death_count, na.rm = TRUE),
mean_case_rate = mean(case_rate, na.rm = TRUE),
mean_death_rate = mean(death_rate, na.rm = TRUE)
)
## # A tibble: 2 Ă— 5
## sex mean_cases mean_deaths mean_case_rate mean_death_rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Female 734. 62.1 137. 9.32
## 2 Male 602. 61.5 120. 12.8
Let’s compare the random samples in terms of their summary statistics by sex at birth.
# Random Sample 1 - Sex At Birth - Summary Statistics
random_df_1 |>
group_by(sex) |>
summarise(mean_cases = mean(case_count, na.rm = TRUE),
mean_deaths = mean(death_count, na.rm = TRUE),
mean_case_rate = mean(case_rate, na.rm = TRUE),
mean_death_rate = mean(death_rate, na.rm = TRUE)
)
## # A tibble: 2 Ă— 5
## sex mean_cases mean_deaths mean_case_rate mean_death_rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Female 736. 56.5 139. 9.62
## 2 Male 607. 64.1 119. 12.3
# Random Sample 2 - Sex At Birth - Summary Statistics
random_df_2 |>
group_by(sex) |>
summarise(mean_cases = mean(case_count, na.rm = TRUE),
mean_deaths = mean(death_count, na.rm = TRUE),
mean_case_rate = mean(case_rate, na.rm = TRUE),
mean_death_rate = mean(death_rate, na.rm = TRUE)
)
## # A tibble: 2 Ă— 5
## sex mean_cases mean_deaths mean_case_rate mean_death_rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Female 774. 66.5 140. 8.64
## 2 Male 608. 58.2 121. 13.5
# Random Sample 3 - Sex At Birth - Summary Statistics
random_df_3 |>
group_by(sex) |>
summarise(mean_cases = mean(case_count, na.rm = TRUE),
mean_deaths = mean(death_count, na.rm = TRUE),
mean_case_rate = mean(case_rate, na.rm = TRUE),
mean_death_rate = mean(death_rate, na.rm = TRUE)
)
## # A tibble: 2 Ă— 5
## sex mean_cases mean_deaths mean_case_rate mean_death_rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Female 736. 60.6 139. 8.64
## 2 Male 610. 59.6 118. 12.3
# Random Sample 4 - Sex At Birth - Summary Statistics
random_df_4 |>
group_by(sex) |>
summarise(mean_cases = mean(case_count, na.rm = TRUE),
mean_deaths = mean(death_count, na.rm = TRUE),
mean_case_rate = mean(case_rate, na.rm = TRUE),
mean_death_rate = mean(death_rate, na.rm = TRUE)
)
## # A tibble: 2 Ă— 5
## sex mean_cases mean_deaths mean_case_rate mean_death_rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Female 731. 60.8 139. 9.14
## 2 Male 626. 73.8 120. 13.1
# Random Sample 5 - Sex At Birth - Summary Statistics
random_df_5 |>
group_by(sex) |>
summarise(mean_cases = mean(case_count, na.rm = TRUE),
mean_deaths = mean(death_count, na.rm = TRUE),
mean_case_rate = mean(case_rate, na.rm = TRUE),
mean_death_rate = mean(death_rate, na.rm = TRUE)
)
## # A tibble: 2 Ă— 5
## sex mean_cases mean_deaths mean_case_rate mean_death_rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Female 698. 61.9 134. 9.60
## 2 Male 588. 66.7 118. 12.6
Again, in general, the mean statistics by sex at birth for each random sample are similar to each other and similar to the complete dataset, though there is some variation in the samples.
Let’s calculate the summary statistics by race/ethnicity group for the complete dataset.
# Complete Dataset - Race/Ethnicity Group - Summary Statistics
weekly_data |>
group_by(race_ethnicity) |>
summarise(mean_cases = mean(case_count, na.rm = TRUE),
mean_deaths = mean(death_count, na.rm = TRUE),
mean_case_rate = mean(case_rate, na.rm = TRUE),
mean_death_rate = mean(death_rate, na.rm = TRUE)
)
## # A tibble: 5 Ă— 5
## race_ethnicity mean_cases mean_deaths mean_case_rate mean_death_rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 AI/AN, NH 26.1 7.2 217. 88.0
## 2 Asian/PI, NH 102. 10.4 95.7 24.7
## 3 Black, NH 368. 24.8 115. 12.6
## 4 Hispanic 318. 16.2 137. 19.5
## 5 White, NH 2220. 106. 110. 5.72
Let’s compare the random samples in terms of their summary statistics by race/ethnicity group.
# Random Sample 1 - Race/Ethnicity - Summary Statistics
random_df_1 |>
group_by(race_ethnicity) |>
summarise(mean_cases = mean(case_count, na.rm = TRUE),
mean_deaths = mean(death_count, na.rm = TRUE),
mean_case_rate = mean(case_rate, na.rm = TRUE),
mean_death_rate = mean(death_rate, na.rm = TRUE)
)
## # A tibble: 5 Ă— 5
## race_ethnicity mean_cases mean_deaths mean_case_rate mean_death_rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 AI/AN, NH 26.6 6 220. 60.2
## 2 Asian/PI, NH 109. 9.58 96.3 22.5
## 3 Black, NH 371. 26.1 117. 13.5
## 4 Hispanic 305. 18.1 132. 22.2
## 5 White, NH 2234. 97.9 113. 5.24
# Random Sample 2 - Race/Ethnicity - Summary Statistics
random_df_2 |>
group_by(race_ethnicity) |>
summarise(mean_cases = mean(case_count, na.rm = TRUE),
mean_deaths = mean(death_count, na.rm = TRUE),
mean_case_rate = mean(case_rate, na.rm = TRUE),
mean_death_rate = mean(death_rate, na.rm = TRUE)
)
## # A tibble: 5 Ă— 5
## race_ethnicity mean_cases mean_deaths mean_case_rate mean_death_rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 AI/AN, NH 26.7 NaN 224. NaN
## 2 Asian/PI, NH 104. 9.70 95.3 23.7
## 3 Black, NH 366. 24.1 116. 12.1
## 4 Hispanic 331. 17.2 142. 22.9
## 5 White, NH 2278. 105. 112. 5.66
# Random Sample 3 - Race/Ethnicity - Summary Statistics
random_df_3 |>
group_by(race_ethnicity) |>
summarise(mean_cases = mean(case_count, na.rm = TRUE),
mean_deaths = mean(death_count, na.rm = TRUE),
mean_case_rate = mean(case_rate, na.rm = TRUE),
mean_death_rate = mean(death_rate, na.rm = TRUE)
)
## # A tibble: 5 Ă— 5
## race_ethnicity mean_cases mean_deaths mean_case_rate mean_death_rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 AI/AN, NH 26.6 NaN 214. NaN
## 2 Asian/PI, NH 103. 10.1 98.2 23.6
## 3 Black, NH 384. 22.7 115. 11.2
## 4 Hispanic 315. 16.1 136. 19.3
## 5 White, NH 2174. 105. 108. 5.71
# Random Sample 4 - Race/Ethnicity - Summary Statistics
random_df_4 |>
group_by(race_ethnicity) |>
summarise(mean_cases = mean(case_count, na.rm = TRUE),
mean_deaths = mean(death_count, na.rm = TRUE),
mean_case_rate = mean(case_rate, na.rm = TRUE),
mean_death_rate = mean(death_rate, na.rm = TRUE)
)
## # A tibble: 5 Ă— 5
## race_ethnicity mean_cases mean_deaths mean_case_rate mean_death_rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 AI/AN, NH 25.8 7.5 212. 70.8
## 2 Asian/PI, NH 109. 10.2 98.2 24.6
## 3 Black, NH 371. 21.3 115. 9.62
## 4 Hispanic 308. 16.2 135. 20.7
## 5 White, NH 2259. 123. 117. 6.85
# Random Sample 5 - Race/Ethnicity - Summary Statistics
random_df_5 |>
group_by(race_ethnicity) |>
summarise(mean_cases = mean(case_count, na.rm = TRUE),
mean_deaths = mean(death_count, na.rm = TRUE),
mean_case_rate = mean(case_rate, na.rm = TRUE),
mean_death_rate = mean(death_rate, na.rm = TRUE)
)
## # A tibble: 5 Ă— 5
## race_ethnicity mean_cases mean_deaths mean_case_rate mean_death_rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 AI/AN, NH 25.5 7 211. 47.5
## 2 Asian/PI, NH 99.9 9.83 92.2 24.1
## 3 Black, NH 364. 24.8 115. 12.7
## 4 Hispanic 301. 15.9 130. 19.0
## 5 White, NH 2140. 112. 110. 6.14
Once again, the mean statistics by race/ethnicity group for the random samples are similar to the complete dataset and similar to each other, though there is minor variation among the samples.
Having just examined a small number of very large random samples, let’s try the opposite approach: examining a very large number of (relatively) small random samples.
We’ll create a Monte Carlo simulation to select numerous random samples, and then use that sampling distribution to predict a statistic for our population.
We’ll treat the complete dataset as our “population” and generate numerous small random samples from that population. For our statistic, let’s focus on the mean weekly case rate, which represents the average number of COVID-19 cases per 100,000 population.
mean(weekly_data$case_rate, na.rm = TRUE)
## [1] 128.4466
The mean weekly case rate for the population dataset was approximately 128 cases per 100,000 population.
Now let’s create the Monte Carlo simulation. First, let’s define a
function that will randomly generate m
number of samples of
size n
from the the population dataset.
# Function to simulate m random samples of size n from population dataset
sim_sampling <- function(n, m) {
sim <- function (x) mean(sample_n(weekly_data, n, replace = TRUE)$case_rate, na.rm = TRUE)
x <- map_dbl(1:m, sim)
df <- data.frame(
iter = 1:m,
sample_mean = x
)
return(df)
}
Now let’s use that function to generate 10,000 random samples of size 50, and then plot a distribution of the mean case rates from those samples.
# Calculate population mean to display on plot for comparison
pop_mean <- mean(weekly_data$case_rate, na.rm = TRUE)
# Set parameters for Monte Carlo simulation
n <- 50 # sample size
m <- 10000 # number of samples
# Run Monte Carlo simulation and plot results
mcsim_df <- sim_sampling(n, m)
mcsim_df |>
ggplot() +
geom_histogram(mapping = aes(x = sample_mean), binwidth = 2, fill = "#3182bd", colour = "white") +
geom_vline(xintercept = pop_mean, color = "orange") +
annotate("text", x = 155, y = 475, label = "Population Mean", color = 'orange') +
scale_y_continuous(breaks = c(0, 100, 200, 300, 400, 500, 600)) +
scale_x_continuous(breaks = c(0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200)) +
labs(title = paste(m, "Samples of Size", n, "from Dataset"),
x = "Sample Mean",
y = "Number of Samples") +
theme_hc()
The sampling distribution from our Monte Carlo simulation appears to approximate a normal distribution with its peak similar close to the value of the population mean.
The Central Limit Theorem states that the sampling distribution of a sufficiently large number of independent samples will be approximately normal, regardless of the the population’s actual distribution shape. Furthermore, the mean of the sample means will equal the population mean.
In fact, the calculations below show that the mean of the sample means is very close to the true population mean.
cat("Mean of Sample Means =", mean(mcsim_df$sample_mean), "\n")
## Mean of Sample Means = 128.3196
cat("Actual Population Mean =", pop_mean)
## Actual Population Mean = 128.4466
This Monte Carlo simulation helps demonstrate that a sufficiently large number of relatively small independent random samples of a population can be used to accurately predict the theoretical expected value of a statistic for the population.
This dataset will be explored further in subsequent data dives.