The CDC dataset explored in this data dive contains weekly counts and rates of COVID-19 cases and deaths reported in Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, and Wisconsin) of the United States from March 7, 2020 through November 18, 2023.
To get started, let’s load tidyverse to assist with our data analysis.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Next, let’s read in the dataset from CSV.
covid <- read_delim("./COVID_weekly_cases_deaths_region5.csv", delim = ",")
## Rows: 37867 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): end_of_week, jurisdiction, age_group, sex, race_ethnicity_combined
## dbl (4): case_count_suppressed, death_count_suppressed, case_crude_rate_supp...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The data for “End of Week” will be converted from a character format to a date format to help with analyzing and visualizing the data properly.
covid$end_of_week <- as.Date(covid$end_of_week, format="%m/%d/%y")
Let’s rename some of the columns to make them more concise (for coding convenience).
covid <- covid |>
rename(race_ethnicity = race_ethnicity_combined,
case_count = case_count_suppressed,
death_count = death_count_suppressed,
case_rate = case_crude_rate_suppressed_per_100k,
death_rate = death_crude_rate_suppressed_per_100k
)
Finally, let’s add a new column to help order the age groups correctly, and then arrange all the rows in nested order by week, age, sex, and race/ethnicity.
# Add new column to order age groups correctly
covid <- covid |>
mutate(age_order = case_when(
age_group == "Overall" ~ 0L,
age_group == "0 - 4 Years" ~ 1L,
age_group == "5 - 11 Years" ~ 2L,
age_group == "12 - 15 Years" ~ 3L,
age_group == "16 - 17 Years" ~ 4L,
age_group == "18 - 29 Years" ~ 5L,
age_group == "30 - 39 Years" ~ 6L,
age_group == "40 - 49 Years" ~ 7L,
age_group == "50 - 64 Years" ~ 8L,
age_group == "65 - 74 Years" ~ 9L,
age_group == "75+ Years" ~ 10L),
.after = age_group) |>
arrange(end_of_week, age_order, sex, race_ethnicity)
Let’s create a dataframe for the overall totals for each week.
# Overall Totals for Each Week
weekly_totals_overall <- covid |>
filter(age_group == "Overall" & sex == "Overall" & race_ethnicity == "Overall")
# Display total counts for cases and deaths & mean rates for cases and deaths
weekly_totals_overall |>
summarise(cases = sum(case_count),
deaths = sum(death_count),
mean_case_rate = mean(case_rate),
mean_death_rate = mean(death_rate)
)
## # A tibble: 1 × 4
## cases deaths mean_case_rate mean_death_rate
## <dbl> <dbl> <dbl> <dbl>
## 1 16656806 182153 163. 1.79
The summary above shows the overall number of cases and deaths, as well as the overall mean weekly rates for cases and deaths.
The weekly rates are calculated from the weekly counts. The rate indicates the number of cases (or deaths) per 100,000 population.
These overall mean rates can serve as a reference point for identifying demographic subgroups that have a mean rate higher or lower than the overall mean rate.
Let’s create a dataframe of the weekly data by age group.
# Weekly Totals by Age Group
age_group_totals <- covid |>
filter(race_ethnicity == "Overall" & sex == "Overall" & age_group != "Overall")
# Display total counts for cases and deaths & mean rates for cases and deaths
age_group_totals |>
group_by(age_group) |>
summarise(age_order = mean(age_order),
cases = sum(case_count, na.rm = TRUE),
deaths = sum(death_count, na.rm = TRUE),
mean_case_rate = mean(case_rate, na.rm = TRUE),
mean_death_rate = mean(death_rate, na.rm = TRUE)
) |>
arrange(desc(age_order), .by_group = TRUE)
## # A tibble: 10 × 6
## age_group age_order cases deaths mean_case_rate mean_death_rate
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 75+ Years 10 1170147 104685 162. 14.6
## 2 65 - 74 Years 9 1337530 39601 133. 4.04
## 3 50 - 64 Years 8 3188286 28509 158. 1.53
## 4 40 - 49 Years 7 2299271 5597 188. 0.671
## 5 30 - 39 Years 6 2644463 2171 202. 0.353
## 6 18 - 29 Years 5 3273449 666 199. 0.154
## 7 16 - 17 Years 4 423796 0 162. NaN
## 8 12 - 15 Years 3 703991 0 135. NaN
## 9 5 - 11 Years 2 1005827 0 115. NaN
## 10 0 - 4 Years 1 588130 7 98.2 0.23
The summary above shows the total number of cases and deaths by age group, as well as their mean weekly rates for cases and deaths.
Let’s create a boxplot to compare the distributions of weekly case rates by age group.
# fct_reorder() reorders factor levels by sorting by another variable
# in this case, age_group will be sorted by age_order
age_group_totals |>
filter(!is.na(case_rate)) |>
ggplot() +
geom_boxplot(mapping = aes(x = case_rate, y = fct_reorder(age_group, age_order), fill = age_group), show.legend = FALSE) +
labs(title = "COVID-19 Weekly Case Rates by Age Group (Mar 2020 - Nov 2023)",
subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
x = "Weekly Case Rate", y = "Age Group") +
theme_minimal() +
theme(plot.subtitle = element_text(colour = "darkgray"))
The boxplot of weekly case rates shows that the median case rates were lower for age groups less than 18 years old.
Now let’s examine a boxplot of the weekly death rates by age group.
age_group_totals |>
filter(!is.na(death_rate)) |>
ggplot() +
geom_boxplot(mapping = aes(x = death_rate, y = fct_reorder(age_group, age_order), fill = age_group), show.legend = FALSE) +
labs(title = "COVID-19 Weekly Death Rates by Age Group (Mar 2020 - Nov 2023)",
subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
x = "Weekly Death Rate", y = "Age Group") +
theme_minimal() +
theme(plot.subtitle = element_text(colour = "darkgray"))
The boxplot of weekly death rates shows that the median death rates were relatively low for age groups less than 50 years old, but then start increasing with age:
Note: There were no deaths listed for any week in the dataset for ages 5-11, ages 12-15, or ages 16-17 (which is why these groups are not included in the boxplot above).
Let’s create a dataframe of the weekly data by sex at birth.
# Weekly Totals by Sex At Birth
sex_group_totals <- covid |>
filter(age_group == "Overall" & race_ethnicity == "Overall" & sex != "Overall")
# Display total counts for cases and deaths & mean rates for cases and deaths
sex_group_totals |>
group_by(sex) |>
summarise(cases = sum(case_count, na.rm = TRUE),
deaths = sum(death_count, na.rm = TRUE),
mean_case_rate = mean(case_rate, na.rm = TRUE),
mean_death_rate = mean(death_rate, na.rm = TRUE)
) |>
arrange(desc(deaths), .by_group = TRUE)
## # A tibble: 2 × 5
## sex cases deaths mean_case_rate mean_death_rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Male 7508422 97517 149. 1.95
## 2 Female 8950447 84475 173. 1.64
The summary above shows the total number of cases and deaths by sex at birth, as well as their mean weekly rates for cases and deaths.
Let’s create a boxplot to compare the distributions of weekly case rates by sex at birth.
sex_group_totals |>
filter(!is.na(case_rate)) |>
ggplot() +
geom_boxplot(mapping = aes(x = case_rate, y = sex, fill = sex), show.legend = FALSE) +
labs(title = "COVID-19 Weekly Case Rates by Sex (Mar 2020 - Nov 2023)",
subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
x = "Weekly Case Rate", y = "Sex At Birth") +
theme_minimal() +
theme(plot.subtitle = element_text(colour = "darkgray"))
The boxplot of weekly case rates shows that females tended to be more likely to get COVID-19:
Now let’s examine a boxplot of the weekly death rates by sex at birth.
sex_group_totals |>
filter(!is.na(death_rate)) |>
ggplot() +
geom_boxplot(mapping = aes(x = death_rate, y = sex, fill = sex), show.legend = FALSE) +
labs(title = "COVID-19 Weekly Death Rates by Sex (Mar 2020 - Nov 2023)",
subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
x = "Weekly Death Rate", y = "Sex At Birth") +
theme_minimal() +
theme(plot.subtitle = element_text(colour = "darkgray"))
The boxplot of weekly death rates shows a reversal of the pattern seen in weekly case rates. Males tended to be more likely to die from COVID-19, despite having a lower median case rate:
Let’s create a dataframe of the weekly data by race/ethnicity group.
# Weekly Totals by Race/Ethnicity Group
race_eth_group_totals <- covid |>
filter(age_group == "Overall" & sex == "Overall" & race_ethnicity != "Overall")
# Display total counts for cases and deaths & mean rates for cases and deaths
race_eth_group_totals |>
group_by(race_ethnicity) |>
summarise(cases = sum(case_count, na.rm = TRUE),
deaths = sum(death_count, na.rm = TRUE),
mean_case_rate = mean(case_rate, na.rm = TRUE),
mean_death_rate = mean(death_rate, na.rm = TRUE)
) |>
arrange(desc(deaths), .by_group = TRUE)
## # A tibble: 5 × 5
## race_ethnicity cases deaths mean_case_rate mean_death_rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 White, NH 8609844 134173 114. 1.78
## 2 Black, NH 1409467 20868 120. 1.95
## 3 Hispanic 1215287 8805 141. 1.43
## 4 Asian/PI, NH 366237 2580 94.3 1.18
## 5 AI/AN, NH 64169 479 146. 5.53
The summary above shows the total number of cases and deaths by race/ethnicity group, as well as their mean weekly rates for cases and deaths.
Let’s create a boxplot to compare the distributions of weekly case rates by race/ethnicity group.
race_eth_group_totals |>
filter(!is.na(case_rate)) |>
ggplot() +
geom_boxplot(mapping = aes(x = case_rate, y = race_ethnicity, fill = race_ethnicity), show.legend = FALSE) +
labs(title = "COVID-19 Weekly Case Rates by Race/Ethnicity (Mar 2020 - Nov 2023)",
subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
x = "Weekly Case Rate", y = "Race/Ethnicity") +
theme_minimal() +
theme(plot.subtitle = element_text(colour = "darkgray"))
The boxplot of weekly case rates shows differences by race/ethnicity group:
Now let’s examine a boxplot of the weekly death rates by race/ethnicity group.
race_eth_group_totals |>
filter(!is.na(death_rate)) |>
ggplot() +
geom_boxplot(mapping = aes(x = death_rate, y = race_ethnicity, fill = race_ethnicity), show.legend = FALSE) +
labs(title = "COVID-19 Weekly Death Rates by Race/Ethnicity (Mar 2020 - Nov 2023)",
subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
x = "Weekly Death Rate", y = "Race/Ethnicity") +
theme_minimal() +
theme(plot.subtitle = element_text(colour = "darkgray"))
The boxplot makes it obvious how much higher the median weekly death rate was for the American Indian/Alaska Native group. The entire distribution of the weekly death rates for this group is shifted higher in value, compared to all the other groups.
This boxplot of weekly death rates also shows a reversal of some of the differences seen among the weekly case rates by race/ethnicity group:
Based on the patterns seen in the data above, we could hypothesize which combinations of demographic groups would be expected to be more likely (or less likely) to have had cases of COVID-19 or to have died from COVID-19.
Persons predicted to be less likely to have had cases of COVID-19:
Persons predicted to be more likely to have had cases of COVID-19:
Persons predicted to be less likely to have died from COVID-19:
Persons predicted to be more likely to have died from COVID-19:
Let’s create a dataframe with two separate columns to flag these specific combinations of demographics for cases and deaths:
A column named flag_cases to indicate demographic
groups predicted to have had a lower incidence of cases (flag =
CL) or a higher incidence of cases (flag =
CH)
A column named flag_deaths to indicate demographic
groups predicted to have had a lower incidence of deaths (flag =
DL) or a higher incidence of deaths (flag =
DH)
# Define age groups for predictions
cases_younger <- c("0 - 4 Years", "5 - 11 Years", "12 - 15 Years")
cases_older <- c("16 - 17 Years", "18 - 29 Years", "30 - 39 Years", "40 - 49 Years", "50 - 64 Years", "65 - 74 Years", "75+ Years")
deaths_younger <- c("0 - 4 Years", "5 - 11 Years", "12 - 15 Years", "16 - 17 Years", "18 - 29 Years", "30 - 39 Years", "40 - 49 Years", "50 - 64 Years")
deaths_older <- c("65 - 74 Years", "75+ Years")
# Create dataframe with new column for flag_cases: CL = Cases Lower, CH = Cases Higher
covid_predictions <- covid |>
mutate(flag_cases = case_when(
age_group %in% cases_younger & sex == "Male"
& race_ethnicity %in% c("Asian/PI, NH", "Black, NH", "White, NH") ~ "CL",
age_group %in% cases_older & sex == "Female"
& race_ethnicity %in% c("AI/AN, NH", "Hispanic") ~ "CH"),
.after = race_ethnicity
)
# Add new column for flag_deaths: DL = Deaths Lower, DH = Deaths Higher
covid_predictions <- covid_predictions |>
mutate(flag_deaths = case_when(
age_group %in% deaths_younger & sex == "Female"
& race_ethnicity %in% c("Asian/PI, NH", "Hispanic") ~ "DL",
age_group %in% deaths_older & sex == "Male"
& race_ethnicity %in% c("AI/AN, NH", "Black, NH", "White, NH") ~ "DH"),
.after = flag_cases
)
# Replace numeric NA data with 0
covid_predictions <- covid_predictions %>% mutate_if(is.numeric, replace_na, replace = 0)
Let’s calculate the mean weekly rates for the demographic groups
predicted to have had a higher incidence of cases (flag =
CH) or a lower incidence of cases (flag =
CL).
# Display total counts for cases and deaths & mean rates for cases and deaths
covid_predictions |>
filter(!is.na(flag_cases)) |>
group_by(flag_cases) |>
summarise(mean_case_rate = mean(case_rate)
)
## # A tibble: 2 × 2
## flag_cases mean_case_rate
## <chr> <dbl>
## 1 CH 165.
## 2 CL 82.9
The group flagged as CH had a mean weekly case rate 2
times higher than the group flagged as CL.
Let’s create a boxplot to compare the distributions of weekly case
rates by flag_case group.
covid_predictions |>
filter(flag_cases %in% c("CL", "CH")) |>
ggplot() +
geom_boxplot(mapping = aes(x = case_rate, y = flag_cases, fill = flag_cases), show.legend = FALSE) +
labs(title = "COVID-19 Weekly Case Rates (Mar 2020 - Nov 2023)",
subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
x = "Weekly Case Rate", y = "Demographic Flag") +
theme_minimal() +
theme(plot.subtitle = element_text(colour = "darkgray"))
Let’s create a plot comparing the weekly case rates for the two
demographic groups predicted to be either more likely (CH)
or less likely (CL) to have had COVID-19.
library(ggthemes)
covid_predictions |>
filter(flag_cases %in% c("CL", "CH")) |>
group_by(end_of_week, flag_cases) |>
ggplot() +
geom_line(mapping = aes(x = end_of_week, y = case_rate, group = end_of_week, color = flag_cases)) +
theme_hc() +
labs(title = "COVID-19 Weekly Case Rates (Mar 2020 - Nov 2023)",
subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
x = "", y = "Case Rate", color = "") +
theme(plot.subtitle = element_text(colour = "darkgray")) +
theme(legend.position = "bottom")
Now let’s calculate the mean weekly rates for the demographic groups
predicted to have had a higher incidence of deaths (flag =
DH) or a lower incidence of deaths (flag =
DL).
covid_predictions |>
filter(!is.na(flag_deaths)) |>
group_by(flag_deaths) |>
summarise(mean_death_rate = mean(death_rate)
)
## # A tibble: 2 × 2
## flag_deaths mean_death_rate
## <chr> <dbl>
## 1 DH 8.07
## 2 DL 0.0802
The group flagged as DH had a mean weekly death rate
~100 times higher than of the group flagged as DL.
Let’s create a boxplot to compare the distributions of weekly death
rates by flag_death group.
covid_predictions |>
filter(flag_deaths %in% c("DL", "DH")) |>
ggplot() +
geom_boxplot(mapping = aes(x = death_rate, y = flag_deaths, fill = flag_deaths), show.legend = FALSE) +
labs(title = "COVID-19 Weekly Death Rates (Mar 2020 - Nov 2023)",
subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
x = "Weekly Death Rate", y = "Demographic Flag") +
theme_minimal() +
theme(plot.subtitle = element_text(colour = "darkgray"))
Now let’s create a plot comparing the weekly death rates for the two demographic groups that were predicted to be more likely or less likely to have died from COVID-19.
covid_predictions |>
filter(flag_deaths %in% c("DL", "DH")) |>
group_by(end_of_week, flag_deaths) |>
ggplot() +
geom_line(mapping = aes(x = end_of_week, y = death_rate, group = end_of_week, color = flag_deaths)) +
theme_hc() +
labs(title = "COVID-19 Weekly Death Rates (Mar 2020 - Nov 2023)",
subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
x = "", y = "Death Rate", color = "") +
theme(plot.subtitle = element_text(colour = "darkgray")) +
theme(legend.position = "bottom")
The visualizations are consistent with the predictions from the hypotheses, but further investigation will be needed to determine if the differences among the demographic groups are statistically significant.
Furthermore, it is not clear why these patterns occur - what physical or social factors might be causing higher or lower incidences of cases and deaths among these groups?
This dataset will be explored further in subsequent data dives.