The dataset explored in this data dive covers COVID-19 cases and deaths reported weekly to the Centers for Disease Control and Prevention (CDC), from the week ending March 7, 2020 through the week ending November 18, 2023.
The CDC summarized weekly reports of cases and deaths (representing individual patients) into weekly counts and rates (i.e., number per 100,000 population) categorized by region, age group, race/ethnicity group, and sex at birth. Weekly counts with five or fewer cases or deaths were not included (i.e., were suppressed) in the dataset by the CDC, in order to protect the confidentiality of patients.
The original archived dataset was downloaded from the CDC Data Catalog and contains over 400,000 rows of data for the entire United States (including US territories), which are categorized into 10 regions by the Department of Health and Human Services (HHS).
⚠️ For the purposes of this data dive, the dataset has been reduced to include only data for Region 5, which consists of the states of Illinois, Indiana, Michigan, Minnesota, Ohio, and Wisconsin. This results in a reduced dataset of about 38,000 rows.
First, load tidyverse, which is a collection of R packages designed for data science.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Next, read in the dataset, which is saved as a CSV file.
covid <- read_delim("./COVID_weekly_cases_deaths_region5.csv", delim = ",")
## Rows: 37867 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): end_of_week, jurisdiction, age_group, sex, race_ethnicity_combined
## dbl (4): case_count_suppressed, death_count_suppressed, case_crude_rate_supp...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The dataset contains 37,867 rows and 9 columns: 5 columns of text data (chr) and 4 columns of numeric data (dbl).
Let’s view the full list of column names.
colnames(covid)
## [1] "end_of_week"
## [2] "jurisdiction"
## [3] "age_group"
## [4] "sex"
## [5] "race_ethnicity_combined"
## [6] "case_count_suppressed"
## [7] "death_count_suppressed"
## [8] "case_crude_rate_suppressed_per_100k"
## [9] "death_crude_rate_suppressed_per_100k"
The data for “End of Week” will be converted from a character format to a date format, which will help with certain analyses and visualizations.
class(covid$end_of_week)
## [1] "character"
covid$end_of_week <- as.Date(covid$end_of_week, format="%m/%d/%y")
class(covid$end_of_week)
## [1] "Date"
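Because as.Date() ignores any trailing characters in its input, a format string that does not match the file can produce plausible but wrong dates instead of an error. The following is a minimal sketch of that pitfall and a quick sanity check. The sample strings are hypothetical, since the exact layout of end_of_week in the CSV is not shown here.

```r
# Hypothetical sample strings; the real file's date layout may differ.
two_digit  <- c("03/07/20", "11/18/23")
four_digit <- c("03/07/2020", "11/18/2023")

# "%y" expects a two-digit year, so these parse as intended.
parsed_ok <- as.Date(two_digit, format = "%m/%d/%y")

# With four-digit years, "%y" reads only the first two digits ("20")
# and ignores the rest, so every date lands in 2020.
parsed_wrong <- as.Date(four_digit, format = "%m/%d/%y")

# "%Y" is the matching code for four-digit years.
parsed_fixed <- as.Date(four_digit, format = "%m/%d/%Y")

# Checks worth running after any conversion:
# no failed parses, and a range that matches the expected period.
stopifnot(!anyNA(parsed_ok))
range(parsed_ok)
range(parsed_wrong)
```

On the real column, range(covid$end_of_week) should span March 7, 2020 to November 18, 2023. A range that stops in 2020 would suggest the year code needs to be "%Y".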
The COVID-19 dataset for Region 5 (IL, IN, MI, MN, OH, WI) provides weekly counts of cases and deaths, as well as weekly rates of cases and deaths.
The data include overall counts and rates for each week, as well as weekly breakdowns by age group, sex at birth, and race/ethnicity group, allowing for a detailed demographic view of how COVID-19 affected the population from March 2020 through November 18, 2023.
⚠️ The CDC included individuals with unknown or missing data for age, sex, or race/ethnicity in the overall weekly case and death counts, but those individuals are not represented in one or more demographic breakout groups (depending on which demographic data was unknown for that individual).
# Filter data to focus on Overall totals for each week
weekly_totals_overall <- covid |>
  filter(age_group == "Overall" & sex == "Overall" & race_ethnicity_combined == "Overall")
# Summary of Weekly COVID Cases in Region 5 (March 7, 2020 - Nov 18, 2023)
summary(weekly_totals_overall$case_count_suppressed)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1086 26862 57134 85860 105618 573847
The overall weekly case counts are widely spread, and the mean (~86K cases per week) is well above the median (~57K cases per week). This indicates a right-skewed distribution, in which a relatively small number of weeks with very high case counts pull the mean upward.
The weekly case rates are calculated from the weekly case counts. The rate indicates the number of cases per 100,000 population. The overall rates are calculated for the total population.
# Summary of Weekly COVID Case Rates in Region 5 (March 7, 2020 - Nov 18, 2023)
summary(weekly_totals_overall$case_crude_rate_suppressed_per_100k)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.07 51.13 108.74 163.41 201.01 1092.17
When rates are calculated for a subgroup (such as Hispanic females age 30-39 years), the rate is calculated based on the population of that specific subgroup. Rates allow for an “apples to apples” comparison among subgroups, despite their differing population sizes.
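The rate calculation can be sketched with the overall figures reported above. The function name crude_rate_per_100k is introduced here for illustration and is not part of the dataset or the CDC's tooling. The sketch also assumes the maximum weekly case count and the maximum weekly case rate come from the same week, which the summaries alone do not confirm.

```r
# Crude rate per 100,000 = count / population * 100,000
# (illustrative helper; not a function from the dataset or any package)
crude_rate_per_100k <- function(count, population) {
  count / population * 1e5
}

# Maximum values from the two summaries above
# (assumed here to belong to the same week)
peak_cases <- 573847
peak_rate  <- 1092.17

# Back out the population implied by the published count and rate
implied_population <- peak_cases / peak_rate * 1e5
round(implied_population / 1e6, 1)  # in millions

# Round trip: the implied population reproduces the published rate
round(crude_rate_per_100k(peak_cases, implied_population), 2)
```

Under that assumption, the implied denominator is roughly 52.5 million people, which is the scale expected for the combined population of the six Region 5 states.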
# Summary of Weekly COVID Deaths in Region 5 (March 7, 2020 - Nov 18, 2023)
summary(weekly_totals_overall$death_count_suppressed)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.0 225.0 445.0 938.9 1308.0 4863.0
The overall weekly death counts are also widely spread, and the mean (~939 deaths per week) is more than double the median (445 deaths per week), again indicating a right-skewed distribution.
The weekly death rates are calculated from the weekly death counts. The rate indicates the number of deaths per 100,000 population. The overall rates are calculated for the total population.
# Summary of Weekly COVID Death Rates in Region 5 (March 7, 2020 - Nov 18, 2023)
summary(weekly_totals_overall$death_crude_rate_suppressed_per_100k)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.020 0.430 0.850 1.787 2.490 9.260
# Total cases and deaths by Age Group across all weeks (March 7, 2020 - Nov 18, 2023)
covid |>
  filter(race_ethnicity_combined == "Overall" & sex == "Overall") |>
  group_by(age_group) |>
  summarise(
    case_count = sum(case_count_suppressed, na.rm = TRUE),
    death_count = sum(death_count_suppressed, na.rm = TRUE)
  ) |>
  arrange(desc(death_count))
## # A tibble: 11 × 3
## age_group case_count death_count
## <chr> <dbl> <dbl>
## 1 Overall 16656806 182153
## 2 75+ Years 1170147 104685
## 3 65 - 74 Years 1337530 39601
## 4 50 - 64 Years 3188286 28509
## 5 40 - 49 Years 2299271 5597
## 6 30 - 39 Years 2644463 2171
## 7 18 - 29 Years 3273449 666
## 8 0 - 4 Years 588130 7
## 9 12 - 15 Years 703991 0
## 10 16 - 17 Years 423796 0
## 11 5 - 11 Years 1005827 0
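The summary data above is sorted by death count (highest to lowest) and shows deaths rising steeply with age. For individuals under 18 years of age, the table shows only 7 deaths from COVID-19 in Region 5 (IL, IN, MI, MN, OH, WI). However, weekly counts of five or fewer were suppressed by the CDC and are dropped from these sums by na.rm = TRUE, so this figure (like the zeros for ages 5 through 17) is a lower bound on reported deaths rather than a complete total.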
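To gauge how much suppression could matter, it can help to count the suppressed weeks alongside each sum. The sketch below uses a small made-up data frame rather than the CDC data, and assumes every NA comes from suppression (a count of five or fewer). The column names reported_deaths, suppressed_weeks, and max_possible_deaths are illustrative.

```r
library(dplyr)

# Hypothetical weekly death counts; NA marks a suppressed cell (five or fewer).
toy <- data.frame(
  age_group = c("0 - 4 Years", "0 - 4 Years", "0 - 4 Years",
                "75+ Years", "75+ Years", "75+ Years"),
  death_count_suppressed = c(NA, 7, NA, 310, 295, NA)
)

suppression_summary <- toy |>
  group_by(age_group) |>
  summarise(
    # What sum(..., na.rm = TRUE) reports: suppressed weeks count as zero
    reported_deaths = sum(death_count_suppressed, na.rm = TRUE),
    # How many weekly cells were suppressed
    suppressed_weeks = sum(is.na(death_count_suppressed)),
    # Upper bound if every suppressed week hid the maximum of 5 deaths
    max_possible_deaths = reported_deaths + 5 * suppressed_weeks
  )

suppression_summary
```

Applied to the real data, the same summarise() call would show, for each age group, how many weeks were suppressed and how far the true total could lie above the reported sum.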
# Total cases and deaths by Sex across all weeks (March 7, 2020 - Nov 18, 2023)
covid |>
  filter(age_group == "Overall" & race_ethnicity_combined == "Overall") |>
  group_by(sex) |>
  summarise(
    case_count = sum(case_count_suppressed, na.rm = TRUE),
    death_count = sum(death_count_suppressed, na.rm = TRUE)
  ) |>
  arrange(desc(death_count))
## # A tibble: 3 × 3
## sex case_count death_count
## <chr> <dbl> <dbl>
## 1 Overall 16656806 182153
## 2 Male 7508422 97517
## 3 Female 8950447 84475
The summary data above shows that more females had cases of COVID-19, but more males actually died from COVID-19.
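Dividing deaths by cases makes this contrast concrete. The sketch below uses the totals from the table above. The column name deaths_per_100_cases is introduced for illustration, and the result is a crude ratio of reported counts (it does not account for suppressed weeks or unreported cases), not an adjusted fatality rate.

```r
# Totals copied from the summary table above
totals_by_sex <- data.frame(
  sex         = c("Male", "Female"),
  case_count  = c(7508422, 8950447),
  death_count = c(97517, 84475)
)

# Crude deaths per 100 reported cases (illustrative name, not a dataset column)
totals_by_sex$deaths_per_100_cases <-
  round(totals_by_sex$death_count / totals_by_sex$case_count * 100, 2)

totals_by_sex
```

By this crude measure, males had roughly 1.3 deaths per 100 reported cases, compared with roughly 0.9 for females.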
Although data for a patient’s Ethnicity and Race are typically collected as separate demographic questions, the CDC typically combines these fields for data analysis purposes by categorizing individuals as belonging to a particular Race/Ethnicity group.
If an individual identifies as Hispanic/Latino for ethnicity, the CDC categorizes that individual’s Race/Ethnicity group as Hispanic/Latino, regardless of which race(s) that individual identifies with.
The Hispanic group therefore includes individuals who identify as Hispanic/Latino and White; individuals who identify as Hispanic/Latino and Black/African American; individuals who identify as Hispanic/Latino, Black/African American, and White; etc.
The CDC then defines the other groups as individuals who are Non-Hispanic and identify with a single race designation, such as:
American Indian/Alaska Native, Non-Hispanic (AI/AN, NH)
Asian/Pacific Islander, Non-Hispanic (Asian/PI, NH)
Black/African American, Non-Hispanic (Black, NH)
White, Non-Hispanic (White, NH)
⚠️ If an individual identifies as Non-Hispanic and more than one race, the CDC will typically categorize the individual as Multiracial for data analysis purposes. However, this dataset does not include Multiracial as a breakout group within the Race/Ethnicity data. Instead, the CDC has included counts for individuals who identify as Non-Hispanic and multiple races as part of the overall weekly counts.
As a reminder, individuals with unknown or missing data for race and ethnicity are likewise included in the overall weekly counts but are not represented as a breakout group for Race/Ethnicity.
Given that background, here are the counts for total cases and deaths by race/ethnicity.
# Total cases and deaths by Race/Ethnicity across all weeks (March 7, 2020 - Nov 18, 2023)
covid |>
  filter(age_group == "Overall" & sex == "Overall") |>
  group_by(race_ethnicity_combined) |>
  summarise(
    case_count = sum(case_count_suppressed, na.rm = TRUE),
    death_count = sum(death_count_suppressed, na.rm = TRUE)
  ) |>
  arrange(desc(death_count))
## # A tibble: 6 × 3
## race_ethnicity_combined case_count death_count
## <chr> <dbl> <dbl>
## 1 Overall 16656806 182153
## 2 White, NH 8609844 134173
## 3 Black, NH 1409467 20868
## 4 Hispanic 1215287 8805
## 5 Asian/PI, NH 366237 2580
## 6 AI/AN, NH 64169 479
To better understand how COVID-19 may have disproportionately impacted these race/ethnicity groups, it would be useful to compare their case rates and death rates because rates account for the differences in population sizes among groups.
However, even these counts for cases and deaths clearly reveal disparities in the impact of COVID-19 among race/ethnicity groups. For example, the groups for Black, NH and Hispanic had relatively similar total case counts (~1.41M cases vs. ~1.22M cases, respectively); however, the death counts for Black, NH individuals (~21K deaths) were more than double those of Hispanic individuals (~9K deaths).
# How many deaths were for individuals who were Non-Hispanic multiple races or unknown race/ethnicity
deaths_overall <- sum(weekly_totals_overall$death_count_suppressed)
deaths_known_race_ethnicity <- covid |>
filter(age_group == "Overall" & sex == "Overall" & race_ethnicity_combined != "Overall")
deaths_unknown_race_ethnicity <- deaths_overall - sum(deaths_known_race_ethnicity$death_count_suppressed, na.rm = TRUE)
deaths_unknown_race_ethnicity
## [1] 15248
# What percentage of deaths were for individuals who were Non-Hispanic multiple races or unknown race/ethnicity
cat(round(deaths_unknown_race_ethnicity / deaths_overall * 100, digits = 1), "%")
## 8.4 %
Additionally, the difference between the overall death count and the sum of the death counts for the race/ethnicity breakout groups is 15,248 deaths (8.4% of the overall deaths). These deaths are not attributed to any breakout group: they represent individuals who either identified as multiple, Non-Hispanic races or had unknown data for race/ethnicity, along with any deaths in weekly breakout counts that were suppressed for having five or fewer deaths.
This initial exploration of the dataset raises a number of questions to potentially explore further, including but not limited to:
What is the relationship between cases and deaths over time? For example, what is the lag time between an increase in cases and a subsequent increase in deaths? (How might this pattern be used to anticipate and reduce the number of deaths?)
What disparities exist in how COVID-19 affected different demographic groups of people over time? For example, which groups were most affected or least affected? (What social factors might be responsible for these differences, and how might we reduce the resulting disparities?)
How might the patterns found in the Region 5 data (IL, IN, MI, MN, OH, WI) compare with data for other regions of the United States? For example, did some regions have higher disparities or lower disparities among demographic groups? (How might these comparisons be used to identify ways to lower the impact or disparities across all regions?)
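As one possible starting point for the first question, a cross-correlation can estimate the lag between two weekly series. The sketch below runs on synthetic data, not the CDC data. The three-week lag, the series themselves, and the 1% scaling are all assumptions built in so that the expected answer is known in advance.

```r
set.seed(42)

n_weeks  <- 150
true_lag <- 3   # assumed lag (in weeks) built into the synthetic data

# Synthetic weekly case counts (illustrative only, not CDC data)
cases_full <- 1000 + 400 * rnorm(n_weeks + true_lag)

# Align the series so that deaths in week t depend on cases in week t - 3
cases  <- cases_full[(true_lag + 1):(n_weeks + true_lag)]
deaths <- 0.01 * cases_full[1:n_weeks]

# ccf(x, y) estimates the correlation between x[t + k] and y[t] at each lag k
cc <- ccf(cases, deaths, lag.max = 10, plot = FALSE)

# The lag with the strongest correlation; negative k means cases lead deaths
best_lag <- as.numeric(cc$lag[which.max(cc$acf)])
best_lag
```

Real weekly counts are strongly autocorrelated, so the peak is usually much broader than in this toy example, and differencing or smoothing the series first may be needed before the estimated lag is meaningful.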
The data can be aggregated (grouped) by sex (females vs. males) and plotted as a multi-line graph to explore how their patterns in weekly deaths compare over time.
library(ggthemes)
# Keep the weekly Female and Male series; color = sex draws one line per group
covid |>
  filter(age_group == "Overall" & race_ethnicity_combined == "Overall" & sex != "Overall") |>
  ggplot() +
  geom_line(mapping = aes(x = end_of_week, y = death_count_suppressed, group = sex, color = sex), na.rm = TRUE) +
  theme_hc() +
  labs(title = "COVID-19 Deaths (March 7, 2020 - November 18, 2023)",
       subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
       x = "", y = "Death Count") +
  theme(plot.subtitle = element_text(colour = "darkgray"),
        legend.position = "bottom")
The graph shows the lines for weekly deaths by sex closely mirror each other over time; however, the number of deaths among males is typically higher, especially during peaks of deaths.
As seen in the previous graph, there were numerous peaks of increased weekly deaths. A histogram can be generated to better analyze the distribution of the overall weekly death counts.
median_death_count <- median(weekly_totals_overall$death_count_suppressed)
mean_death_count <- mean(weekly_totals_overall$death_count_suppressed)
weekly_totals_overall |>
  ggplot() +
  geom_histogram(mapping = aes(x = death_count_suppressed), binwidth = 100, color = 'white') +
  geom_vline(xintercept = median_death_count, color = 'orange') +
  geom_vline(xintercept = mean_death_count, color = 'red') +
  labs(title = "COVID-19 Weekly Death Counts (March 7, 2020 - November 18, 2023)",
       subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
       x = "Weekly Death Count", y = "Number of Weeks") +
  annotate("text", x = 120, y = 30, label = "Median", color = 'orange') +
  annotate("text", x = 1300, y = 30, label = "Average", color = 'red') +
  theme_classic() +
  theme(plot.subtitle = element_text(colour = "darkgray"))
The histogram shows that the distribution of weekly death counts is right-skewed: most weeks have death counts towards the “lower” end (though these still represent hundreds of deaths weekly), and a long tail extends to the right for weeks with much higher death counts (representing thousands of deaths weekly).
This dataset will be explored further in subsequent data dives.