COVID-19 Weekly Cases and Deaths by Age, Race/Ethnicity, and Sex (Mar 2020 - Nov 2023)

This data dive focuses on time series modeling using a CDC dataset that contains weekly counts and rates of COVID-19 cases and deaths reported in Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, and Wisconsin) of the United States from March 7, 2020 through November 18, 2023.


Load Libraries and Dataset

To get started, we’ll load several R packages to assist with our analysis.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(xts)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################
## 
## Attaching package: 'xts'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
library(tsibble)
## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:zoo':
## 
##     index
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union
theme_set(theme_minimal())
options(scipen = 6)

Next, let’s read in the dataset from CSV.

covid <- read_delim("./COVID_weekly_cases_deaths_region5.csv", delim = ",")
## Rows: 37867 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): end_of_week, jurisdiction, age_group, sex, race_ethnicity_combined
## dbl (4): case_count_suppressed, death_count_suppressed, case_crude_rate_supp...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Convert Time-Based Data to Date Format

IMPORTANT: We’ll need to convert end_of_week from a character format to a date format for our time series modeling.

# convert date from char into date format
covid$end_of_week <- as.Date(covid$end_of_week, format="%m/%d/%y")

Time Series Modeling

In time series modeling, a time-based variable is used as the explanatory variable for another variable, i.e., the response variable.

Explanatory Variable = End of Week Date

This dataset has weekly summary data for COVID-19 cases and deaths reported from March 7, 2020 through November 18, 2023. Each row has a date representing the end of a week, which will be used as the explanatory variable for the time series modleing.

Each week is actually represented by multiple rows in the dataset: each of the rows for a specific week contains data for a specific demographic subgroup.

Each week also has several “Overall” rows that summarize the weekly data by age group, sex, or race/ethnicity group, as well as an “Overall” row that summarizes the weekly data across all demographics. We’ll use this latter summary row as the data to represent a given week.

Response Variable = Case Count

This dataset has several variables that could be used as a response variable for time series modeling. The data includes weekly counts of COVID-19 cases and weekly counts of COVID-19 deaths. In addition, there are weekly case rates and weekly death rates, which are derived from the counts and account for population size.

We’ll use the weekly COVID-19 case count as the response variable for time series modeling.

Create Dataset with Explanatory and Response Variables

In order to have a single value for each unit of time, we’ll need to filter the dataset to focus on the “Overall” summary of weekly data across all demographics, which will provide a single row of data per week.

# filter data to focus on single summary row ("Overall" for each demographic group) per week and then select end of week (date) and case count
weekly_cases <- covid |>
  filter(age_group == "Overall" & sex == "Overall" & race_ethnicity_combined == "Overall") |>
  select(end_of_week, case_count = case_count_suppressed) |>
  arrange(end_of_week)

Next, we’ll create a tsibble object of this time-based dataset.

# tsibble
weekly_cases_ts <- as_tsibble(weekly_cases, index = end_of_week)

Let’s confirm the interval of this tsibble, which should be one week (7 days).

interval(weekly_cases_ts)
## <interval[1]>
## [1] 7D

Let’s also confirm whether the tsibble has any gaps (implicitly missing data). If so, we’ll have to fill the gaps with NA data.

# confirm whether tsibble has gaps to fill
count_gaps(weekly_cases_ts)
## # A tibble: 0 × 3
## # ℹ 3 variables: .from <date>, .to <date>, .n <int>

The tsibble has has no gaps to fill, so let’s next use the tsibble to create an “xts” time series data frame for statistical modeling.

# xts
weekly_cases_xts <- xts(x = weekly_cases_ts$case_count, order.by = weekly_cases_ts$end_of_week)

weekly_cases_xts <- setNames(weekly_cases_xts, "cases")

Visualizing the Time Series

Let’s visualize the time series data with a line plot of the raw data.

# plot weekly case counts
weekly_cases_ts |>
  ggplot() +
  geom_line(mapping = aes(x = end_of_week, y = case_count)) +
  ylim(0, 600000) +
  labs(title = "COVID-19 Weekly Cases (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "", y = "Cases") +
  theme(plot.subtitle = element_text(colour = "darkgray")) +
  theme_hc()

The line plot shows a pattern of large peaks and troughs, punctuated by smaller peaks and valleys within the weekly data.

Rolling Average Plot

Next let’s visualize the data using a 28-day rolling average (i.e., 4 weeks ~ one month), which should smooth out some of the variation seen in the plot of the raw data.

# plot 4 week (28 day) rolling average of weekly case counts
weekly_cases_xts %>%
  rollapply(width = 4, \(x) mean(x), fill = FALSE) %>%
  ggplot(mapping = aes(x = Index, y = cases)) +
  geom_line() +
  ylim(0, 600000) +
  labs(title = "COVID-19 Weekly Cases (Mar 2020 - Nov 2023)",
       subtitle = "28-Day Rolling Average for Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "", y = "Cases") +
  theme(plot.subtitle = element_text(colour = "darkgray")) +
  theme_hc()

The rolling average plot maintains the overall pattern of the first plot, while smoothing out some of the minor variation from week to week.

LOESS Plot

Next let’s visualize the data using LOESS (Locally Estimated Scatterplot Smoothing) with a relative span value set to 0.1.

weekly_cases_ts |>
  ggplot(mapping = aes(x = end_of_week, y = case_count)) +
  geom_point(size = 1, color = 'gray') +
  geom_smooth(method = 'loess', span = 0.1, color = 'blue', se = FALSE, na.rm = TRUE) +
  ylim(0, 600000) +
  labs(title = "COVID-19 Weekly Cases (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "", y = "Cases") +
  theme(plot.subtitle = element_text(colour = "darkgray")) +
  theme_hc()
## `geom_smooth()` using formula = 'y ~ x'

This produces an even smoother plot, though it is dependent on the specific value chosen for the span (which was 0.1 for this plot). A smaller span value will follow the points more closely (retaining more of the variation in the weekly data), while a larger span value will smooth out more of the variation. We’ll return to this later in the data dive as a way to identify seasons within the data.

Seasons

A season is defined as periodicity within the time series. One way to look for seasonality is to group the data by larger time periods. We’ll try grouping the data by month and plotting the average case count for each month. Then we can use LOESS smoothing to plot a curve for this grouped data. Again, the LOESS curve will be impacted by the chosen span value (which is 0.25 in the plot below).

# plot average monthly cases over time
weekly_cases_ts |>
  index_by(year = floor_date(end_of_week, 'month')) |>
  summarise(avg_cases = mean(case_count, na.rm = TRUE)) |>
  ggplot(mapping = aes(x = year, y = avg_cases)) +
  geom_line() +
  geom_smooth(method = 'loess', span = 0.25, color = 'blue', se = FALSE, na.rm = TRUE) +
  ylim(0, 600000) +
  scale_x_date(breaks = "6 months", labels = \(x) paste(month(x, label = TRUE), year(x))) +
  labs(title = "COVID-19 Average Monthly Cases (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "", y = "Average Cases") +
  theme(plot.subtitle = element_text(colour = "darkgray")) +
  theme_hc()
## `geom_smooth()` using formula = 'y ~ x'

The plot above appears to show a season of approximately 12 months (~52 weeks) from peak to peak (or from trough to trough). For example, the period from June 2020 to June 2021 could represent the length of one season (trough to trough), or the period from December 2020 to December 2021 could represent the length of one season (peak to peak).

However, the seasons in the plot above vary in amplitude: the second season has a higher peak, and then the third season has a much reduced amplitude. Fortunately, since this is data for an infectious disease, we don’t want the seasons to continue indefinitely. There are reasons why the seasonality might diminish with time: e.g., people will have acquired some level of immunity through vaccinations or past infections.

It also seems reasonable that a season would exist for this data and that it would follow a yearly cycle: COVID-19 is an infectious disease spread from person-to-person. Since people’s activities tend to follow a yearly pattern (such as: start of school year, holiday gatherings, summer vacations, etc.), it would seem reasonable that COVID-19 cases would track with activities that are more likely to result in the spread of COVID-19 (such as indoor end-of-year holiday gatherings).

Autocorrelation

Autocorrelation can be used to detect seasons in time series data by calculating the amount that a variable (in this case, our time-based variable) correlates with itself at different lags (i.e., past time periods).

If a predictable season exists within the data, an ACF (autocorrelation function) plot would help identify it.

acf(weekly_cases_ts, lag.max = 100, ci = 0.95, na.action = na.exclude, main = "ACF for Weekly Case Counts")

The ACF plot does show peaks and troughs, though they vary in amplitude. The first positive ACF peak occurs around lag 58. Since each lag represents one week, this is approximately equal to one year (52 weeks).


This dataset will be explored further in subsequent data dives.