This data dive focuses on time series modeling using a CDC dataset that contains weekly counts and rates of COVID-19 cases and deaths reported in Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, and Wisconsin) of the United States from March 7, 2020 through November 18, 2023.
To get started, we’ll load several R packages to assist with our analysis.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(xts)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
library(tsibble)
##
## Attaching package: 'tsibble'
##
## The following object is masked from 'package:zoo':
##
## index
##
## The following object is masked from 'package:lubridate':
##
## interval
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
theme_set(theme_minimal())
options(scipen = 6)
Next, let’s read in the dataset from CSV.
covid <- read_delim("./COVID_weekly_cases_deaths_region5.csv", delim = ",")
## Rows: 37867 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): end_of_week, jurisdiction, age_group, sex, race_ethnicity_combined
## dbl (4): case_count_suppressed, death_count_suppressed, case_crude_rate_supp...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
IMPORTANT: We’ll need to convert
end_of_week from a character format to a date format for
our time series modeling.
# convert date from char into date format
covid$end_of_week <- as.Date(covid$end_of_week, format="%m/%d/%y")
In time series modeling, a time-based variable is used as the explanatory variable for another variable, i.e., the response variable.
This dataset has weekly summary data for COVID-19 cases and deaths reported from March 7, 2020 through November 18, 2023. Each row has a date representing the end of a week, which will be used as the explanatory variable for the time series modleing.
Each week is actually represented by multiple rows in the dataset: each of the rows for a specific week contains data for a specific demographic subgroup.
Each week also has several “Overall” rows that summarize the weekly data by age group, sex, or race/ethnicity group, as well as an “Overall” row that summarizes the weekly data across all demographics. We’ll use this latter summary row as the data to represent a given week.
This dataset has several variables that could be used as a response variable for time series modeling. The data includes weekly counts of COVID-19 cases and weekly counts of COVID-19 deaths. In addition, there are weekly case rates and weekly death rates, which are derived from the counts and account for population size.
We’ll use the weekly COVID-19 case count as the response variable for time series modeling.
In order to have a single value for each unit of time, we’ll need to filter the dataset to focus on the “Overall” summary of weekly data across all demographics, which will provide a single row of data per week.
# filter data to focus on single summary row ("Overall" for each demographic group) per week and then select end of week (date) and case count
weekly_cases <- covid |>
filter(age_group == "Overall" & sex == "Overall" & race_ethnicity_combined == "Overall") |>
select(end_of_week, case_count = case_count_suppressed) |>
arrange(end_of_week)
Next, we’ll create a tsibble object of this time-based dataset.
# tsibble
weekly_cases_ts <- as_tsibble(weekly_cases, index = end_of_week)
Let’s confirm the interval of this tsibble, which should be one week (7 days).
interval(weekly_cases_ts)
## <interval[1]>
## [1] 7D
Let’s also confirm whether the tsibble has any gaps (implicitly
missing data). If so, we’ll have to fill the gaps with NA
data.
# confirm whether tsibble has gaps to fill
count_gaps(weekly_cases_ts)
## # A tibble: 0 × 3
## # ℹ 3 variables: .from <date>, .to <date>, .n <int>
The tsibble has has no gaps to fill, so let’s next use the tsibble to create an “xts” time series data frame for statistical modeling.
# xts
weekly_cases_xts <- xts(x = weekly_cases_ts$case_count, order.by = weekly_cases_ts$end_of_week)
weekly_cases_xts <- setNames(weekly_cases_xts, "cases")
Let’s visualize the time series data with a line plot of the raw data.
# plot weekly case counts
weekly_cases_ts |>
ggplot() +
geom_line(mapping = aes(x = end_of_week, y = case_count)) +
ylim(0, 600000) +
labs(title = "COVID-19 Weekly Cases (Mar 2020 - Nov 2023)",
subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
x = "", y = "Cases") +
theme(plot.subtitle = element_text(colour = "darkgray")) +
theme_hc()
The line plot shows a pattern of large peaks and troughs, punctuated by smaller peaks and valleys within the weekly data.
Next let’s visualize the data using a 28-day rolling average (i.e., 4 weeks ~ one month), which should smooth out some of the variation seen in the plot of the raw data.
# plot 4 week (28 day) rolling average of weekly case counts
weekly_cases_xts %>%
rollapply(width = 4, \(x) mean(x), fill = FALSE) %>%
ggplot(mapping = aes(x = Index, y = cases)) +
geom_line() +
ylim(0, 600000) +
labs(title = "COVID-19 Weekly Cases (Mar 2020 - Nov 2023)",
subtitle = "28-Day Rolling Average for Region 5 (IL, IN, MI, MN, OH, WI)",
x = "", y = "Cases") +
theme(plot.subtitle = element_text(colour = "darkgray")) +
theme_hc()
The rolling average plot maintains the overall pattern of the first plot, while smoothing out some of the minor variation from week to week.
Next let’s visualize the data using LOESS (Locally Estimated
Scatterplot Smoothing) with a relative span value set to
0.1.
weekly_cases_ts |>
ggplot(mapping = aes(x = end_of_week, y = case_count)) +
geom_point(size = 1, color = 'gray') +
geom_smooth(method = 'loess', span = 0.1, color = 'blue', se = FALSE, na.rm = TRUE) +
ylim(0, 600000) +
labs(title = "COVID-19 Weekly Cases (Mar 2020 - Nov 2023)",
subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
x = "", y = "Cases") +
theme(plot.subtitle = element_text(colour = "darkgray")) +
theme_hc()
## `geom_smooth()` using formula = 'y ~ x'
This produces an even smoother plot, though it is dependent on the
specific value chosen for the span (which was 0.1 for this
plot). A smaller span value will follow the points more closely
(retaining more of the variation in the weekly data), while a larger
span value will smooth out more of the variation. We’ll return to this
later in the data dive as a way to identify seasons within the data.
Let’s use linear regression to detect trends in the data, i.e., an overall increase or decrease over a time period.
For this, it may help to subset the data to identify different trends that may occur within a larger time frame. Let’s try dividing the date ranges based on the overall peaks and troughs visible in the LOESS plot above.
First, let’s plot the trend from March 2020 to November 2020, the time period from the “start” of the pandemic until the first large peak (about 9 months).
# trend from March 2020 to November 2020 (start of pandemic to first peak)
weekly_cases_ts |>
filter_index("2020-03" ~ "2020-11") |>
ggplot(mapping = aes(x = end_of_week, y = case_count)) +
geom_line() +
geom_smooth(method = 'lm', color = 'blue', se = FALSE, na.rm = TRUE) +
ylim(0, 600000) +
scale_x_date(labels = \(x) paste(month(x, label = TRUE), year(x))) +
labs(title = "COVID-19 Weekly Cases (Mar 2020 - Nov 2020)",
subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
x = "", y = "Cases") +
theme(plot.subtitle = element_text(colour = "darkgray")) +
theme_hc()
## `geom_smooth()` using formula = 'y ~ x'
Next, let’s plot the trend from November 2020 through June 2021, the time period from the first major peak to the first major trough (about 8 months).
# trend from November 2020 to June 2021 (first peak to first trough)
weekly_cases_ts |>
filter_index("2020-11" ~ "2021-06") |>
ggplot(mapping = aes(x = end_of_week, y = case_count)) +
geom_line() +
geom_smooth(method = 'lm', color = 'blue', se = FALSE, na.rm = TRUE) +
ylim(0, 600000) +
scale_x_date(labels = \(x) paste(month(x, label = TRUE), year(x))) +
labs(title = "COVID-19 Weekly Cases (Nov 2020 - Jun 2021)",
subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
x = "", y = "Cases") +
theme(plot.subtitle = element_text(colour = "darkgray")) +
theme_hc()
## `geom_smooth()` using formula = 'y ~ x'
This next plot shows the trend from June 2021 through January 2022, the time period from the first trough to the second peak (about 7 months).
# trend from June 2021 to January 2022 (first trough to second peak)
weekly_cases_ts |>
filter_index("2021-06" ~ "2022-01") |>
ggplot(mapping = aes(x = end_of_week, y = case_count)) +
geom_line() +
geom_smooth(method = 'lm', color = 'blue', se = FALSE, na.rm = TRUE) +
ylim(0, 600000) +
scale_x_date(labels = \(x) paste(month(x, label = TRUE), year(x))) +
labs(title = "COVID-19 Weekly Cases (Jun 2021 - Jan 2022)",
subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
x = "", y = "Cases") +
theme(plot.subtitle = element_text(colour = "darkgray")) +
theme_hc()
## `geom_smooth()` using formula = 'y ~ x'
Finally, let plot the trend from January 2022 through November 2023, the time period from the second major peak to the end of the dataset (about 23 months).
# trend from January 2022 to November 2023 (second peak to end of dataset)
weekly_cases_ts |>
filter_index("2022-01" ~ "2023-11-18") |>
ggplot(mapping = aes(x = end_of_week, y = case_count)) +
geom_line() +
geom_smooth(method = 'lm', color = 'blue', se = FALSE, na.rm = TRUE) +
ylim(0, 600000) +
scale_x_date(labels = \(x) paste(month(x, label = TRUE), year(x))) +
labs(title = "COVID-19 Weekly Cases (Jan 2022 - Nov 2023)",
subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
x = "", y = "Cases") +
theme(plot.subtitle = element_text(colour = "darkgray")) +
theme_hc()
## `geom_smooth()` using formula = 'y ~ x'
This last trend plot seem to generally fit the data, but it is likely overlooking some less extreme peaks and valleys that occur within the last 23 month period of the dataset.
A season is defined as periodicity within the time series. One way to
look for seasonality is to group the data by larger time periods. We’ll
try grouping the data by month and plotting the average case count for
each month. Then we can use LOESS smoothing to plot a curve for this
grouped data. Again, the LOESS curve will be impacted by the chosen span
value (which is 0.25 in the plot below).
# plot average monthly cases over time
weekly_cases_ts |>
index_by(year = floor_date(end_of_week, 'month')) |>
summarise(avg_cases = mean(case_count, na.rm = TRUE)) |>
ggplot(mapping = aes(x = year, y = avg_cases)) +
geom_line() +
geom_smooth(method = 'loess', span = 0.25, color = 'blue', se = FALSE, na.rm = TRUE) +
ylim(0, 600000) +
scale_x_date(breaks = "6 months", labels = \(x) paste(month(x, label = TRUE), year(x))) +
labs(title = "COVID-19 Average Monthly Cases (Mar 2020 - Nov 2023)",
subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
x = "", y = "Average Cases") +
theme(plot.subtitle = element_text(colour = "darkgray")) +
theme_hc()
## `geom_smooth()` using formula = 'y ~ x'
The plot above appears to show a season of approximately 12 months (~52 weeks) from peak to peak (or from trough to trough). For example, the period from June 2020 to June 2021 could represent the length of one season (trough to trough), or the period from December 2020 to December 2021 could represent the length of one season (peak to peak).
However, the seasons in the plot above vary in amplitude: the second season has a higher peak, and then the third season has a much reduced amplitude. Fortunately, since this is data for an infectious disease, we don’t want the seasons to continue indefinitely. There are reasons why the seasonality might diminish with time: e.g., people will have acquired some level of immunity through vaccinations or past infections.
It also seems reasonable that a season would exist for this data and that it would follow a yearly cycle: COVID-19 is an infectious disease spread from person-to-person. Since people’s activities tend to follow a yearly pattern (such as: start of school year, holiday gatherings, summer vacations, etc.), it would seem reasonable that COVID-19 cases would track with activities that are more likely to result in the spread of COVID-19 (such as indoor end-of-year holiday gatherings).
Autocorrelation can be used to detect seasons in time series data by calculating the amount that a variable (in this case, our time-based variable) correlates with itself at different lags (i.e., past time periods).
If a predictable season exists within the data, an ACF (autocorrelation function) plot would help identify it.
acf(weekly_cases_ts, lag.max = 100, ci = 0.95, na.action = na.exclude, main = "ACF for Weekly Case Counts")
The ACF plot does show peaks and troughs, though they vary in amplitude. The first positive ACF peak occurs around lag 58. Since each lag represents one week, this is approximately equal to one year (52 weeks).
This dataset will be explored further in subsequent data dives.